grunt> cat /opt/dataset/input.txt
keyword1 keyword2
keyword2 keyword4
keyword3 keyword1
keyword4 keyword4

grunt> A = LOAD '/opt/dataset/input.txt' using PigStorage('\n') as (line:chararray);
grunt> B = foreach A generate TOKENIZE((chararray)$0);
grunt> C = foreach B generate flatten($0) as word;
grunt> D = group C by word;
grunt> E = foreach D generate COUNT(C), group;

grunt> dump B;
({(keyword1),(keyword2)})
({(keyword2),(keyword4)})
({(keyword3),(keyword1)})
({(keyword4),(keyword4)})

grunt> dump C;
(keyword1)
(keyword2)
(keyword2)
(keyword4)
(keyword3)
(keyword1)
(keyword4)
(keyword4)

grunt> dump D;
(keyword1,{(keyword1),(keyword1)})
(keyword2,{(keyword2),(keyword2)})
(keyword3,{(keyword3)})
(keyword4,{(keyword4),(keyword4),(keyword4)})

grunt> dump E;
(2,keyword1)
(2,keyword2)
(1,keyword3)
(3,keyword4)

grunt> store E into './wordcount';
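To make the dataflow above easier to follow, here is a minimal Python sketch that mimics each relation (A through E) with plain in-memory data structures. It only illustrates the semantics and is not how Pig executes the script. The function name `word_count` and the `lines` list are assumptions made for this example, and whitespace splitting stands in for TOKENIZE.

```python
from collections import defaultdict


def word_count(lines):
    """Emulate relations B..E of the Pig script on an in-memory list of lines."""
    # B: one bag (list) of words per input line; str.split() stands in for TOKENIZE
    b = [line.split() for line in lines]
    # C: flatten the bags so that every word becomes its own row
    c = [word for bag in b for word in bag]
    # D: group the rows by word, keeping every occurrence in the group's bag
    d = defaultdict(list)
    for word in c:
        d[word].append(word)
    # E: (count, word) pairs, mirroring "generate COUNT(C), group";
    # sorting by word is only there to make the output order deterministic
    return sorted(((len(bag), word) for word, bag in d.items()),
                  key=lambda pair: pair[1])


# A: the four lines of /opt/dataset/input.txt, hard-coded for the sketch
lines = [
    "keyword1 keyword2",
    "keyword2 keyword4",
    "keyword3 keyword1",
    "keyword4 keyword4",
]
result = word_count(lines)
print(result)
```

The intermediate variables `b`, `c`, and `d` correspond to what `dump B`, `dump C`, and `dump D` show in the session.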
<pre code_snippet_id="327646" snippet_file_name="blog_20140505_2_6349649" name="code" class="java">TOKENIZE
Splits a string and outputs a bag of words.

Syntax
TOKENIZE(expression)

Terms
expression    An expression with data type chararray.

Usage
Use the TOKENIZE function to split a string of words (all words in a single tuple)
into a bag of words (each word in a single tuple). The following characters are
considered to be word separators: space, double quote ("), comma (,),
parentheses (()), star (*).

Example
In this example the strings in each row are split.

A = LOAD 'data' AS (f1:chararray);

DUMP A;
(Here is the first string.)
(Here is the second string.)
(Here is the third string.)

X = FOREACH A GENERATE TOKENIZE(f1);

DUMP X;
({(Here),(is),(the),(first),(string.)})
({(Here),(is),(the),(second),(string.)})
({(Here),(is),(the),(third),(string.)})</pre>
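The separator rule quoted above can be approximated in a few lines of Python. This is a rough sketch that assumes only the five separators named in the description (space, double quote, comma, parentheses, star). The names `SEPARATORS` and `tokenize` are made up for the example, and Pig's actual implementation may differ in details.

```python
import re

# Separators taken from the description above: space, double quote,
# comma, opening/closing parenthesis, star (an assumption for this sketch)
SEPARATORS = re.compile(r'[ ",()*]+')


def tokenize(text):
    """Split text on runs of separator characters and drop empty pieces."""
    return [token for token in SEPARATORS.split(text) if token]


tokens = tokenize("Here is the first string.")
print(tokens)
```

Note that the period is not in the separator list, which is why the last token in the documentation's example is `string.` rather than `string`.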
Time: 2025-01-19 13:18:14