利用hadoop mapreduce 做数据排序

　　我们的需求是想统计一个文件中用IK分词后每个词出现的次数，
然后按照出现的次数降序排列。也就是高频词统计。

　　由于hadoop在reduce之后就不能对结果做什么了，
所以只能分为两个job完成，第一个job统计次数，第二个job对第一个job的结果排序。第一个job的就是hadoop最简单的例子countwords，
我要说的是用hadoop对结果排序。假设第一个job的结果输出
如下：

　　part-r-0000文件内容：

　　a&">nbsp; 5

　　b 4

　　c 74

　　d 78

　　e 1

　　r 64

　　f 4

　　要做的就是按照每个词出现的次数降序排列。

　　**********************************分割线*****************************************
首先可能会出现这样的问题：

　　1.可能上一个job为多个reduce，也就是会产生多个结果文件，因为一个reduce就会生成一个结果文件，结果存放在上一个job输出目录下类似part-r-00的文件里。

　　2.需要排序的文件内容很大，所以需要考虑多个reduce的情况。

　　*********************************分割线*******************************怎么去设计mapreduce

　　1.在map阶段按照行读取文本，然后调用map方法时把上一个job的结果颠倒，也就是map后结果应该是这样的

　　5 a

　　4 b

　　74 c

　　................

　　.........................

　　4 f2.然后map后，hadoop会对结果进行分组，这时结果就会变成

　　(5：a)

　　(4:b,f)

　　(74:c)

　　3.然后按照reduce数目的
大小自定义分区函数，让结果形成多个区间，比如我
认为大于50的应该在一个区间，一共3个reduce，
那么最后的数据应该是三个区间，大于50的直接分到第一个分区0，25到50之间的分到第二个分区1，小于25的分到第三个分区2.因为分区数和reduce数是相同的，所以不同的分区对应不同的reduce，因为分区是从0开始的，分区是0的会分到第一个reduce处理,分区是1的会分到第2个reduce处理，依次类推。并且reduce对应着输出文件，所以，第一个reduce生成的文件就会是part-r-0000，第二个reduce对应的生成文件就会是part-r-0001，依次类推，所以reduce处理时只需要把key和value再倒过来直接输出。这样最后就会让形成数目最大的字符串就会在第一个生成文件里，排好的序就会文件命的顺序。代码如下：

　　*******************************分割线*****************************************map：

　　/**

　　* 把上一个mapreduce的结果的key和value颠倒，调到后就可以按照key排序了。

　　*

　　* @author zhangdonghao

　　*

　　*/

　　public class SortIntValueMapper extends

　　Mapper<LongWritable, Text, IntWritable, Text> {

　　private final static IntWritable wordCount = new IntWritable(1);

　　private Text word = new Text();

　　public SortIntValueMapper() {

　　super();

　　}

　　@Override

　　public void map(LongWritable key, Text value, Context context)

　　throws IOException, InterruptedException {

　　StringTokenizer
tokenizer = new StringTokenizer(value.toString());

　　while (tokenizer.hasMoreTokens()) {

　　word.set(tokenizer.nextToken().trim());

　　wordCount.set(Integer.valueOf(tokenizer.nextToken().trim()));

　　context.write(wordCount, word);

　　}

　　}

　　}reudce:

　　/**

　　* 把key和value颠倒过来输出

　　* @author zhangdonghao

　　*

　　*/

　　public class SortIntValueReduce extends

　　Reducer<IntWritable, Text, Text, IntWritable> {

　　private Text result = new Text();

　　@Override

　　public void reduce(IntWritable key, Iterable<Text> values, Context context)

　　throws IOException, InterruptedException {

　　for (Text val : values) {

　　result.set(val.toString());

　　context.write(result, key);

　　}

　　}

　　}Partitioner:

　　/**

　　* 按照key的大小来划分区间，当然，key 是 int值

　　*

　　* @author zhangdonghao

　　*

　　*/

　　public class KeySectionPartitioner<K, V> extends Partitioner<K, V> {

　　public KeySectionPartitioner() {

　　}

　　@Override

　　public int getPartition(K key, V value, int numReduceTasks) {

　　/**

　　* int值的hashcode还是自己本身的数值

　　*/

　　//这里我认为大于maxValue的就应该在第一个分区

　　int maxValue = 50;

　　int keySection = 0;

　　// 只有传过来的key值大于maxValue 并且numReduceTasks比如大于1个才需要分区，否则直接返回0

　　if (numReduceTasks > 1 && key.hashCode() < maxValue) {

　　int sectionValue = maxValue / (numReduceTasks - 1);

　　int count = 0;

　　while ((key.hashCode() - sectionValue * count) > sectionValue) {

　　count++;

　　}

　　keySection = numReduceTasks - 1 - count;

　　}

　　return keySection;

　　}

　　}Comparator：

　　/**

　　* int的key按照降序排列

　　*

　　* @author zhangdonghao

　　*

　　*/

　　public class IntKeyAscComparator extends WritableComparator {

　　protected IntKeyAscComparator() {

　　super(IntWritable.class, true);

　　}

　　@Override

　　public int compare(WritableComparable a, WritableComparable b) {

　　return -super.compare(a, b);

　　}

　　}job的关键设置：

　　/**

　　* 这里是map输出的key和value类型

　　*/

　　job.setOutputKeyClass(IntWritable.class);

　　job.setOutputValueClass(Text.class);

　　job.setMapperClass(SortIntValueMapper.class);

　　// job.setCombinerClass(WordCountReduce.class);

　　job.setReducerClass(SortIntValueReduce.class);

　　// key按照降序排列

　　job.setSortComparatorClass(IntKeyAscComparator.class);

　　job.setPartitionerClass(KeySectionPartitioner.class);

　　job.setInputFormatClass(TextInputFormat.class);

　　job.setOutputFormatClass(TextOutputFormat.class);

　　/**

　　*这里可以放输入目录数组，也就是可以把上一个job所
有的结果都放进去

　　**/

　　FileInputFormat.setInputPaths(job, inputPath);

　　FileOutputFormat.setOutputPath(job,outputPath);大概就是这样，亲测可用。(^__^)

时间： 2024-09-15 06:24:15

利用hadoop mapreduce 做数据排序

利用hadoop mapreduce 做数据排序的相关文章

用Excel做数据排序地常用办法与灵活技术

Hadoop MapReduce：数据科学家探索之路

Hadoop专业解决方案-第3章：MapReduce处理数据

MapReduce初级案例——数据排序

如何利用Apache Sqoop在DB2与Hadoop之间传递数据

hadoop mapreduce 数据分析丢数据

hadoop-求救高手，Mapreduce导入数据到Hadoop报ClassNotFoundException

《Hadoop实战手册》一1.11 利用Flume加载数据到HDFS中

《Hadoop MapReduce性能优化》一2.1　研究Hadoop参数