Technology, November 14, 2022

Note: by the time reduce runs, the shuffle has already taken place.
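Why the shuffle matters can be seen with `itertools.groupby`, which only merges adjacent items that share a key. The following is a minimal sketch rather than Hadoop itself: the `pairs` list is made-up sample data, `count_by_key` is a hypothetical helper, and a plain `sorted` call stands in for the shuffle-and-sort phase.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output: (word, count) pairs in arrival order.
pairs = [("foo", 1), ("bar", 1), ("foo", 1), ("bar", 1), ("foo", 1)]


def count_by_key(items):
    """Sum the counts of each run of adjacent items with the same key."""
    return [(key, sum(count for _, count in group))
            for key, group in groupby(items, key=itemgetter(0))]


# Without sorting, every change of key starts a new group.
unsorted_result = count_by_key(pairs)

# Sorting by key first (the stand-in for the shuffle) makes equal keys adjacent.
sorted_result = count_by_key(sorted(pairs, key=itemgetter(0)))

print(unsorted_result)
print(sorted_result)
```

On the unsorted list each word shows up several times with a partial count; only the sorted list yields one total per word.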

Understanding mapper-reducer through groupby

mapper.py

#!/usr/bin/env python
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (parenthesized so it runs on both Python 2 and Python 3)
        print('%s\t%s' % (word, 1))

reducer.py

#!/usr/bin/env python
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (test current_word itself: after a discarded final line, word no longer
# matches it, and on empty input both are still None)
if current_word:
    print('%s\t%s' % (current_word, current_count))
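The whole flow can be tried without a cluster. The sketch below is an in-process imitation under stated assumptions, not Hadoop Streaming itself: `map_words`, `reduce_sorted`, and the sample `lines` are hypothetical names and data, and `sorted()` stands in for the sort that Hadoop applies to map output before the reducer sees it.

```python
def map_words(lines):
    """Mapper step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


def reduce_sorted(pairs):
    """Reducer step: sum counts over runs of the same word.

    Relies on `pairs` being sorted by word, as reducer.py does.
    """
    current_word = None
    current_count = 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                yield current_word, current_count
            current_word, current_count = word, count
    # Emit the final run, if there was any input at all.
    if current_word is not None:
        yield current_word, current_count


# Hypothetical input lines, standing in for the job's input file.
lines = ["foo foo quux", "labs foo bar quux"]

# sorted() plays the role of the shuffle-and-sort between the two steps.
result = list(reduce_sorted(sorted(map_words(lines))))
print(result)
```

Dropping the `sorted()` call is a quick way to see the running-counter logic break: the same word is then reported more than once.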

Improved Mapper and Reducer code: using Python iterators and generators

mapper.py

#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""

import sys


def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print('%s%s%d' % (word, separator, 1))


if __name__ == "__main__":
    main()

reducer.py

#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys


def read_mapper_output(file, separator='\t'):
    for line in file:
        yield line.rstrip().split(separator, 1)


def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for _, count in group)
            print("%s%s%d" % (current_word, separator, total_count))
        except ValueError:
            # a count was not a number, so silently discard
            # the whole group for this word
            pass


if __name__ == "__main__":
    main()
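One behavioral difference from the first reducer is worth checking by hand: there a bad count drops only the offending line, whereas here the `ValueError` escapes from `sum`, so the entire group for that word is skipped. The sketch below illustrates this under assumptions: `reduce_groups` is a hypothetical wrapper that collects results in a list instead of printing, and `lines` is made-up, already sorted mapper output.

```python
from itertools import groupby
from operator import itemgetter


def reduce_groups(lines, separator="\t"):
    """Same grouping logic as the reducer above, returning a list of totals."""
    data = (line.rstrip().split(separator, 1) for line in lines)
    results = []
    for word, group in groupby(data, itemgetter(0)):
        try:
            results.append((word, sum(int(count) for _, count in group)))
        except ValueError:
            # One unparsable count skips the whole group for this word.
            pass
    return results


# Hypothetical sorted mapper output with one corrupt count for "foo".
lines = ["bar\t1\n", "foo\t1\n", "foo\tx\n", "foo\t1\n", "quux\t1\n"]
result = reduce_groups(lines)
print(result)
```

With this input "foo" disappears from the result entirely, while the first reducer would still report it with a count of 2.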