Hadoop数据分析 (Data Analytics with Hadoop)
by Benjamin Bengfort, Jenny Kim
April 2018
Content level: Intermediate to advanced
229 pages
6h 19m
Chinese
Posts & Telecom Press
Content preview from Hadoop数据分析
An Operating System for Big Data
Figure 2-7: Data flow of a word count job with two mappers and two reducers, executed on a cluster (the figure shows input Block 2, "The cat in the hat ran fast.", flowing through the mappers to the reducers)
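This data is passed to the shuffle and sort phases, where the keys (words) are grouped and sorted and then sent to the appropriate reducer. Each reducer receives input with a word as the key and a list of 1s as the value. To obtain the count, it simply sums these 1s and emits the word as the key and the count as the value. The input and output data for the example are shown below: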
# Input to the WordCount reducers
# This data is computed by the shuffle and sort phases
(".", [1, 1])
("cat", [1, 1])
("fast", [1, 1])
("hat", [1, 1])
("in", [1])
("no", [1])
("ran", [1])
("the", [1])
("wears", [1])
("The", [1, 1])
# Output of all WordCount reducers
(".", 2)
("cat", 2)
("fast", 2)
("hat", 2)
("in", 1)
("no", 1)
("ran", 1)
("the", 1)
("wears", 1)
("The", 2)
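The map, shuffle and sort, and reduce steps described above can be sketched in a few lines of plain Python. This is only an in-memory illustration under stated assumptions, not the book's implementation and not how Hadoop runs a job: the function names `mapper`, `shuffle_and_sort`, and `reducer` are hypothetical, the tokenization rule (punctuation as its own token) is assumed from the "." key in the listing, and the first input sentence is a made-up stand-in for Block 1, which this preview does not show.

```python
import re
from collections import defaultdict


def mapper(block):
    """Emit (token, 1) for every token; punctuation is treated as its own token."""
    for token in re.findall(r"\w+|[^\w\s]", block):
        yield token, 1


def shuffle_and_sort(pairs):
    """Group the values by key and return the groups sorted by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Sum the list of 1s for a word and emit (word, count)."""
    return key, sum(values)


blocks = [
    "The fast cat wears no hat.",    # hypothetical Block 1 (assumption, not in the preview)
    "The cat in the hat ran fast.",  # Block 2, as shown in the figure
]

# Map phase: every block is processed independently.
mapped = [pair for block in blocks for pair in mapper(block)]

# Shuffle and sort phase: values are grouped per word.
grouped = shuffle_and_sort(mapped)

# Reduce phase: one (word, count) pair per key.
counts = dict(reducer(key, values) for key, values in grouped)

for word, count in counts.items():
    print((word, count))
```

Because this sketch sorts keys with Python's default string ordering, the order of the printed pairs may differ from the listing above; the counts per word are what matter.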
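This algorithm looks simple, but slightly more complex implementations of it are commonly used in text processing. Imagine how you would compute the most frequently occurring words in The New York Times or the Google Books corpus: that would certainly require some kind of big data technique ...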
Publisher Resources

ISBN: 9787115479648