Hadoop数据分析 (Data Analytics with Hadoop)

by Benjamin Bengfort, Jenny Kim
April 2018
Content level: Intermediate to advanced
229 pages
6h 19m
Chinese
Posts & Telecom Press
Content preview from Hadoop数据分析
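def reduce(self):
    for key, values in self:
        stripe = Counter()
        # Sum all the counters for this key
        for value in values:
            # items() works on both Python 2 and 3 (iteritems() is Python 2 only)
            for token, count in value.items():
                # Accumulate each token's count into the stripe
                stripe[token] += count
        self.emit(key, stripe)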
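The stripe mapper and reducer are somewhat more complex. The mapper needs a
doubly nested loop over all the tokens and must ensure that a term is not
counted against itself. The built-in enumerate function lets us track each
term's index in both loops, so we skip identical indices rather than identical
terms (if a term is repeated in the text, it genuinely co-occurs with itself).
The Counter from the collections library is a useful data structure: it is
essentially a dictionary whose values default to int (a missing key counts as 0).
The reducer then sums each element of the dictionaries, computing the total
across all the counters emitted by the mappers. Although the input is the same,
the output is now more compact: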
run, ((run, 6), (see, 3), (spot, 6))
see, ((run, 3), (spot, 2))
spot, ((run, 6), (see, 2), (spot, 1))
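The mapper logic described above is not shown in this excerpt, so the following is only a minimal sketch of how it might look, paired with a reducer equivalent to the listing above. The function names stripes_map and stripes_reduce, the per-line co-occurrence window, and the two sample lines are assumptions made for illustration; they are not the book's code or dataset.

```python
from collections import Counter, defaultdict


def stripes_map(tokens):
    """Yield one (term, stripe) pair per token position (hypothetical helper)."""
    for i, term in enumerate(tokens):
        stripe = Counter()
        for j, other in enumerate(tokens):
            # Skip the same position, not the same word: a repeated
            # word genuinely co-occurs with its other occurrence.
            if i != j:
                stripe[other] += 1
        yield term, stripe


def stripes_reduce(pairs):
    """Sum all stripes emitted for each term (hypothetical helper)."""
    totals = defaultdict(Counter)
    for term, stripe in pairs:
        for token, count in stripe.items():
            totals[term][token] += count
    return dict(totals)


# Assumed sample input: each line is treated as one co-occurrence window.
lines = ["see spot run", "run spot run"]
pairs = [pair for line in lines for pair in stripes_map(line.split())]
totals = stripes_reduce(pairs)

for term in sorted(totals):
    print(term, sorted(totals[term].items()))
```

Because the inner loop compares indices rather than words, the repeated "run" in the second sample line contributes to the stripe for "run" itself, which is the behavior the text calls out.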
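The stripe approach is not only more compact in its representation, it also
generates fewer and simpler intermediate keys, which reduces the work of
sorting and shuffling the data, among other things. However, the stripe objects
are larger, and cost more in both processing time and serialization ...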
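To make the trade-off concrete, here is a small sketch that counts the intermediate records and distinct keys each approach would emit for a single line. The helper names pairs_records and stripes_records and the sample line are hypothetical, and the counts say nothing about real cluster performance.

```python
from collections import Counter


def pairs_records(tokens):
    """Pairs approach: one ((term, other), 1) record per co-occurrence."""
    return [((a, b), 1)
            for i, a in enumerate(tokens)
            for j, b in enumerate(tokens) if i != j]


def stripes_records(tokens):
    """Stripes approach: one (term, Counter) record per token position."""
    return [(a, Counter(b for j, b in enumerate(tokens) if i != j))
            for i, a in enumerate(tokens)]


tokens = "see spot run spot run".split()  # assumed sample line
pairs = pairs_records(tokens)
stripes = stripes_records(tokens)

# Number of emitted records and number of distinct intermediate keys.
pair_keys = {key for key, _ in pairs}
stripe_keys = {key for key, _ in stripes}
print("pairs:  ", len(pairs), "records,", len(pair_keys), "distinct keys")
print("stripes:", len(stripes), "records,", len(stripe_keys), "distinct keys")
```

On this sample the stripes approach emits fewer records with simpler, single-word keys, while each record carries a whole Counter as its value, which is where the extra processing and serialization cost comes from.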

Publisher Resources

ISBN: 9787115479648