Hadoop数据分析 (Data Analytics with Hadoop)

by Benjamin Bengfort, Jenny Kim
April 2018
Content level: Intermediate to advanced
229 pages
6h 19m
Chinese
Posts & Telecom Press
Content preview from Hadoop数据分析
"""
The main analysis routine of the Spark application
"""
# Load the stopwords from the dataset
with open('stopwords.txt', 'r') as words:
    stopwords = frozenset([
        word.strip() for word in words.read().split("\n")
    ])
# Broadcast the stopwords to the cluster
stopwords = sc.broadcast(stopwords)
# Phase one: tokenize and compute document frequencies
# Note: assumes a corpus of (docid, text) pairs
docfreq = corpus.flatMap(partial(tokenize, stopwords=stopwords))
docfreq = docfreq.reduceByKey(add)
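# Phase two: compute term frequencies, then perform a keyspace change
# Each record kv is ((docid, word), tf); re-key it by word
trmfreq = docfreq.map(lambda kv: (kv[0][1], (kv[0][0], kv[1], 1)))
trmfreq = trmfreq.reduceByKey(term_frequency)
# Each record kv is (word, (docid, tf, n)); re-key it by (word, docid)
trmfreq = trmfreq.map(
    lambda kv: ((kv[0], kv[1][0]), (kv[1][1], kv[1][2]))
)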
# Phase three: compute TF-IDF for each (word, document) pair
tfidfs ...
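
The key shapes in phases one and two are easier to follow on a tiny in-memory corpus. The following is a minimal sketch in plain Python, not the book's code: it runs without Spark and emulates flatMap plus reduceByKey(add) with a Counter. The tokenize body, the LocalBroadcast stand-in, the whitespace tokenization, and the sample corpus are all assumptions made for illustration, since the excerpt does not show them. The term_frequency reducer is also not shown in the excerpt, so the sketch stops after the first phase-two map.

```python
from collections import Counter
from functools import partial


class LocalBroadcast:
    """Hypothetical stand-in for a Spark broadcast variable (exposes .value only)."""

    def __init__(self, value):
        self.value = value


def tokenize(document, stopwords):
    """Assumed helper: yield ((docid, word), 1) for every non-stopword token."""
    docid, text = document
    for word in text.lower().split():
        if word not in stopwords.value:
            yield (docid, word), 1


# Hypothetical corpus of (docid, text) pairs.
corpus = [
    ("d1", "cat sat on the cat mat"),
    ("d2", "the dog sat"),
]
stopwords = LocalBroadcast(frozenset({"the", "on"}))

# Phase one, emulated locally: flatMap(tokenize) followed by reduceByKey(add).
tokenizer = partial(tokenize, stopwords=stopwords)
docfreq = Counter()
for document in corpus:
    for key, count in tokenizer(document):
        docfreq[key] += count

# Phase two, first map only: move the word into the key position.
# ((docid, word), tf) becomes (word, (docid, tf, 1)).
trmfreq = [(key[1], (key[0], tf, 1)) for key, tf in docfreq.items()]
```

After the re-keying, every record for the same word shares a key, which is what allows a per-word reduction across documents in the next step.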

Publisher Resources

ISBN: 9787115479648