Advanced Analytics with Spark, 2nd Edition (Spark高级数据分析(第2版))

by Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
June 2018
Content level: Beginner to intermediate
246 pages
6h 57m
Chinese
Posts & Telecom Press
Each topic list is sorted before the pairs are generated:
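// Generate every unordered pair of topics within each topic list.
// Sorting first gives each pair a single canonical order.
val topicPairs = medline.flatMap(t => {
  t.sorted.combinations(2)
}).toDF("pairs")
topicPairs.createOrReplaceTempView("topic_pairs")
// Count how many times each pair occurs.
val cooccurs = spark.sql("""
  SELECT pairs, COUNT(*) cnt
  FROM topic_pairs
  GROUP BY pairs""")
cooccurs.cache()
cooccurs.count()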
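To see why the sort matters, here is a minimal sketch of the same pairing and counting logic on plain Scala collections, with no Spark involved. The `sample` data and the names `pairs` and `counts` are hypothetical and chosen only for illustration; the snippet can be pasted into the Scala REPL or spark-shell.

```scala
// Hypothetical sample: each inner Seq stands for one topic list.
val sample: Seq[Seq[String]] = Seq(
  Seq("Population Dynamics", "Demography"),
  Seq("Demography", "Population Dynamics", "Family Planning"),
  Seq("Family Planning", "Demography")
)

// Sorting each list first means a given pair always comes out in the same
// order, so (A, B) and (B, A) are never counted as two different pairs.
val pairs: Seq[Seq[String]] = sample.flatMap(t => t.sorted.combinations(2))

// Local analogue of GROUP BY pairs with COUNT(*).
val counts: Map[Seq[String], Int] =
  pairs.groupBy(identity).map { case (pair, occurrences) => (pair, occurrences.size) }

// Print the pairs, most frequent first.
counts.toSeq.sortBy { case (_, cnt) => -cnt }.foreach(println)
```

Without the call to `sorted`, the first two lists would yield the same two topics in opposite orders, and the grouping step would treat them as distinct keys.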
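Our data contains 14,548 topics, so there are 14,548 × 14,547 / 2 = 105,814,878 possible unordered co-occurrence pairs. However, the count of co-occurrence pairs shows that only 213,745 pairs actually appear in the data set, which is about 0.2% of the possible number. If we look at the most frequently occurring co-occurrence pairs in the data, we get the following: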
cooccurs.createOrReplaceTempView("cooccurs")
spark.sql("""
  SELECT pairs, cnt
  FROM cooccurs
  ORDER BY cnt DESC
  LIMIT 10""").collect().foreach(println)
...
[WrappedArray(Demography, Population Dynamics),288]
[WrappedArray(Government ...
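The pair counts quoted in the text above can be checked with a few lines of arithmetic. This is a small sketch that uses only the figures given in the text; the variable names are hypothetical.

```scala
// Figures taken from the text: number of topics and number of observed pairs.
val numTopics = 14548L
val observedPairs = 213745L

// Number of unordered pairs among n items: n * (n - 1) / 2.
val possiblePairs = numTopics * (numTopics - 1) / 2

// Share of the possible pairs that actually occur in the data.
val fraction = observedPairs.toDouble / possiblePairs

println(f"$possiblePairs%d possible pairs, ${fraction * 100}%.2f%% observed")
```

Using `Long` values keeps the intermediate product `numTopics * (numTopics - 1)` well clear of any `Int` overflow concerns for larger topic counts.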
Publisher Resources

ISBN: 9787115482525