Spark高级数据分析(第2版) (Advanced Analytics with Spark, 2nd Edition)

by Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills
June 2018
Content level: Beginner to intermediate
246 pages
6h 57m
Chinese
Posts & Telecom Press
Geospatial and Temporal Data Analysis on New York City Taxi Trip Data
val taxiDone = taxiClean.where(
  "dropoffX != 0 and dropoffY != 0 and pickupX != 0 and pickupY != 0"
).cache()
Rerunning the borough analysis on the taxiDone Dataset gives the following result:
taxiDone.
  groupBy(boroughUDF($"dropoffX", $"dropoffY")).
  count().
  show()
...
+-----------------------+--------+
|UDF(dropoffX, dropoffY)|   count|
+-----------------------+--------+
|                 Queens|  670912|
|                     NA|   62778|
|               Brooklyn|  714659|
|          Staten Island|    3333|
|              Manhattan|12971314|
|                  Bronx|   67333|
+-----------------------+--------+
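After filtering out records whose pickup or dropoff coordinates are zero, the counts for the five boroughs drop only slightly, but most of the NA records are removed. The number of remaining trips that ended outside the five boroughs now looks much more reasonable.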
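As a small illustration of what the zero-coordinate filter does, here is a sketch on plain Scala collections rather than on a Spark Dataset. The `Trip` case class, the `hasValidCoordinates` helper, and the sample values are all assumptions made up for this example; only the four coordinate column names come from the filter expression above.

```scala
object ZeroFilterSketch {
  // Hypothetical, simplified record; field names mirror the columns used in the filter.
  final case class Trip(
    license: String,
    pickupX: Double, pickupY: Double,
    dropoffX: Double, dropoffY: Double
  )

  // True only when none of the four coordinates is the 0.0 placeholder value.
  def hasValidCoordinates(t: Trip): Boolean =
    t.pickupX != 0.0 && t.pickupY != 0.0 && t.dropoffX != 0.0 && t.dropoffY != 0.0

  // Made-up sample data: one complete trip, one with a missing pickup, one with a missing dropoff.
  val sample: Seq[Trip] = Seq(
    Trip("a", -73.98, 40.75, -73.95, 40.78),
    Trip("b", 0.0, 0.0, -73.95, 40.78),
    Trip("c", -73.98, 40.75, 0.0, 0.0)
  )

  // Same predicate as the where(...) clause, applied in memory.
  val kept: Seq[Trip] = sample.filter(hasValidCoordinates)
}
```

Only trip "a" survives here, since a trip is kept only if both endpoints have nonzero coordinates.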
8.5 Sessionization in Spark
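One of the goals stated earlier was to study the relationship between the borough in which a taxi drops off its passengers and the time it then waits for its next fare. The taxiDone Dataset now contains every trip for every taxi driver, but these records are distributed ...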
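To make that goal concrete, the following is a minimal sketch of the wait-time computation on plain Scala collections. It is not the Spark implementation and should not be read as the book's method: the `Ride` case class, its field names, the `waitTimes` helper, and the use of epoch milliseconds are all assumptions introduced for illustration. The idea is to group trips by driver, sort each driver's trips by pickup time, and pair every trip with the one that follows it.

```scala
object WaitTimeSketch {
  // Hypothetical, simplified trip record; times are epoch milliseconds.
  final case class Ride(
    license: String,
    pickupTime: Long,
    dropoffTime: Long,
    dropoffBorough: String
  )

  // For each driver, sort rides by pickup time and pair each ride with the next one.
  // Returns (borough where the earlier ride ended, wait in ms until the next pickup).
  def waitTimes(rides: Seq[Ride]): Seq[(String, Long)] =
    rides.groupBy(_.license).values.toSeq.flatMap { driverRides =>
      val sorted = driverRides.sortBy(_.pickupTime)
      sorted.zip(sorted.drop(1)).map { case (prev, next) =>
        (prev.dropoffBorough, next.pickupTime - prev.dropoffTime)
      }
    }

  // Made-up sample: two drivers with two rides each, deliberately out of order.
  val sample: Seq[Ride] = Seq(
    Ride("d1", 1000L, 2000L, "Manhattan"),
    Ride("d2", 1500L, 2500L, "Queens"),
    Ride("d1", 5000L, 6000L, "Brooklyn"),
    Ride("d2", 2600L, 3000L, "Bronx")
  )

  val waits: Seq[(String, Long)] = waitTimes(sample)
}
```

The last ride of each driver has no successor, so it contributes no wait time. On a cluster, the same logic requires all of a driver's trips to be brought together and ordered before consecutive pairs can be compared.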
Publisher Resources

ISBN: 9787115482525