"dropoffX != 0 and dropoffY != 0 and pickupX != 0 and pickupY != 0"
).cache()
Rerunning the analysis on the taxiDone dataset now gives the following results:
taxiDone.
groupBy(boroughUDF($"dropoffX", $"dropoffY")).
count().
show()
...
+-----------------------+--------+
|UDF(dropoffX, dropoffY)|   count|
+-----------------------+--------+
|                 Queens|  670912|
|                     NA|   62778|
|               Brooklyn|  714659|
|          Staten Island|    3333|
|              Manhattan|12971314|
|                  Bronx|   67333|
+-----------------------+--------+
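After filtering out records whose pickup or dropoff coordinates are zero, the counts for the five boroughs drop only slightly, but most of the NA records are removed. The number of remaining records whose dropoff point falls outside the city now looks reasonable.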
8.5 Sessionization in Spark
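One of the goals stated earlier was to study the relationship between the borough in which a taxi drops off its passenger and how long the taxi then waits for its next fare. The taxiDone dataset now contains every trip for every taxi driver, but these records are distributed ...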
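The goal described above, measuring the gap between a dropoff and the same driver's next pickup, can be illustrated with a minimal sketch. It uses plain Scala collections rather than Spark and is not the implementation developed in this chapter. The `Trip` case class, its field names, the use of seconds as the time unit, and the sample values are all assumptions made up for illustration.

```scala
// Hypothetical, simplified trip record; the field names are illustrative assumptions.
final case class Trip(
  license: String,        // driver identifier
  pickupTime: Long,       // pickup time in seconds (assumed unit)
  dropoffTime: Long,      // dropoff time in seconds (assumed unit)
  dropoffBorough: String  // borough where the passenger was dropped off
)

object WaitTimeSketch {
  // For each driver, sort the trips by pickup time and pair each trip with
  // the one that follows it. Each result is (dropoff borough of the earlier
  // trip, seconds until the same driver's next pickup).
  def waitTimes(trips: Seq[Trip]): Seq[(String, Long)] =
    trips.groupBy(_.license).values.toSeq.flatMap { driverTrips =>
      val sorted = driverTrips.sortBy(_.pickupTime)
      sorted.zip(sorted.drop(1)).map { case (prev, next) =>
        (prev.dropoffBorough, next.pickupTime - prev.dropoffTime)
      }
    }

  // Made-up sample data: driver "A" has three trips, driver "B" has one.
  val sample: Seq[Trip] = Seq(
    Trip("A", 0L, 600L, "Manhattan"),
    Trip("A", 900L, 1500L, "Queens"),
    Trip("B", 100L, 400L, "Brooklyn"),
    Trip("A", 2100L, 2400L, "Manhattan")
  )
}
```

A driver with a single trip contributes no wait time, because there is no following pickup to compare against. On a cluster the same per-driver grouping and ordering has to be done across partitions, which is why the way the records are distributed matters.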