Hadoop数据分析 (Data Analytics with Hadoop)
by Benjamin Bengfort, Jenny Kim
April 2018
Content level: Intermediate to advanced
229 pages
6h 19m
Chinese
Posts & Telecom Press
Distributed Analysis and Patterns
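Consider a job that sorts orders by the number of products in each order and by date. It uses all of the keyspace transformation methods covered so far: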
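# Load the orders into an RDD and parse the CSV
# (sc, split, and parse_date are assumed to be defined earlier)
orders = sc.textFile("orders.csv").map(split)
# Assign keys: (orderid, customerid, date), products
orders = orders.map(lambda r: ((r[0], r[1], r[2]), r[3:]))
# Compute the order size and reduce the key to (orderid, date)
orders = orders.map(lambda kv: ((kv[0][0], parse_date(kv[0][2])), len(kv[1])))
# Swap key and value so that the sort key is (size, date)
orders = orders.map(lambda kv: ((kv[1], kv[0][1]), kv[0][0]))
# Sort the orders by key, largest first
orders = orders.sortByKey(ascending=False)
# Swap key and value again so that the order ID is the key once more
orders = orders.map(lambda kv: (kv[1], kv[0]))
# Get the top 10 order IDs by order size and date
print(orders.take(10))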
This example may be somewhat verbose for the task at hand, but it does demonstrate each of the following types of transformation.
(1) As discussed in Chapter 4, the data set is first loaded from a CSV file using the split method.
(2)
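The sequence of key transformations in the sorting job can also be traced without a Spark cluster. The following is a minimal sketch in plain Python that mirrors each map and sort step on a small in-memory list. The sample rows, the date format, and the bodies of `split` and `parse_date` are assumptions made for illustration; they stand in for helpers and data that are not shown in this excerpt.

```python
from datetime import datetime

# Hypothetical sample rows: orderid, customerid, date, then one column per product
rows = [
    "1,100,2014-03-01,apple,pear",
    "2,101,2014-03-02,milk",
    "3,100,2014-03-03,bread,eggs,tea",
    "4,102,2014-03-05,soap,salt",
]

def split(line):
    """Naive CSV split; an assumed stand-in for the split helper."""
    return line.split(",")

def parse_date(text):
    """Assumed ISO date format; a stand-in for the parse_date helper."""
    return datetime.strptime(text, "%Y-%m-%d").date()

records = [split(line) for line in rows]
# Key assignment: (orderid, customerid, date) -> products
keyed = [((r[0], r[1], r[2]), r[3:]) for r in records]
# Order size, with the key reduced to (orderid, date)
sized = [((k[0], parse_date(k[2])), len(v)) for k, v in keyed]
# Swap so that the sort key is (size, date) and the value is the order ID
swapped = [((size, k[1]), k[0]) for k, size in sized]
# Descending sort by key, analogous to sortByKey(ascending=False)
ranked = sorted(swapped, key=lambda kv: kv[0], reverse=True)
# Swap back so that the order ID is the key again
result = [(orderid, key) for key, orderid in ranked]
top_ids = [orderid for orderid, _ in result]
print(top_ids)
```

With these sample rows the script prints `['3', '4', '1', '2']`: the three-product order comes first, and the two orders with two products each are ordered by date, latest first, because the composite key (size, date) is sorted in descending order as a whole.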
Publisher Resources

ISBN: 9787115479648