SPARK學習手冊

ISBN: 9789864760466

by Holden Karau, Andy Konwinski, Patrick We

September 2016

Intermediate to advanced

288 pages

6h 6m

Chinese

GoTop Information, Inc.


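Some input formats (for example, SequenceFiles) allow us to compress only the values in key/value pairs, which is useful for lookups. Other input formats have their own compression controls: for example, many of the formats in Twitter's Elephant Bird package use LZO compression.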

Filesystems

Spark can read from and write to a large number of filesystems, so we can use whichever one suits us.

Local/"Regular" FS

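Although Spark supports loading files from the local filesystem, it requires that the file be available at the same path on every compute node in the cluster.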

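Some network filesystems, such as NFS, AFS, and MapR's NFS layer, appear to the user as a regular filesystem. If your data is already in one of these systems, you can use it as an input source by specifying a file:// path. Spark will handle those files as long as the filesystem is mounted at the same path on every node (see Example 5-29).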

Example 5-29. Loading a compressed text file from the local filesystem in Scala

val rdd = sc.textFile("file:///home/holden/happypandas.gz")
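As a minimal sketch of how Example 5-29 might be run end to end, the program below creates its own SparkContext and reads the same file. The object name, the application name, the fallback `local[*]` master, and the helper `localFileUri` are illustrative assumptions rather than part of the original example. On a real cluster the path must still exist on every worker node.

```scala
import java.nio.file.Paths
import org.apache.spark.{SparkConf, SparkContext}

object LocalGzipExample {
  // Hypothetical helper: turn a local path into an absolute file:// URI string.
  def localFileUri(path: String): String =
    Paths.get(path).toAbsolutePath.toUri.toString

  def main(args: Array[String]): Unit = {
    // Assumption: fall back to local mode only when no master was supplied
    // (for example by spark-submit).
    val conf = new SparkConf()
      .setAppName("LocalGzipExample")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      // textFile picks a decompression codec from the .gz extension,
      // so the RDD contains plain text lines.
      val rdd = sc.textFile(localFileUri("/home/holden/happypandas.gz"))
      println(s"Read ${rdd.count()} lines")
    } finally {
      sc.stop()
    }
  }
}
```

Because gzip files cannot be split, Spark reads a single .gz file into one partition, which is worth keeping in mind for large inputs.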

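If your file is not already on every node in the cluster, you can first load it locally in the driver program without going through Spark, and then call parallelize to distribute the contents to the worker nodes. This approach can be quite slow, so we recommend putting your files in a shared filesystem such as HDFS, NFS, or S3.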
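The driver-side approach described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the book's own listing: the object name, the helper `readLocalLines`, the fallback `local[*]` master, and the uncompressed path `/home/holden/happypandas.txt` are all hypothetical. The whole file is held in driver memory before being shipped to the workers, which is one reason this approach scales poorly.

```scala
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object ParallelizeLocalFile {
  // Read every line of a driver-local file into memory, without involving Spark.
  def readLocalLines(path: String): Vector[String] = {
    val source = Source.fromFile(path, "UTF-8")
    try source.getLines().toVector finally source.close()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: fall back to local mode only when no master was supplied.
    val conf = new SparkConf()
      .setAppName("ParallelizeLocalFile")
      .setIfMissing("spark.master", "local[*]")
    val sc = new SparkContext(conf)
    try {
      // Hypothetical path that exists only on the driver machine.
      val lines = readLocalLines("/home/holden/happypandas.txt")
      // parallelize ships the in-memory collection from the driver to the workers.
      val rdd = sc.parallelize(lines)
      println(s"Distributed ${rdd.count()} lines across ${rdd.partitions.length} partitions")
    } finally {
      sc.stop()
    }
  }
}
```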

Amazon S3

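Amazon S3 is an increasingly popular option for storing large datasets. S3 access is especially fast when your compute nodes run inside Amazon EC2. If the files have to travel over the public internet, however, performance becomes considerably ...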
