
Learning Spark

by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

February 2015

Intermediate to advanced

276 pages

7h 18m

English

O'Reilly Media, Inc.



Content preview from Learning Spark

Chapter 4. Working with Key/Value Pairs

This chapter covers how to work with RDDs of key/value pairs, which are a common data type required for many operations in Spark. Key/value RDDs are commonly used to perform aggregations, and often we will do some initial ETL (extract, transform, and load) to get our data into a key/value format. Key/value RDDs expose new operations (e.g., counting up reviews for each product, grouping together data with the same key, and grouping together two different RDDs).

We also discuss an advanced feature that lets users control the layout of pair RDDs across nodes: partitioning. Using controllable partitioning, applications can sometimes greatly reduce communication costs by ensuring that data that is accessed together resides on the same node. This can provide significant speedups. We illustrate partitioning using the PageRank algorithm as an example. Choosing the right partitioning for a distributed dataset is similar to choosing the right data structure for a local one: in both cases, data layout can greatly affect performance.
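To make the idea of controllable partitioning concrete, here is a minimal plain-Python sketch; it does not use Spark itself. It assumes a hypothetical four-partition layout and uses CRC32 as a stand-in hash function (Spark's own hash partitioner works differently in detail). The datasets and the helper names `partition_for`, `hash_partition`, and `local_join` are invented for illustration. The point is only that when two keyed datasets are partitioned by the same function, matching keys end up in the same partition, so they can be joined without moving data between partitions.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of partitions for this sketch


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition index using a stable hash.

    CRC32 is an assumption made for reproducibility; it is not the
    hash function Spark uses.
    """
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def hash_partition(pairs, num_partitions=NUM_PARTITIONS):
    """Group (key, value) pairs into partitions by the hash of their key."""
    partitions = defaultdict(list)
    for key, value in pairs:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions


def local_join(left_parts, right_parts):
    """Join two co-partitioned datasets one partition at a time.

    Because both sides used the same partitioning function, each
    partition only needs to look at the partition with the same index
    on the other side.
    """
    joined = []
    for index, right_pairs in right_parts.items():
        left_lookup = dict(left_parts.get(index, []))
        for key, right_value in right_pairs:
            if key in left_lookup:
                joined.append((key, (left_lookup[key], right_value)))
    return joined


# Hypothetical data: user profiles and click events, both keyed by user ID.
profiles = [("alice", "US"), ("bob", "DE"), ("carol", "JP")]
clicks = [("alice", "/home"), ("bob", "/cart"), ("alice", "/buy"), ("carol", "/home")]

profile_parts = hash_partition(profiles)
click_parts = hash_partition(clicks)
joined = local_join(profile_parts, click_parts)
```

In this sketch every click record finds its profile inside its own partition, which is the property that lets a partitioned join avoid communication between nodes.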

Motivation

Spark provides special operations on RDDs containing key/value pairs. These RDDs are called pair RDDs. Pair RDDs are a useful building block in many programs, as they expose operations that allow you to act on each key in parallel or regroup data across the network. For example, pair RDDs have a reduceByKey() method that can aggregate data separately for each key, and a join() method that can merge two RDDs ...
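As a rough illustration of what reduceByKey() and join() compute, the following plain-Python sketch models both operations on ordinary lists of tuples. It runs locally, without Spark, and sorts its results only to make the output deterministic (Spark gives no such ordering guarantee). The sample review and price data and the helper names `reduce_by_key` and `join` are hypothetical.

```python
from collections import defaultdict


def reduce_by_key(pairs, func):
    """Combine all values that share a key using func, one key at a time.

    Local model of the result of rdd.reduceByKey(func) in PySpark.
    """
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return sorted(result.items())


def join(left, right):
    """Pair up values from two datasets that share a key (inner join).

    Local model of the result of left_rdd.join(right_rdd) in PySpark.
    Keys present on only one side are dropped.
    """
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    return sorted(
        (key, (left_value, right_value))
        for key, left_value in left
        for right_value in right_by_key.get(key, [])
    )


# Hypothetical data: (product, review score) and (product, price) pairs.
reviews = [("book", 5), ("lamp", 3), ("book", 4), ("lamp", 1), ("pen", 2)]
prices = [("book", 12.5), ("lamp", 30.0)]

totals = reduce_by_key(reviews, lambda a, b: a + b)
# totals == [("book", 9), ("lamp", 4), ("pen", 2)]

joined = join(totals, prices)
# joined == [("book", (9, 12.5)), ("lamp", (4, 30.0))]
```

With a running SparkContext `sc`, the corresponding PySpark calls would be `sc.parallelize(reviews).reduceByKey(lambda a, b: a + b)` followed by `.join(sc.parallelize(prices))`. There the per-key work can proceed in parallel across the cluster.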