Learning Spark

by Holden Karau, Andy Konwinski, Patrick Wendell, Matei Zaharia

February 2015

Intermediate to advanced

276 pages

7h 18m

English

O'Reilly Media, Inc.

Audience; How This Book Is Organized; Supporting Books; Conventions Used in This Book; Code Examples; Safari® Books Online; How to Contact Us; Content Updates; May 8, 2015; Acknowledgments
What Is Apache Spark?; A Unified Stack; Spark Core; Spark SQL; Spark Streaming; MLlib; GraphX; Cluster Managers; Who Uses Spark, and for What?; Data Science Tasks; Data Processing Applications; A Brief History of Spark; Spark Versions and Releases; Storage Layers for Spark
Downloading Spark; Introduction to Spark’s Python and Scala Shells; Introduction to Core Spark Concepts; Standalone Applications; Initializing a SparkContext; Building Standalone Applications; Conclusion
RDD Basics; Creating RDDs; RDD Operations; Transformations; Actions; Lazy Evaluation; Passing Functions to Spark; Python; Scala; Java; Common Transformations and Actions; Basic RDDs; Converting Between RDD Types; Persistence (Caching); Conclusion
Motivation; Creating Pair RDDs; Transformations on Pair RDDs; Aggregations; Grouping Data; Joins; Sorting Data; Actions Available on Pair RDDs; Data Partitioning (Advanced); Determining an RDD’s Partitioner; Operations That Benefit from Partitioning; Operations That Affect Partitioning; Example: PageRank; Custom Partitioners; Conclusion
Motivation; File Formats; Text Files; JSON; Comma-Separated Values and Tab-Separated Values; SequenceFiles; Object Files; Hadoop Input and Output Formats; File Compression; Filesystems; Local/“Regular” FS; Amazon S3; HDFS; Structured Data with Spark SQL; Apache Hive; JSON; Databases; Java Database Connectivity; Cassandra; HBase; Elasticsearch; Conclusion
Introduction; Accumulators; Accumulators and Fault Tolerance; Custom Accumulators; Broadcast Variables; Optimizing Broadcasts; Working on a Per-Partition Basis; Piping to External Programs; Numeric RDD Operations; Conclusion
Introduction; Spark Runtime Architecture; The Driver; Executors; Cluster Manager; Launching a Program; Summary; Deploying Applications with spark-submit; Packaging Your Code and Dependencies; A Java Spark Application Built with Maven; A Scala Spark Application Built with sbt; Dependency Conflicts; Scheduling Within and Between Spark Applications; Cluster Managers; Standalone Cluster Manager; Hadoop YARN; Apache Mesos; Amazon EC2; Which Cluster Manager to Use?; Conclusion
Configuring Spark with SparkConf; Components of Execution: Jobs, Tasks, and Stages; Finding Information; Spark Web UI; Driver and Executor Logs; Key Performance Considerations; Level of Parallelism; Serialization Format; Memory Management; Hardware Provisioning; Conclusion

Linking with Spark SQL; Using Spark SQL in Applications; Initializing Spark SQL; Basic Query Example; DataFrames; Caching; Loading and Saving Data; Apache Hive; Data Sources/Parquet; JSON; From RDDs; JDBC/ODBC Server; Working with Beeline; Long-Lived Tables and Queries; User-Defined Functions; Spark SQL UDFs; Hive UDFs; Spark SQL Performance; Performance Tuning Options; Conclusion
A Simple Example; Architecture and Abstraction; Transformations; Stateless Transformations; Stateful Transformations; Output Operations; Input Sources; Core Sources; Additional Sources; Multiple Sources and Cluster Sizing; 24/7 Operation; Checkpointing; Driver Fault Tolerance; Worker Fault Tolerance; Receiver Fault Tolerance; Processing Guarantees; Streaming UI; Performance Considerations; Batch and Window Sizes; Level of Parallelism; Garbage Collection and Memory Usage; Conclusion
Overview; System Requirements; Machine Learning Basics; Example: Spam Classification; Data Types; Working with Vectors; Algorithms; Feature Extraction; Statistics; Classification and Regression; Clustering; Collaborative Filtering and Recommendation; Dimensionality Reduction; Model Evaluation; Tips and Performance Considerations; Preparing Features; Configuring Algorithms; Caching RDDs to Reuse; Recognizing Sparsity; Level of Parallelism; Pipeline API; Conclusion

Content preview from Learning Spark

Preface

As parallel data analysis has grown common, practitioners in many fields have sought easier tools for this task. Apache Spark has quickly emerged as one of the most popular, extending and generalizing MapReduce. Spark offers three main benefits. First, it is easy to use—you can develop applications on your laptop, using a high-level API that lets you focus on the content of your computation. Second, Spark is fast, enabling interactive use and complex algorithms. And third, Spark is a general engine, letting you combine multiple types of computations (e.g., SQL queries, text processing, and machine learning) that might previously have required different engines. These features make Spark an excellent starting point to learn about Big Data in general.

This introductory book is meant to get you up and running with Spark quickly. You’ll learn how to download and run Spark on your laptop and use it interactively to learn the API. Once there, we’ll cover the details of available operations and distributed execution. Finally, you’ll get a tour of the higher-level libraries built into Spark, including libraries for machine learning, stream processing, and SQL. We hope that this book gives you the tools to quickly tackle data analysis problems, whether you do so on one machine or hundreds.

Audience

This book targets data scientists and engineers. We chose these two groups because they have the most to gain from using Spark to expand the scope of problems they can solve. Spark’s ...