Spark: The Definitive Guide

by Bill Chambers, Matei Zaharia

February 2018

Intermediate to advanced

606 pages

14h 54m

English

O'Reilly Media, Inc.

Preface
About the Authors; Who This Book Is For; Conventions Used in This Book; Using Code Examples; O’Reilly Safari; How to Contact Us; Acknowledgments
I. Gentle Overview of Big Data and Spark
1. What Is Apache Spark?
Apache Spark’s Philosophy; Context: The Big Data Problem; History of Spark; The Present and Future of Spark; Running Spark; Downloading Spark Locally; Launching Spark’s Interactive Consoles; Running Spark in the Cloud; Data Used in This Book
2. A Gentle Introduction to Spark
Spark’s Basic Architecture; Spark Applications; Spark’s Language APIs; Spark’s APIs; Starting Spark; The SparkSession; DataFrames; Partitions; Transformations; Lazy Evaluation; Actions; Spark UI; An End-to-End Example; DataFrames and SQL; Conclusion
3. A Tour of Spark’s Toolset
Running Production Applications; Datasets: Type-Safe Structured APIs; Structured Streaming; Machine Learning and Advanced Analytics; Lower-Level APIs; SparkR; Spark’s Ecosystem and Packages; Conclusion
II. Structured APIs—DataFrames, SQL, and Datasets
4. Structured API Overview
DataFrames and Datasets; Schemas; Overview of Structured Spark Types; DataFrames Versus Datasets; Columns; Rows; Spark Types; Overview of Structured API Execution; Logical Planning; Physical Planning; Execution; Conclusion
5. Basic Structured Operations
Schemas; Columns and Expressions; Columns; Expressions; Records and Rows; Creating Rows; DataFrame Transformations; Creating DataFrames; select and selectExpr; Converting to Spark Types (Literals); Adding Columns; Renaming Columns; Reserved Characters and Keywords; Case Sensitivity; Removing Columns; Changing a Column’s Type (cast); Filtering Rows; Getting Unique Rows; Random Samples; Random Splits; Concatenating and Appending Rows (Union); Sorting Rows; Limit; Repartition and Coalesce; Collecting Rows to the Driver; Conclusion
6. Working with Different Types of Data
Where to Look for APIs; Converting to Spark Types; Working with Booleans; Working with Numbers; Working with Strings; Regular Expressions; Working with Dates and Timestamps; Working with Nulls in Data; Coalesce; ifnull, nullIf, nvl, and nvl2; drop; fill; replace; Ordering; Working with Complex Types; Structs; Arrays; split; Array Length; array_contains; explode; Maps; Working with JSON; User-Defined Functions; Conclusion
7. Aggregations
Aggregation Functions; count; countDistinct; approx_count_distinct; first and last; min and max; sum; sumDistinct; avg; Variance and Standard Deviation; skewness and kurtosis; Covariance and Correlation; Aggregating to Complex Types; Grouping; Grouping with Expressions; Grouping with Maps; Window Functions; Grouping Sets; Rollups; Cube; Grouping Metadata; Pivot; User-Defined Aggregation Functions; Conclusion

8. Joins
Join Expressions; Join Types; Inner Joins; Outer Joins; Left Outer Joins; Right Outer Joins; Left Semi Joins; Left Anti Joins; Natural Joins; Cross (Cartesian) Joins; Challenges When Using Joins; Joins on Complex Types; Handling Duplicate Column Names; How Spark Performs Joins; Communication Strategies; Conclusion
9. Data Sources
The Structure of the Data Sources API; Read API Structure; Basics of Reading Data; Write API Structure; Basics of Writing Data; CSV Files; CSV Options; Reading CSV Files; Writing CSV Files; JSON Files; JSON Options; Reading JSON Files; Writing JSON Files; Parquet Files; Reading Parquet Files; Writing Parquet Files; ORC Files; Reading Orc Files; Writing Orc Files; SQL Databases; Reading from SQL Databases; Query Pushdown; Writing to SQL Databases; Text Files; Reading Text Files; Writing Text Files; Advanced I/O Concepts; Splittable File Types and Compression; Reading Data in Parallel; Writing Data in Parallel; Writing Complex Types; Managing File Size; Conclusion
10. Spark SQL
What Is SQL?; Big Data and SQL: Apache Hive; Big Data and SQL: Spark SQL; Spark’s Relationship to Hive; How to Run Spark SQL Queries; Spark SQL CLI; Spark’s Programmatic SQL Interface; SparkSQL Thrift JDBC/ODBC Server; Catalog; Tables; Spark-Managed Tables; Creating Tables; Creating External Tables; Inserting into Tables; Describing Table Metadata; Refreshing Table Metadata; Dropping Tables; Caching Tables; Views; Creating Views; Dropping Views; Databases; Creating Databases; Setting the Database; Dropping Databases; Select Statements; case…when…then Statements; Advanced Topics; Complex Types; Functions; Subqueries; Miscellaneous Features; Configurations; Setting Configuration Values in SQL; Conclusion
11. Datasets
When to Use Datasets; Creating Datasets; In Java: Encoders; In Scala: Case Classes; Actions; Transformations; Filtering; Mapping; Joins; Grouping and Aggregations; Conclusion
III. Low-Level APIs
12. Resilient Distributed Datasets (RDDs)
What Are the Low-Level APIs?; When to Use the Low-Level APIs?; How to Use the Low-Level APIs?; About RDDs; Types of RDDs; When to Use RDDs?; Datasets and RDDs of Case Classes; Creating RDDs; Interoperating Between DataFrames, Datasets, and RDDs; From a Local Collection; From Data Sources; Manipulating RDDs; Transformations; distinct; filter; map; sort; Random Splits; Actions; reduce; count; first; max and min; take; Saving Files; saveAsTextFile; SequenceFiles; Hadoop Files; Caching; Checkpointing; Pipe RDDs to System Commands; mapPartitions; foreachPartition; glom; Conclusion
13. Advanced RDDs
Key-Value Basics (Key-Value RDDs); keyBy; Mapping over Values; Extracting Keys and Values; lookup; sampleByKey; Aggregations; countByKey; Understanding Aggregation Implementations; Other Aggregation Methods; CoGroups; Joins; Inner Join; zips; Controlling Partitions; coalesce; repartition; repartitionAndSortWithinPartitions; Custom Partitioning; Custom Serialization; Conclusion
14. Distributed Shared Variables
Broadcast Variables; Accumulators; Basic Example; Custom Accumulators; Conclusion
IV. Production Applications
15. How Spark Runs on a Cluster
The Architecture of a Spark Application; Execution Modes; The Life Cycle of a Spark Application (Outside Spark); Client Request; Launch; Execution; Completion; The Life Cycle of a Spark Application (Inside Spark); The SparkSession; Logical Instructions; A Spark Job; Stages; Tasks; Execution Details; Pipelining; Shuffle Persistence; Conclusion
16. Developing Spark Applications
Writing Spark Applications; A Simple Scala-Based App; Writing Python Applications; Writing Java Applications; Testing Spark Applications; Strategic Principles; Tactical Takeaways; Connecting to Unit Testing Frameworks; Connecting to Data Sources; The Development Process; Launching Applications; Application Launch Examples; Configuring Applications; The SparkConf; Application Properties; Runtime Properties; Execution Properties; Configuring Memory Management; Configuring Shuffle Behavior; Environmental Variables; Job Scheduling Within an Application; Conclusion
17. Deploying Spark
Where to Deploy Your Cluster to Run Spark Applications; On-Premises Cluster Deployments; Spark in the Cloud; Cluster Managers; Standalone Mode; Spark on YARN; Configuring Spark on YARN Applications; Spark on Mesos; Secure Deployment Configurations; Cluster Networking Configurations; Application Scheduling; Miscellaneous Considerations; Conclusion
18. Monitoring and Debugging
The Monitoring Landscape; What to Monitor; Driver and Executor Processes; Queries, Jobs, Stages, and Tasks; Spark Logs; The Spark UI; Spark REST API; Spark UI History Server; Debugging and Spark First Aid; Spark Jobs Not Starting; Errors Before Execution; Errors During Execution; Slow Tasks or Stragglers; Slow Aggregations; Slow Joins; Slow Reads and Writes; Driver OutOfMemoryError or Driver Unresponsive; Executor OutOfMemoryError or Executor Unresponsive; Unexpected Nulls in Results; No Space Left on Disk Errors; Serialization Errors; Conclusion
19. Performance Tuning
Indirect Performance Enhancements; Design Choices; Object Serialization in RDDs; Cluster Configurations; Scheduling; Data at Rest; Shuffle Configurations; Memory Pressure and Garbage Collection; Direct Performance Enhancements; Parallelism; Improved Filtering; Repartitioning and Coalescing; User-Defined Functions (UDFs); Temporary Data Storage (Caching); Joins; Aggregations; Broadcast Variables; Conclusion
V. Streaming
20. Stream Processing Fundamentals
What Is Stream Processing?; Stream Processing Use Cases; Advantages of Stream Processing; Challenges of Stream Processing; Stream Processing Design Points; Record-at-a-Time Versus Declarative APIs; Event Time Versus Processing Time; Continuous Versus Micro-Batch Execution; Spark’s Streaming APIs; The DStream API; Structured Streaming; Conclusion
21. Structured Streaming Basics
Structured Streaming Basics; Core Concepts; Transformations and Actions; Input Sources; Sinks; Output Modes; Triggers; Event-Time Processing; Structured Streaming in Action; Transformations on Streams; Selections and Filtering; Aggregations; Joins; Input and Output; Where Data Is Read and Written (Sources and Sinks); Reading from the Kafka Source; Writing to the Kafka Sink; How Data Is Output (Output Modes); When Data Is Output (Triggers); Streaming Dataset API; Conclusion
22. Event-Time and Stateful Processing
Event Time; Stateful Processing; Arbitrary Stateful Processing; Event-Time Basics; Windows on Event Time; Tumbling Windows; Handling Late Data with Watermarks; Dropping Duplicates in a Stream; Arbitrary Stateful Processing; Time-Outs; Output Modes; mapGroupsWithState; flatMapGroupsWithState; Conclusion
23. Structured Streaming in Production
Fault Tolerance and Checkpointing; Updating Your Application; Updating Your Streaming Application Code; Updating Your Spark Version; Sizing and Rescaling Your Application; Metrics and Monitoring; Query Status; Recent Progress; Spark UI; Alerting; Advanced Monitoring with the Streaming Listener; Conclusion
VI. Advanced Analytics and Machine Learning
24. Advanced Analytics and Machine Learning Overview
A Short Primer on Advanced Analytics; Supervised Learning; Recommendation; Unsupervised Learning; Graph Analytics; The Advanced Analytics Process; Spark’s Advanced Analytics Toolkit; What Is MLlib?; High-Level MLlib Concepts; MLlib in Action; Feature Engineering with Transformers; Estimators; Pipelining Our Workflow; Training and Evaluation; Persisting and Applying Models; Deployment Patterns; Conclusion
25. Preprocessing and Feature Engineering
Formatting Models According to Your Use Case; Transformers; Estimators for Preprocessing; Transformer Properties; High-Level Transformers; RFormula; SQL Transformers; VectorAssembler; Working with Continuous Features; Bucketing; Scaling and Normalization; StandardScaler; Working with Categorical Features; StringIndexer; Converting Indexed Values Back to Text; Indexing in Vectors; One-Hot Encoding; Text Data Transformers; Tokenizing Text; Removing Common Words; Creating Word Combinations; Converting Words into Numerical Representations; Word2Vec; Feature Manipulation; PCA; Interaction; Polynomial Expansion; Feature Selection; ChiSqSelector; Advanced Topics; Persisting Transformers; Writing a Custom Transformer; Conclusion
26. Classification
Use Cases; Types of Classification; Binary Classification; Multiclass Classification; Multilabel Classification; Classification Models in MLlib; Model Scalability; Logistic Regression; Model Hyperparameters; Training Parameters; Prediction Parameters; Example; Model Summary; Decision Trees; Model Hyperparameters; Training Parameters; Prediction Parameters; Random Forest and Gradient-Boosted Trees; Model Hyperparameters; Training Parameters; Prediction Parameters; Naive Bayes; Model Hyperparameters; Training Parameters; Prediction Parameters; Evaluators for Classification and Automating Model Tuning; Detailed Evaluation Metrics; One-vs-Rest Classifier; Multilayer Perceptron; Conclusion
27. Regression
Use Cases; Regression Models in MLlib; Model Scalability; Linear Regression; Model Hyperparameters; Training Parameters; Example; Training Summary; Generalized Linear Regression; Model Hyperparameters; Training Parameters; Prediction Parameters; Example; Training Summary; Decision Trees; Model Hyperparameters; Training Parameters; Example; Random Forests and Gradient-Boosted Trees; Model Hyperparameters; Training Parameters; Example; Advanced Methods; Survival Regression (Accelerated Failure Time); Isotonic Regression; Evaluators and Automating Model Tuning; Metrics; Conclusion
28. Recommendation
Use Cases; Collaborative Filtering with Alternating Least Squares; Model Hyperparameters; Training Parameters; Prediction Parameters; Example; Evaluators for Recommendation; Metrics; Regression Metrics; Ranking Metrics; Frequent Pattern Mining; Conclusion
29. Unsupervised Learning
Use Cases; Model Scalability; k-means; Model Hyperparameters; Training Parameters; Example; k-means Metrics Summary; Bisecting k-means; Model Hyperparameters; Training Parameters; Example; Bisecting k-means Summary; Gaussian Mixture Models; Model Hyperparameters; Training Parameters; Example; Gaussian Mixture Model Summary; Latent Dirichlet Allocation; Model Hyperparameters; Training Parameters; Prediction Parameters; Example; Conclusion
30. Graph Analytics
Building a Graph; Querying the Graph; Subgraphs; Motif Finding; Graph Algorithms; PageRank; In-Degree and Out-Degree Metrics; Breadth-First Search; Connected Components; Strongly Connected Components; Advanced Tasks; Conclusion
31. Deep Learning
What Is Deep Learning?; Ways of Using Deep Learning in Spark; Deep Learning Libraries; MLlib Neural Network Support; TensorFrames; BigDL; TensorFlowOnSpark; DeepLearning4J; Deep Learning Pipelines; A Simple Example with Deep Learning Pipelines; Setup; Images and DataFrames; Transfer Learning; Applying Popular Models; Conclusion
VII. Ecosystem
32. Language Specifics: Python (PySpark) and R (SparkR and sparklyr)
PySpark; Fundamental PySpark Differences; Pandas Integration; R on Spark; SparkR; sparklyr; Conclusion
33. Ecosystem and Community
Spark Packages; An Abridged List of Popular Packages; Using Spark Packages; External Packages; Community; Spark Summit; Local Meetups; Conclusion
Index
About the Authors

Content preview from Spark: The Definitive Guide

Chapter 9. Data Sources

This chapter formally introduces the data sources that you can use with Spark out of the box, as well as the many others built by the wider community. Spark has six “core” data sources and hundreds of external data sources written by the community. The ability to read from and write to many different kinds of data sources, and for the community to contribute its own, is arguably one of Spark’s greatest strengths. Spark’s core data sources are as follows:

CSV
JSON
Parquet
ORC
JDBC/ODBC connections
Plain-text files

As mentioned, Spark has numerous community-created data sources. Here’s just a small sample:

Cassandra
HBase
MongoDB
AWS Redshift
XML
And many, many others

The goal of this chapter is to enable you to read from and write to Spark’s core data sources, and to know what to look for when integrating with third-party data sources. To achieve this, we focus on the core concepts that you need to be able to recognize and understand.

The Structure of the Data Sources API

Before looking at how to read and write specific formats, let’s review the overall structure of the data source APIs.

Read API Structure

The core structure for reading data is as follows:

DataFrameReader.format(...).option("key", "value").schema(...).load()

We will use this format to read from all of our data sources. format is optional because by default Spark will ...
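The read chain shown above can be sketched in Python as a small helper that applies each step in order. This is a minimal illustration rather than the book's own code: `load_with_reader` and `RecordingReader` are hypothetical names, the path and schema string are made up, and the stand-in reader only records calls so that the example runs without a Spark installation.

```python
from typing import Any, List, Mapping, Optional, Tuple


def load_with_reader(
    reader: Any,
    fmt: str,
    path: str,
    options: Optional[Mapping[str, str]] = None,
    schema: Optional[Any] = None,
) -> Any:
    """Apply the format -> option -> schema -> load chain to `reader`.

    `reader` is assumed to behave like a DataFrameReader (in PySpark,
    the object returned by `spark.read`). This helper is hypothetical.
    """
    reader = reader.format(fmt)
    for key, value in (options or {}).items():
        reader = reader.option(key, value)  # one key/value pair per call
    if schema is not None:
        reader = reader.schema(schema)  # skipped when no schema is given
    return reader.load(path)


class RecordingReader:
    """Stand-in reader that records each call instead of touching data."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Any]] = []

    def format(self, source: str) -> "RecordingReader":
        self.calls.append(("format", source))
        return self

    def option(self, key: str, value: str) -> "RecordingReader":
        self.calls.append(("option", (key, value)))
        return self

    def schema(self, schema: Any) -> "RecordingReader":
        self.calls.append(("schema", schema))
        return self

    def load(self, path: str) -> List[Tuple[str, Any]]:
        self.calls.append(("load", path))
        return self.calls


# Hypothetical path and schema, chosen only for illustration.
calls = load_with_reader(
    RecordingReader(),
    fmt="csv",
    path="/tmp/example/data.csv",
    options={"header": "true"},
    schema="id INT, name STRING",
)
for name, argument in calls:
    print(name, argument)
```

With a live session (assumed here to be named `spark`), passing `spark.read` in place of `RecordingReader()` should run the same chain against a real DataFrameReader and return a DataFrame. Whether a string schema such as the one above is accepted depends on the Spark version in use.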

Publisher Resources

ISBN: 9781491912201
