Spark in Action, Second Edition

Name: Spark in Action, Second Edition
Author: Jean-Georges Perrin
ISBN: 9781617295522

by Jean-Georges Perrin

June 2020

Intermediate to advanced

576 pages

15h 41m

English

Manning Publications

Copyright
brief contents
contents
front matter
  foreword
    The analytics operating system
  preface
  acknowledgments
  about this book
    Who should read this book
    What will you learn in this book?
    How this book is organized
    About the code
    liveBook discussion forum
  about the author
  about the cover illustration
Part 1. The theory crippled by awesome examples
1. So, what is Spark, anyway?
  1.1 The big picture: What Spark is and what it does
    1.1.1 What is Spark?
    1.1.2 The four pillars of mana
  1.2 How can you use Spark?
    1.2.1 Spark in a data processing/engineering scenario
    1.2.2 Spark in a data science scenario
  1.3 What can you do with Spark?
    1.3.1 Spark predicts restaurant quality at NC eateries
    1.3.2 Spark allows fast data transfer for Lumeris
    1.3.3 Spark analyzes equipment logs for CERN
    1.3.4 Other use cases
  1.4 Why you will love the dataframe
    1.4.1 The dataframe from a Java perspective
    1.4.2 The dataframe from an RDBMS perspective
    1.4.3 A graphical representation of the dataframe
  1.5 Your first example
    1.5.1 Recommended software
    1.5.2 Downloading the code
    1.5.3 Running your first application
      Command line
      Eclipse
    1.5.4 Your first code
  Summary
2. Architecture and flow
  2.1 Building your mental model
  2.2 Using Java code to build your mental model
  2.3 Walking through your application
    2.3.1 Connecting to a master
    2.3.2 Loading, or ingesting, the CSV file
    2.3.3 Transforming your data
    2.3.4 Saving the work done in your dataframe to a database
  Summary
3. The majestic role of the dataframe
  3.1 The essential role of the dataframe in Spark
    3.1.1 Organization of a dataframe
    3.1.2 Immutability is not a swear word
  3.2 Using dataframes through examples
    3.2.1 A dataframe after a simple CSV ingestion
    3.2.2 Data is stored in partitions
    3.2.3 Digging in the schema
    3.2.4 A dataframe after a JSON ingestion
    3.2.5 Combining two dataframes
  3.3 The dataframe is a Dataset<Row>
    3.3.1 Reusing your POJOs
    3.3.2 Creating a dataset of strings
    3.3.3 Converting back and forth
      Create the dataset
      Create the dataframe
  3.4 Dataframe’s ancestor: the RDD
  Summary
4. Fundamentally lazy
  4.1 A real-life example of efficient laziness
  4.2 A Spark example of efficient laziness
    4.2.1 Looking at the results of transformations and actions
    4.2.2 The transformation process, step by step
    4.2.3 The code behind the transformation/action process
    4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
      The mystery behind the timing of actions
  4.3 Comparing to RDBMS and traditional applications
    4.3.1 Working with the teen birth rates dataset
    4.3.2 Analyzing differences between a traditional app and a Spark app
  4.4 Spark is amazing for data-focused applications
  4.5 Catalyst is your app catalyzer
  Summary
5. Building a simple app for deployment
  5.1 An ingestionless example
    5.1.1 Calculating π
    5.1.2 The code to approximate π
    5.1.3 What are lambda functions in Java?
    5.1.4 Approximating π by using lambda functions
  5.2 Interacting with Spark
    5.2.1 Local mode
    5.2.2 Cluster mode
      Submitting a job to Spark
      Setting the cluster’s master in your application
    5.2.3 Interactive mode in Scala and Python
      Scala shell
      Python shell
  Summary

6. Deploying your simple app
  6.1 Beyond the example: The role of the components
    6.1.1 Quick overview of the components and their interactions
    6.1.2 Troubleshooting tips for the Spark architecture
    6.1.3 Going further
  6.2 Building a cluster
    6.2.1 Building a cluster that works for you
    6.2.2 Setting up the environment
  6.3 Building your application to run on the cluster
    6.3.1 Building your application’s uber JAR
    6.3.2 Building your application by using Git and Maven
  6.4 Running your application on the cluster
    6.4.1 Submitting the uber JAR
    6.4.2 Running the application
    6.4.3 the Spark user interface
  Summary
Part 2. Ingestion
7. Ingestion from files
  7.1 Common behaviors of parsers
  7.2 Complex ingestion from CSV
    7.2.1 Desired output
    7.2.2 Code
  7.3 Ingesting a CSV with a known schema
    7.3.1 Desired output
    7.3.2 Code
  7.4 Ingesting a JSON file
    7.4.1 Desired output
    7.4.2 Code
  7.5 Ingesting a multiline JSON file
    7.5.1 Desired output
    7.5.2 Code
  7.6 Ingesting an XML file
    7.6.1 Desired output
    7.6.2 Code
  7.7 Ingesting a text file
    7.7.1 Desired output
    7.7.2 Code
  7.8 File formats for big data
    7.8.1 The problem with traditional file formats
    7.8.2 Avro is a schema-based serialization format
    7.8.3 ORC is a columnar storage format
    7.8.4 Parquet is also a columnar storage format
    7.8.5 Comparing Avro, ORC, and Parquet
  7.9 Ingesting Avro, ORC, and Parquet files
    7.9.1 Ingesting Avro
    7.9.2 Ingesting ORC
    7.9.3 Ingesting Parquet
    7.9.4 Reference table for ingesting Avro, ORC, or Parquet
  Summary
8. Ingestion from databases
  8.1 Ingestion from relational databases
    8.1.1 Database connection checklist
    8.1.2 Understanding the data used in the examples
    8.1.3 Desired output
    8.1.4 Code
    8.1.5 Alternative code
  8.2 The role of the dialect
    8.2.1 What is a dialect, anyway?
    8.2.2 JDBC dialects provided with Spark
    8.2.3 Building your own dialect
  8.3 Advanced queries and ingestion
    8.3.1 Filtering by using a WHERE clause
    8.3.2 Joining data in the database
    8.3.3 Performing ingestion and partitioning
    8.3.4 Summary of advanced features
  8.4 Ingestion from Elasticsearch
    8.4.1 Data flow
    8.4.2 The New York restaurants dataset digested by Spark
    8.4.3 Code to ingest the restaurant dataset from Elasticsearch
  Summary
9. Advanced ingestion: finding data sources and building your own
  9.1 What is a data source?
  9.2 Benefits of a direct connection to a data source
    9.2.1 Temporary files
    9.2.2 Data quality scripts
    9.2.3 Data on demand
  9.3 Finding data sources at Spark Packages
  9.4 Building your own data source
    9.4.1 Scope of the example project
    9.4.2 Your data source API and options
  9.5 Behind the scenes: Building the data source itself
  9.6 Using the register file and the advertiser class
  9.7 Understanding the relationship between the data and schema
    9.7.1 The data source builds the relation
    9.7.2 Inside the relation
  9.8 Building the schema from a JavaBean
  9.9 Building the dataframe is magic with the utilities
  9.10 The other classes
  Summary
10. Ingestion through structured streaming
  10.1 What’s streaming?
  10.2 Creating your first stream
    10.2.1 Generating a file stream
    10.2.2 Consuming the records
    10.2.3 Getting records, not lines
  10.3 Ingesting data from network streams
  10.4 Dealing with multiple streams
  10.5 Differentiating discretized and structured streaming
  Summary
Part 3. Transforming your data
11. Working with SQL
  11.1 Working with Spark SQL
  11.2 The difference between local and global views
  11.3 Mixing the dataframe API and Spark SQL
  11.4 Don’t DELETE it!
  11.5 Going further with SQL
  Summary
12. Transforming your data
  12.1 What is data transformation?
  12.2 Process and example of record-level transformation
    12.2.1 Data discovery to understand the complexity
    12.2.2 Data mapping to draw the process
    12.2.3 Writing the transformation code
    12.2.4 Reviewing your data transformation to ensure a quality process
      What about sorting?
      Wrapping up your first Spark transformation
  12.3 Joining datasets
    12.3.1 A closer look at the datasets to join
    12.3.2 Building the list of higher education institutions per county
      Initialization of Spark
      Loading and preparing the data
    12.3.3 Performing the joins
      Joining the FIPS county identifier with the higher ed dataset using a join
      Joining the census data to get the county name
  12.4 Performing more transformations
  Summary
13. Transforming entire documents
  13.1 Transforming entire documents and their structure
    13.1.1 Flattening your JSON document
    13.1.2 Building nested documents for transfer and storage
  13.2 The magic behind static functions
  13.3 Performing more transformations
  Summary
14. Extending transformations with user-defined functions
  14.1 Extending Apache Spark
  14.2 Registering and calling a UDF
    14.2.1 Registering the UDF with Spark
    14.2.2 Using the UDF with the dataframe API
    14.2.3 Manipulating UDFs with SQL
    14.2.4 Implementing the UDF
    14.2.5 Writing the service itself
  14.3 Using UDFs to ensure a high level of data quality
  14.4 Considering UDFs’ constraints
  Summary
15. Aggregating your data
  15.1 Aggregating data with Spark
    15.1.1 A quick reminder on aggregations
    15.1.2 Performing basic aggregations with Spark
      Performing an aggregation using the dataframe API
      Performing an aggregation using Spark SQL
  15.2 Performing aggregations with live data
    15.2.1 Preparing your dataset
    15.2.2 Aggregating data to better understand the schools
      What is the average enrollment for each school?
      What is the evolution of the number of students?
      What is the higher enrollment per school and year?
      What is the minimal absenteeism per school?
      Which are the five schools with the least and most absenteeism?
  15.3 Building custom aggregations with UDAFs
  Summary
Part 4. Going further
16. Cache and checkpoint: Enhancing Spark’s performances
  16.1 Caching and checkpointing can increase performance
    16.1.1 The usefulness of Spark caching
    16.1.2 The subtle effectiveness of Spark checkpointing
    16.1.3 Using caching and checkpointing
  16.2 Caching in action
  16.3 Going further in performance optimization
  Summary
17. Exporting data and building full data pipelines
  17.1 Exporting data
    17.1.1 Building a pipeline with NASA datasets
    17.1.2 Transforming columns to datetime
    17.1.3 Transforming the confidence percentage to confidence level
    17.1.4 Exporting the data
    17.1.5 Exporting the data: What really happened?
  17.2 Delta Lake: Enjoying a database close to your system
    17.2.1 Understanding why a database is needed
    17.2.2 Using Delta Lake in your data pipeline
    17.2.3 Consuming data from Delta Lake
      Number of meetings per department
      Number of meetings per type of organizer
  17.3 Accessing cloud storage services from Spark
    Amazon S3
    Google Cloud Storage
    IBM COS
    Microsoft Azure Blob Storage
    OVH Object Storage
  Summary
18. Exploring deployment constraints: Understanding the ecosystem
  18.1 Managing resources with YARN, Mesos, and Kubernetes
    18.1.1 The built-in standalone mode manages resources
    18.1.2 YARN manages resources in a Hadoop environment
    18.1.3 Mesos is a standalone resource manager
    18.1.4 Kubernetes orchestrates containers
    18.1.5 Choosing the right resource manager
  18.2 Sharing files with Spark
    18.2.1 Accessing the data contained in files
    18.2.2 Sharing files through distributed filesystems
    18.2.3 Accessing files on shared drives or file server
    18.2.4 Using file-sharing services to distribute files
    18.2.5 Other options for accessing files in Spark
    18.2.6 Hybrid solution for sharing files with Spark
  18.3 Making sure your Spark application is secure
    18.3.1 Securing the network components of your infrastructure
    18.3.2 Securing Spark’s disk usage
  Summary
Appendixes.
Appendix A. Installing Eclipse
  A.1 Eclipse
  A.2 Running Eclipse for the first time
Appendix B. Installing Maven
  B.1 Installation on Windows
  B.2 Installation on macOS
Appendix C. Installing Git
  C.1 Installing Git on Windows
  C.2 Installing Git on macOS
  C.3 Installing Git on Ubuntu
    $ sudo apt install git
  C.4 Installing Git on RHEL / Amazon EMR
    $ sudo yum install -y git
  C.5 Other tools to consider
Appendix D. Downloading the code and getting started with Eclipse
  D.1 Downloading the source code from the command line
  D.2 Getting started in Eclipse
Appendix E. A history of enterprise data
  E.1 The enterprise problem
  E.2 The solution is--hmmm, was--the data warehouse
  E.3 The ephemeral data lake
  E.4 Lightning-fast cluster computing
  E.5 Java rules, but we’re okay with Python
Appendix F. Getting help with relational databases
  F.1 IBM Informix
    F.1.1 Installing Informix on macOS
    F.1.2 Installing Informix on Windows
  F.2 MariaDB
    F.2.1 Installing MariaDB on macOS
    F.2.2 Installing MariaDB on Windows
  F.3 MySQL (Oracle)
    F.3.1 Installing MySQL on macOS
    F.3.2 Installing MySQL on Windows
    F.3.3 Loading the Sakila database
  F.4 PostgreSQL
    F.4.1 Installing PostgreSQL on macOS and Windows
    F.4.2 Installing PostgreSQL on Linux
    F.4.3 GUI clients for PostgreSQL
Appendix G. Static functions ease your transformations
  G.1 Functions per category
    G.1.1 Popular functions
    G.1.2 Aggregate functions
    G.1.3 Arithmetical functions
    G.1.4 Array manipulation functions
    G.1.5 Binary operations
    G.1.6 Byte functions
    G.1.7 Comparison functions
    G.1.8 Compute function
    G.1.9 Conditional operations
    G.1.10 Conversion functions
    G.1.11 Data shape functions
    G.1.12 Date and time functions
    G.1.13 Digest functions
    G.1.14 Encoding functions
    G.1.15 Formatting functions
    G.1.16 JSON functions
    G.1.17 List functions
    G.1.18 Map functions
    G.1.19 Mathematical functions
    G.1.20 Navigation functions
    G.1.21 Parsing functions
    G.1.22 Partition functions
    G.1.23 Rounding functions
    G.1.24 Sorting functions
    G.1.25 Statistical functions
    G.1.26 Streaming functions
    G.1.27 String functions
    G.1.28 Technical functions
    G.1.29 Trigonometry functions
    G.1.30 UDF helpers
    G.1.31 Validation functions
    G.1.32 Deprecated functions
  G.2 Function appearance per version of Spark
    G.2.1 Functions in Spark v3.0.0
    G.2.2 Functions in Spark v2.4.0
    G.2.3 Functions in Spark v2.3.0
    G.2.4 Functions in Spark v2.2.0
    G.2.5 Functions in Spark v2.1.0
    G.2.6 Functions in Spark v2.0.0
    G.2.7 Functions in Spark v1.6.0
    G.2.8 Functions in Spark v1.5.0
    G.2.9 Functions in Spark v1.4.0
    G.2.10 Functions in Spark v1.3.0
Appendix H. Maven quick cheat sheet
  H.1 Source of packages
  H.2 Useful commands
  H.3 Typical Maven life cycle
  H.4 Useful configuration
    H.4.1 Built-in properties
    H.4.2 Building an uber JAR
    H.4.3 Including the source code
    H.4.4 Executing from Maven
Appendix I. Reference for transformations and actions
  I.1 Transformations
  I.2 Actions
Appendix J. Enough Scala
  J.1 What is Scala
  J.2 Scala to Java conversion
    J.2.1 General conversions
    J.2.2 Maps: Conversion from Scala to Java
Appendix K. Installing Spark in production and a few tips
  K.1 Installation
    K.1.1 Installing Spark on Windows
    K.1.2 Installing Spark on macOS
    K.1.3 Installing Spark on Ubuntu
      Figure K.1 Getting the real download URL for Apache Spark so you can copy it to your command line
    K.1.4 Installing Spark on AWS EMR
  K.2 Understanding the installation
  K.3 Configuration
    K.3.1 Properties syntax
    K.3.2 Application configuration
    K.3.3 Runtime configuration
    K.3.4 Other configuration points
Appendix L. Reference for ingestion
  L.1 Spark datatypes
  L.2 Options for CSV ingestion
  L.3 Options for JSON ingestion
  L.4 Options for XML ingestion
  L.5 Methods for building a full dialect
  L.6 Options for ingesting and writing data from/to a database
  L.7 Options for ingesting and writing data from/to Elasticsearch
Appendix M. Reference for joins
  M.1 Setting up the decorum
  M.2 Performing an inner join
  M.3 Performing an outer join
  M.4 Performing a left, or left-outer, join
  M.5 Performing a right, or right-outer, join
  M.6 Performing a left-semi join
  M.7 Performing a left-anti join
  M.9 Performing a cross-join
Appendix N. Installing Elasticsearch and sample data
  N.1 Installing the software
    N.1.1 All platforms
    N.1.2 macOS with Homebrew
  N.2 Installing the NYC restaurant dataset
  N.3 Understanding Elasticsearch terminology
  N.4 Working with useful commands
    N.4.1 Get the server status
    N.4.2 Display the structure
    N.4.3 Count documents
Appendix O. Generating streaming data
  O.1 Need for generating streaming data
  O.2 A simple stream
  O.3 Joined data
  O.4 Types of fields
Appendix P. Reference for streaming
  P.1 Output mode
  P.2 Sinks
  P.3 Sinks, output modes, and options
  P.4 Examples of using the various sinks
    P.4.1 Output in a file
    P.4.2 Output to a Kafka topic
    P.4.3 Processing streamed records through foreach
    P.4.4 Output in memory and processing from memory
Appendix Q. Reference for exporting data
  Q.1 Specifying the way to save data
  Q.2 Spark export formats
  Q.3 Options for the main formats
    Q.3.1 Exporting as CSV
    Q.3.2 Exporting as JSON
    Q.3.3 Exporting as Parquet
    Q.3.4 Exporting as ORC
    Q.3.5 Exporting as XML
    Q.3.6 Exporting as text
  Q.4 Exporting data to datastores
    Q.4.1 Exporting data to a database via JDBC
    Q.4.2 Exporting data to Elasticsearch
    Q.4.3 Exporting data to Delta Lake
Appendix R. Finding help when you’re stuck
  R.1 Small annoyances here and there
    R.1.1 Service sparkDriver failed after 16 retries . . .
    R.1.2 Requirement failed
    R.1.3 Class cast exception
    R.1.4 Corrupt record in ingestion
    R.1.5 Cannot find winutils.exe
  R.2 Help in the outside world
    R.2.1 User mailing list
    R.2.2 Stack Overflow
index
  Numerics A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Content preview from Spark in Action, Second Edition

7. Ingestion from files

This chapter covers

Common behaviors of parsers
Ingesting from CSV, JSON, XML, and text files
Understanding the difference between one-line and multiline JSON records
Understanding the need for big data-specific file formats
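
To make the one-line versus multiline JSON distinction in the list above more concrete, here is a minimal sketch in plain Python using only the standard library. It is not Spark's API and not code from the book. The sample records, the constant names, and the two helper functions are made up for illustration.

```python
import json

# One-line JSON (often called JSON Lines): every line is a complete record.
# The records below are invented sample data.
ONE_LINE = '{"id": 1, "city": "Durham"}\n{"id": 2, "city": "Raleigh"}\n'

# Multiline JSON: a single document (here an array) spread over several lines.
MULTILINE = """[
  {
    "id": 1,
    "city": "Durham"
  },
  {
    "id": 2,
    "city": "Raleigh"
  }
]"""


def parse_one_line(text):
    """Parse each non-empty line as its own JSON record (hypothetical helper)."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_multiline(text):
    """Parse the whole text as one JSON document (hypothetical helper)."""
    doc = json.loads(text)
    return doc if isinstance(doc, list) else [doc]


records_a = parse_one_line(ONE_LINE)
records_b = parse_multiline(MULTILINE)
print(records_a == records_b)  # both layouts carry the same two records

# Reading a multiline document line by line fails, because a single line
# such as "[" is not valid JSON on its own.
try:
    parse_one_line(MULTILINE)
except json.JSONDecodeError as err:
    print("line-by-line parsing failed:", err.msg)
```

The takeaway from this sketch is that a reader has to know which layout it is facing: a line-oriented parser can treat each line independently, while a multiline document must be read as a whole before any record is available.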

Ingestion is the first step of your big data pipeline. You will have to onboard the data in your instance of Spark, whether it is in local mode or cluster mode. As you know by now, data in Spark is transient, meaning that when you shut down Spark, it’s all gone. You will learn how to import data from standard files including CSV, JSON, XML, and text.

In this chapter, after learning about common behaviors among various parsers, you’ll use made-up datasets to illustrate specific cases, as well as ...
