Spark in Action, Second Edition

Book description

The Spark distributed data processing platform provides an easy-to-implement tool for ingesting, streaming, and processing data from any source. In Spark in Action, Second Edition, you’ll learn to take advantage of Spark’s core features and incredible processing speed, with applications including real-time computation, delayed evaluation, and machine learning. Spark skills are a hot commodity in enterprises worldwide, and with Spark’s powerful and flexible Java APIs, you can reap all the benefits without first learning Scala or Hadoop.

About the Technology
Analyzing enterprise data starts by reading, filtering, and merging files and streams from many sources. The Spark data processing engine handles this varied volume like a champ, delivering speeds 100 times faster than Hadoop systems. Thanks to SQL support, an intuitive interface, and a straightforward multilanguage API, you can use Spark without learning a complex new ecosystem.

About the Book
Spark in Action, Second Edition, teaches you to create end-to-end analytics applications. In this entirely new book, you’ll learn from interesting Java-based examples, including a complete data pipeline for processing NASA satellite data. And you’ll discover Java, Python, and Scala code samples hosted on GitHub that you can explore and adapt, plus appendixes that give you a cheat sheet for installing tools and understanding Spark-specific terms.

What's Inside
  • Writing Spark applications in Java
  • Spark application architecture
  • Ingestion through files, databases, streaming, and Elasticsearch
  • Querying distributed datasets with Spark SQL
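
The topics above come together in short Java programs. Below is a minimal sketch, not taken from the book, of that style of application: it ingests a CSV file into a dataframe and queries it with Spark SQL. The file path, view name, and column names are illustrative assumptions.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    // A minimal, hypothetical example: ingest a CSV, expose it to Spark SQL, query it.
    public class CsvToSqlApp {
      public static void main(String[] args) {
        // Create a local Spark session; local[*] uses all available cores
        SparkSession spark = SparkSession.builder()
            .appName("CSV to Spark SQL sketch")
            .master("local[*]")
            .getOrCreate();

        // Ingest a CSV file into a dataframe (Dataset<Row>), inferring the schema
        Dataset<Row> df = spark.read()
            .format("csv")
            .option("header", "true")
            .option("inferSchema", "true")
            .load("data/restaurants.csv");   // hypothetical input file

        // Register the dataframe as a view and query it with Spark SQL
        df.createOrReplaceTempView("restaurants");
        Dataset<Row> topCounties = spark.sql(
            "SELECT county, COUNT(*) AS restaurant_count "
            + "FROM restaurants GROUP BY county ORDER BY restaurant_count DESC");

        topCounties.show(5);
        spark.stop();
      }
    }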


About the Reader
This book does not assume previous experience with Spark, Scala, or Hadoop.

About the Author
Jean-Georges Perrin is an experienced data and software architect. He is France’s first IBM Champion and has been honored for 12 consecutive years.

Quotes
This book reveals the tools and secrets you need to drive innovation in your company or community.
- Rob Thomas, IBM

An indispensable, well-paced, and in-depth guide. A must-have for anyone into big data and real-time stream processing.
- Anupam Sengupta, GuardHat Inc.

This book will help spark a love affair with distributed processing.
- Conor Redmond, InComm Product Control

Currently the best book on the subject!
- Markus Breuer, Materna IPS

Table of contents

  1. Copyright
  2. brief contents
  3. contents
  4. front matter
    1. foreword
      1. The analytics operating system
    2. preface
    3. acknowledgments
    4. about this book
      1. Who should read this book
      2. What will you learn in this book?
      3. How this book is organized
      4. About the code
      5. liveBook discussion forum
    5. about the author
    6. about the cover illustration
  5. Part 1. The theory crippled by awesome examples
  6. 1. So, what is Spark, anyway?
    1. 1.1 The big picture: What Spark is and what it does
      1. 1.1.1 What is Spark?
      2. 1.1.2 The four pillars of mana
    2. 1.2 How can you use Spark?
      1. 1.2.1 Spark in a data processing/engineering scenario
      2. 1.2.2 Spark in a data science scenario
    3. 1.3 What can you do with Spark?
      1. 1.3.1 Spark predicts restaurant quality at NC eateries
      2. 1.3.2 Spark allows fast data transfer for Lumeris
      3. 1.3.3 Spark analyzes equipment logs for CERN
      4. 1.3.4 Other use cases
    4. 1.4 Why you will love the dataframe
      1. 1.4.1 The dataframe from a Java perspective
      2. 1.4.2 The dataframe from an RDBMS perspective
      3. 1.4.3 A graphical representation of the dataframe
    5. 1.5 Your first example
      1. 1.5.1 Recommended software
      2. 1.5.2 Downloading the code
      3. 1.5.3 Running your first application
        1. Command line
        2. Eclipse
      4. 1.5.4 Your first code
    6. Summary
  7. 2. Architecture and flow
    1. 2.1 Building your mental model
    2. 2.2 Using Java code to build your mental model
    3. 2.3 Walking through your application
      1. 2.3.1 Connecting to a master
      2. 2.3.2 Loading, or ingesting, the CSV file
      3. 2.3.3 Transforming your data
      4. 2.3.4 Saving the work done in your dataframe to a database
    4. Summary
  8. 3. The majestic role of the dataframe
    1. 3.1 The essential role of the dataframe in Spark
      1. 3.1.1 Organization of a dataframe
      2. 3.1.2 Immutability is not a swear word
    2. 3.2 Using dataframes through examples
      1. 3.2.1 A dataframe after a simple CSV ingestion
      2. 3.2.2 Data is stored in partitions
      3. 3.2.3 Digging in the schema
      4. 3.2.4 A dataframe after a JSON ingestion
      5. 3.2.5 Combining two dataframes
    3. 3.3 The dataframe is a Dataset<Row>
      1. 3.3.1 Reusing your POJOs
      2. 3.3.2 Creating a dataset of strings
      3. 3.3.3 Converting back and forth
        1. Create the dataset
        2. Create the dataframe
    4. 3.4 Dataframe’s ancestor: the RDD
    5. Summary
  9. 4. Fundamentally lazy
    1. 4.1 A real-life example of efficient laziness
    2. 4.2 A Spark example of efficient laziness
      1. 4.2.1 Looking at the results of transformations and actions
      2. 4.2.2 The transformation process, step by step
      3. 4.2.3 The code behind the transformation/action process
      4. 4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms
      5. The mystery behind the timing of actions
    3. 4.3 Comparing to RDBMS and traditional applications
      1. 4.3.1 Working with the teen birth rates dataset
      2. 4.3.2 Analyzing differences between a traditional app and a Spark app
    4. 4.4 Spark is amazing for data-focused applications
    5. 4.5 Catalyst is your app catalyzer
    6. Summary
  10. 5. Building a simple app for deployment
    1. 5.1 An ingestionless example
      1. 5.1.1 Calculating π
      2. 5.1.2 The code to approximate π
      3. 5.1.3 What are lambda functions in Java?
      4. 5.1.4 Approximating π by using lambda functions
    2. 5.2 Interacting with Spark
      1. 5.2.1 Local mode
      2. 5.2.2 Cluster mode
        1. Submitting a job to Spark
        2. Setting the cluster’s master in your application
      3. 5.2.3 Interactive mode in Scala and Python
        1. Scala shell
        2. Python shell
    3. Summary
  11. 6. Deploying your simple app
    1. 6.1 Beyond the example: The role of the components
      1. 6.1.1 Quick overview of the components and their interactions
      2. 6.1.2 Troubleshooting tips for the Spark architecture
      3. 6.1.3 Going further
    2. 6.2 Building a cluster
      1. 6.2.1 Building a cluster that works for you
      2. 6.2.2 Setting up the environment
    3. 6.3 Building your application to run on the cluster
      1. 6.3.1 Building your application’s uber JAR
      2. 6.3.2 Building your application by using Git and Maven
    4. 6.4 Running your application on the cluster
      1. 6.4.1 Submitting the uber JAR
      2. 6.4.2 Running the application
      3. 6.4.3 The Spark user interface
    5. Summary
  12. Part 2. Ingestion
  13. 7. Ingestion from files
    1. 7.1 Common behaviors of parsers
    2. 7.2 Complex ingestion from CSV
      1. 7.2.1 Desired output
      2. 7.2.2 Code
    3. 7.3 Ingesting a CSV with a known schema
      1. 7.3.1 Desired output
      2. 7.3.2 Code
    4. 7.4 Ingesting a JSON file
      1. 7.4.1 Desired output
      2. 7.4.2 Code
    5. 7.5 Ingesting a multiline JSON file
      1. 7.5.1 Desired output
      2. 7.5.2 Code
    6. 7.6 Ingesting an XML file
      1. 7.6.1 Desired output
      2. 7.6.2 Code
    7. 7.7 Ingesting a text file
      1. 7.7.1 Desired output
      2. 7.7.2 Code
    8. 7.8 File formats for big data
      1. 7.8.1 The problem with traditional file formats
      2. 7.8.2 Avro is a schema-based serialization format
      3. 7.8.3 ORC is a columnar storage format
      4. 7.8.4 Parquet is also a columnar storage format
      5. 7.8.5 Comparing Avro, ORC, and Parquet
    9. 7.9 Ingesting Avro, ORC, and Parquet files
      1. 7.9.1 Ingesting Avro
      2. 7.9.2 Ingesting ORC
      3. 7.9.3 Ingesting Parquet
      4. 7.9.4 Reference table for ingesting Avro, ORC, or Parquet
    10. Summary
  14. 8. Ingestion from databases
    1. 8.1 Ingestion from relational databases
      1. 8.1.1 Database connection checklist
      2. 8.1.2 Understanding the data used in the examples
      3. 8.1.3 Desired output
      4. 8.1.4 Code
      5. 8.1.5 Alternative code
    2. 8.2 The role of the dialect
      1. 8.2.1 What is a dialect, anyway?
      2. 8.2.2 JDBC dialects provided with Spark
      3. 8.2.3 Building your own dialect
    3. 8.3 Advanced queries and ingestion
      1. 8.3.1 Filtering by using a WHERE clause
      2. 8.3.2 Joining data in the database
      3. 8.3.3 Performing ingestion and partitioning
      4. 8.3.4 Summary of advanced features
    4. 8.4 Ingestion from Elasticsearch
      1. 8.4.1 Data flow
      2. 8.4.2 The New York restaurants dataset digested by Spark
      3. 8.4.3 Code to ingest the restaurant dataset from Elasticsearch
    5. Summary
  15. 9. Advanced ingestion: Finding data sources and building your own
    1. 9.1 What is a data source?
    2. 9.2 Benefits of a direct connection to a data source
      1. 9.2.1 Temporary files
      2. 9.2.2 Data quality scripts
      3. 9.2.3 Data on demand
    3. 9.3 Finding data sources at Spark Packages
    4. 9.4 Building your own data source
      1. 9.4.1 Scope of the example project
      2. 9.4.2 Your data source API and options
    5. 9.5 Behind the scenes: Building the data source itself
    6. 9.6 Using the register file and the advertiser class
    7. 9.7 Understanding the relationship between the data and schema
      1. 9.7.1 The data source builds the relation
      2. 9.7.2 Inside the relation
    8. 9.8 Building the schema from a JavaBean
    9. 9.9 Building the dataframe is magic with the utilities
    10. 9.10 The other classes
    11. Summary
  16. 10. Ingestion through structured streaming
    1. 10.1 What’s streaming?
    2. 10.2 Creating your first stream
      1. 10.2.1 Generating a file stream
      2. 10.2.2 Consuming the records
      3. 10.2.3 Getting records, not lines
    3. 10.3 Ingesting data from network streams
    4. 10.4 Dealing with multiple streams
    5. 10.5 Differentiating discretized and structured streaming
    6. Summary
  17. Part 3. Transforming your data
  18. 11. Working with SQL
    1. 11.1 Working with Spark SQL
    2. 11.2 The difference between local and global views
    3. 11.3 Mixing the dataframe API and Spark SQL
    4. 11.4 Don’t DELETE it!
    5. 11.5 Going further with SQL
    6. Summary
  19. 12. Transforming your data
    1. 12.1 What is data transformation?
    2. 12.2 Process and example of record-level transformation
      1. 12.2.1 Data discovery to understand the complexity
      2. 12.2.2 Data mapping to draw the process
      3. 12.2.3 Writing the transformation code
      4. 12.2.4 Reviewing your data transformation to ensure a quality process
      5. What about sorting?
      6. Wrapping up your first Spark transformation
    3. 12.3 Joining datasets
      1. 12.3.1 A closer look at the datasets to join
      2. 12.3.2 Building the list of higher education institutions per county
        1. Initialization of Spark
        2. Loading and preparing the data
      3. 12.3.3 Performing the joins
        1. Joining the FIPS county identifier with the higher ed dataset using a join
        2. Joining the census data to get the county name
    4. 12.4 Performing more transformations
    5. Summary
  20. 13. Transforming entire documents
    1. 13.1 Transforming entire documents and their structure
      1. 13.1.1 Flattening your JSON document
      2. 13.1.2 Building nested documents for transfer and storage
    2. 13.2 The magic behind static functions
    3. 13.3 Performing more transformations
    4. Summary
  21. 14. Extending transformations with user-defined functions
    1. 14.1 Extending Apache Spark
    2. 14.2 Registering and calling a UDF
      1. 14.2.1 Registering the UDF with Spark
      2. 14.2.2 Using the UDF with the dataframe API
      3. 14.2.3 Manipulating UDFs with SQL
      4. 14.2.4 Implementing the UDF
      5. 14.2.5 Writing the service itself
    3. 14.3 Using UDFs to ensure a high level of data quality
    4. 14.4 Considering UDFs’ constraints
    5. Summary
  22. 15. Aggregating your data
    1. 15.1 Aggregating data with Spark
      1. 15.1.1 A quick reminder on aggregations
      2. 15.1.2 Performing basic aggregations with Spark
        1. Performing an aggregation using the dataframe API
        2. Performing an aggregation using Spark SQL
    2. 15.2 Performing aggregations with live data
      1. 15.2.1 Preparing your dataset
      2. 15.2.2 Aggregating data to better understand the schools
        1. What is the average enrollment for each school?
        2. What is the evolution of the number of students?
        3. What is the higher enrollment per school and year?
        4. What is the minimal absenteeism per school?
        5. Which are the five schools with the least and most absenteeism?
    3. 15.3 Building custom aggregations with UDAFs
    4. Summary
  23. Part 4. Going further
  24. 16. Cache and checkpoint: Enhancing Spark’s performances
    1. 16.1 Caching and checkpointing can increase performance
      1. 16.1.1 The usefulness of Spark caching
      2. 16.1.2 The subtle effectiveness of Spark checkpointing
      3. 16.1.3 Using caching and checkpointing
    2. 16.2 Caching in action
    3. 16.3 Going further in performance optimization
    4. Summary
  25. 17. Exporting data and building full data pipelines
    1. 17.1 Exporting data
      1. 17.1.1 Building a pipeline with NASA datasets
      2. 17.1.2 Transforming columns to datetime
      3. 17.1.3 Transforming the confidence percentage to confidence level
      4. 17.1.4 Exporting the data
      5. 17.1.5 Exporting the data: What really happened?
    2. 17.2 Delta Lake: Enjoying a database close to your system
      1. 17.2.1 Understanding why a database is needed
      2. 17.2.2 Using Delta Lake in your data pipeline
      3. 17.2.3 Consuming data from Delta Lake
        1. Number of meetings per department
        2. Number of meetings per type of organizer
    3. 17.3 Accessing cloud storage services from Spark
      1. Amazon S3
      2. Google Cloud Storage
      3. IBM COS
      4. Microsoft Azure Blob Storage
      5. OVH Object Storage
    4. Summary
  26. 18. Exploring deployment constraints: Understanding the ecosystem
    1. 18.1 Managing resources with YARN, Mesos, and Kubernetes
      1. 18.1.1 The built-in standalone mode manages resources
      2. 18.1.2 YARN manages resources in a Hadoop environment
      3. 18.1.3 Mesos is a standalone resource manager
      4. 18.1.4 Kubernetes orchestrates containers
      5. 18.1.5 Choosing the right resource manager
    2. 18.2 Sharing files with Spark
      1. 18.2.1 Accessing the data contained in files
      2. 18.2.2 Sharing files through distributed filesystems
      3. 18.2.3 Accessing files on shared drives or file server
      4. 18.2.4 Using file-sharing services to distribute files
      5. 18.2.5 Other options for accessing files in Spark
      6. 18.2.6 Hybrid solution for sharing files with Spark
    3. 18.3 Making sure your Spark application is secure
      1. 18.3.1 Securing the network components of your infrastructure
      2. 18.3.2 Securing Spark’s disk usage
    4. Summary
  27. Appendixes
  28. Appendix A. Installing Eclipse
    1. A.1 Eclipse
    2. A.2 Running Eclipse for the first time
  29. Appendix B. Installing Maven
    1. B.1 Installation on Windows
    2. B.2 Installation on macOS
  30. Appendix C. Installing Git
    1. C.1 Installing Git on Windows
    2. C.2 Installing Git on macOS
    3. C.3 Installing Git on Ubuntu
      1. $ sudo apt install git
    4. C.4 Installing Git on RHEL / Amazon EMR
      1. $ sudo yum install -y git
    5. C.5 Other tools to consider
  31. Appendix D. Downloading the code and getting started with Eclipse
    1. D.1 Downloading the source code from the command line
    2. D.2 Getting started in Eclipse
  32. Appendix E. A history of enterprise data
    1. E.1 The enterprise problem
    2. E.2 The solution is--hmmm, was--the data warehouse
    3. E.3 The ephemeral data lake
    4. E.4 Lightning-fast cluster computing
    5. E.5 Java rules, but we’re okay with Python
  33. Appendix F. Getting help with relational databases
    1. F.1 IBM Informix
      1. F.1.1 Installing Informix on macOS
      2. F.1.2 Installing Informix on Windows
    2. F.2 MariaDB
      1. F.2.1 Installing MariaDB on macOS
      2. F.2.2 Installing MariaDB on Windows
    3. F.3 MySQL (Oracle)
      1. F.3.1 Installing MySQL on macOS
      2. F.3.2 Installing MySQL on Windows
      3. F.3.3 Loading the Sakila database
    4. F.4 PostgreSQL
      1. F.4.1 Installing PostgreSQL on macOS and Windows
      2. F.4.2 Installing PostgreSQL on Linux
      3. F.4.3 GUI clients for PostgreSQL
  34. Appendix G. Static functions ease your transformations
    1. G.1 Functions per category
      1. G.1.1 Popular functions
      2. G.1.2 Aggregate functions
      3. G.1.3 Arithmetical functions
      4. G.1.4 Array manipulation functions
      5. G.1.5 Binary operations
      6. G.1.6 Byte functions
      7. G.1.7 Comparison functions
      8. G.1.8 Compute function
      9. G.1.9 Conditional operations
      10. G.1.10 Conversion functions
      11. G.1.11 Data shape functions
      12. G.1.12 Date and time functions
      13. G.1.13 Digest functions
      14. G.1.14 Encoding functions
      15. G.1.15 Formatting functions
      16. G.1.16 JSON functions
      17. G.1.17 List functions
      18. G.1.18 Map functions
      19. G.1.19 Mathematical functions
      20. G.1.20 Navigation functions
      21. G.1.21 Parsing functions
      22. G.1.22 Partition functions
      23. G.1.23 Rounding functions
      24. G.1.24 Sorting functions
      25. G.1.25 Statistical functions
      26. G.1.26 Streaming functions
      27. G.1.27 String functions
      28. G.1.28 Technical functions
      29. G.1.29 Trigonometry functions
      30. G.1.30 UDF helpers
      31. G.1.31 Validation functions
      32. G.1.32 Deprecated functions
    2. G.2 Function appearance per version of Spark
      1. G.2.1 Functions in Spark v3.0.0
      2. G.2.2 Functions in Spark v2.4.0
      3. G.2.3 Functions in Spark v2.3.0
      4. G.2.4 Functions in Spark v2.2.0
      5. G.2.5 Functions in Spark v2.1.0
      6. G.2.6 Functions in Spark v2.0.0
      7. G.2.7 Functions in Spark v1.6.0
      8. G.2.8 Functions in Spark v1.5.0
      9. G.2.9 Functions in Spark v1.4.0
      10. G.2.10 Functions in Spark v1.3.0
  35. Appendix H. Maven quick cheat sheet
    1. H.1 Source of packages
    2. H.2 Useful commands
    3. H.3 Typical Maven life cycle
    4. H.4 Useful configuration
      1. H.4.1 Built-in properties
      2. H.4.2 Building an uber JAR
      3. H.4.3 Including the source code
      4. H.4.4 Executing from Maven
  36. Appendix I. Reference for transformations and actions
    1. I.1 Transformations
    2. I.2 Actions
  37. Appendix J. Enough Scala
    1. J.1 What is Scala
    2. J.2 Scala to Java conversion
      1. J.2.1 General conversions
      2. J.2.2 Maps: Conversion from Scala to Java
  38. Appendix K. Installing Spark in production and a few tips
    1. K.1 Installation
      1. K.1.1 Installing Spark on Windows
      2. K.1.2 Installing Spark on macOS
      3. K.1.3 Installing Spark on Ubuntu
      4. K.1.4 Installing Spark on AWS EMR
    2. K.2 Understanding the installation
    3. K.3 Configuration
      1. K.3.1 Properties syntax
      2. K.3.2 Application configuration
      3. K.3.3 Runtime configuration
      4. K.3.4 Other configuration points
  39. Appendix L. Reference for ingestion
    1. L.1 Spark datatypes
    2. L.2 Options for CSV ingestion
    3. L.3 Options for JSON ingestion
    4. L.4 Options for XML ingestion
    5. L.5 Methods for building a full dialect
    6. L.6 Options for ingesting and writing data from/to a database
    7. L.7 Options for ingesting and writing data from/to Elasticsearch
  40. Appendix M. Reference for joins
    1. M.1 Setting up the decorum
    2. M.2 Performing an inner join
    3. M.3 Performing an outer join
    4. M.4 Performing a left, or left-outer, join
    5. M.5 Performing a right, or right-outer, join
    6. M.6 Performing a left-semi join
    7. M.7 Performing a left-anti join
    8. M.9 Performing a cross-join
  41. Appendix N. Installing Elasticsearch and sample data
    1. N.1 Installing the software
      1. N.1.1 All platforms
      2. N.1.2 macOS with Homebrew
    2. N.2 Installing the NYC restaurant dataset
    3. N.3 Understanding Elasticsearch terminology
    4. N.4 Working with useful commands
      1. N.4.1 Get the server status
      2. N.4.2 Display the structure
      3. N.4.3 Count documents
  42. Appendix O. Generating streaming data
    1. O.1 Need for generating streaming data
    2. O.2 A simple stream
    3. O.3 Joined data
    4. O.4 Types of fields
  43. Appendix P. Reference for streaming
    1. P.1 Output mode
    2. P.2 Sinks
    3. P.3 Sinks, output modes, and options
    4. P.4 Examples of using the various sinks
      1. P.4.1 Output in a file
      2. P.4.2 Output to a Kafka topic
      3. P.4.3 Processing streamed records through foreach
      4. P.4.4 Output in memory and processing from memory
  44. Appendix Q. Reference for exporting data
    1. Q.1 Specifying the way to save data
    2. Q.2 Spark export formats
    3. Q.3 Options for the main formats
      1. Q.3.1 Exporting as CSV
      2. Q.3.2 Exporting as JSON
      3. Q.3.3 Exporting as Parquet
      4. Q.3.4 Exporting as ORC
      5. Q.3.5 Exporting as XML
      6. Q.3.6 Exporting as text
    4. Q.4 Exporting data to datastores
      1. Q.4.1 Exporting data to a database via JDBC
      2. Q.4.2 Exporting data to Elasticsearch
      3. Q.4.3 Exporting data to Delta Lake
  45. Appendix R. Finding help when you’re stuck
    1. R.1 Small annoyances here and there
      1. R.1.1 Service sparkDriver failed after 16 retries . . .
      2. R.1.2 Requirement failed
      3. R.1.3 Class cast exception
      4. R.1.4 Corrupt record in ingestion
      5. R.1.5 Cannot find winutils.exe
    2. R.2 Help in the outside world
      1. R.2.1 User mailing list
      2. R.2.2 Stack Overflow
  46. index
    1. Numerics
    2. A
    3. B
    4. C
    5. D
    6. E
    7. F
    8. G
    9. H
    10. I
    11. J
    12. K
    13. L
    14. M
    15. N
    16. O
    17. P
    18. Q
    19. R
    20. S
    21. T
    22. U
    23. V
    24. W
    25. X
    26. Y
    27. Z

Product information

  • Title: Spark in Action, Second Edition
  • Author(s): Jean-Georges Perrin
  • Release date: June 2020
  • Publisher(s): Manning Publications
  • ISBN: 9781617295522