Learning Path

Getting Up and Running with Apache Spark 2.X

Time to complete: 3h 22m

Published by Infinite Skills and O'Reilly Media, Inc.

Created July 2018

Big data is no longer a niche curiosity or the exclusive realm of data scientists; it is a mature and ever-expanding sector of the IT industry, with applications ranging from enterprises seeking insights into their customers to the gathering and processing of streaming data generated by devices of all sorts on the Internet of Things (IoT). Apache Spark is the renowned open source framework that has become one of the most popular projects in the Hadoop ecosystem. This powerful distributed computing engine for big data has emerged as a leading tool that data scientists and data engineers use to explore, understand, and transform massive datasets.

In this learning path, which includes text and video components, you’ll begin by learning about Spark’s core architecture, including the DataFrames API, transformations, and actions. You’ll then examine how to write your own Spark applications, applying best practices for deploying Spark machine learning models to production. You’ll even analyze actual stock data using the popular Python language. If you have a solid grasp of the concepts underpinning big data and data processing, this curated collection of content from industry experts, including Spark creator Matei Zaharia, will quickly help you get up and running with Spark, adding a valuable and increasingly sought-after skill set to your personal toolbox.

What you’ll learn—and how you can apply it

  • Core architecture of a Spark application, including the DataFrames API, transformations, and actions
  • How to install Spark in your own environment
  • Key components of Spark 2.0, including unified APIs with SparkSession, Spark MLlib, and Structured Streaming
  • How to analyze real stock data using Python and Spark’s most common structured API, the DataFrame
  • Best practices for building machine learning pipelines in Spark MLlib and deploying machine learning models to production

This learning path is for you because…

  • You're a data scientist, engineer, or analyst, and you want to use Apache Spark to interactively process and query large amounts of data and build statistical models that scale
  • You're a data engineer, developer, or architect looking to use Apache Spark to write maintainable, reproducible production applications


Prerequisites:

  • Some experience in data processing and analysis
  • Basic Python programming experience or advanced knowledge in another programming language such as Java
  • Optional: background in machine learning tools such as Python's scikit-learn

Materials or downloads needed in advance:

  • Latest version of Python