Spark 3.0 First Steps
Get Started with Analytics, ETL, Streaming, Machine Learning, and Graph Compute with Apache Spark
Apache Spark enables large-scale querying and processing of data for reporting, analytics, and ETL; stream processing for real-time applications; and machine learning, all through a single set of abstractions and APIs with no additional integration work.
As Spark matures, mastering its core constructs and best practices is essential to planning and delivering effective solutions. For example, Spark is storage-agnostic: it can process data from many storage systems and in many formats, but each specific choice carries significant performance and design implications. This course is a comprehensive overview of the key use cases for Apache Spark, with a focus on performance and best practices appropriate to version 3.0.
What you'll learn, and how you can apply it
By the end of this live, hands-on, online course, you’ll understand:
- The components of Spark and how they work together
- How Spark handles data-parallel computation, including partitioning and shuffling data
- The latest APIs and new performance-enhancing infrastructure in Spark 3.0
And you’ll be able to:
- Code analytics, streaming, ETL, and ML jobs for Spark
- Use the Spark UI to understand the parallelism and performance of your jobs
- Plan Spark deployments, whether a single cluster for one job or an entire platform
This training course is for you because...
- You are a data engineer, data analyst, or data scientist
- Your company relies on Apache Spark and/or Hadoop for large-scale processing
- You want to focus on the easiest, most performant way to get results from Spark
Useful, but not strictly required:
- Familiarity with the basics of Python, SQL, and either Scala or Java.
- Basics of machine learning.
- Previous exposure to Spark is not necessary.
- For Python basics, review chapters 2-3 of Python for Data Analysis, second edition (book).
- For Scala basics, review the Tour of Scala in the official Scala documentation.
- For machine learning basics, review chapter 1 of Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow, second edition (book). Optionally, look at chapter 2 for a nicely worked example using common (non-Spark) tools.
- No local installation or setup is necessary. During the course, we'll use Databricks Community Edition. This is a free service (no credit card required), but it does require setting up an account. Databricks access is not required, but it is strongly recommended so you get the full benefit of participating in the exercises.
About your instructor
Adam Breindel consults and teaches courses on Apache Spark, data engineering, machine learning, AI, and deep learning. He supports instructional initiatives as a senior instructor at Databricks, has taught classes on Apache Spark and deep learning for O'Reilly, and runs a business helping large firms and startups implement data and ML architectures. Adam's first full-time job in tech was neural net–based fraud detection, deployed at North America's largest banks; since then, he's worked with numerous startups, where he's enjoyed building things like mobile check-in for two of America's five biggest airlines, years before the iPhone came out. He's also worked in entertainment, insurance, and retail banking; on web, embedded, and server apps; and on clustering architectures, APIs, and streaming analytics.
The timeframes are only estimates and may vary according to how the class is progressing.