Book description
Apache Spark is an in-memory framework that allows data scientists to explore and interact with big data much more quickly than with Hadoop. Python users can work with Spark using an interactive shell called PySpark.
Why is it important?
PySpark makes the large-scale data processing capabilities of Apache Spark accessible to data scientists who are more familiar with Python than Scala or Java. This also allows for reuse of a wide variety of Python libraries for machine learning, data visualization, numerical analysis, etc.
What you'll learn—and how you can apply it
- Compare the different components provided by Spark and the use cases they fit
- Learn how to use RDDs (resilient distributed datasets) with PySpark
- Write Spark applications in Python and submit them to the cluster as Spark jobs
- Get an introduction to the Spark computing framework
- Apply this approach to a worked example to determine the most frequent airline delays in a specific month and year
This lesson is for you because…
- You're a data scientist, familiar with Python coding, who needs to get up and running with PySpark
- You're a Python developer who needs to leverage the distributed computing resources available on a Hadoop cluster, without learning Java or Scala first
Prerequisites
- Familiarity with writing Python applications
- Some familiarity with bash command-line operations
- Basic understanding of how to use simple functional programming constructs in Python, such as closures, lambdas, maps, etc.
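As a rough self-check for the last prerequisite, the sketch below uses a closure, a lambda, and map/filter/reduce in plain Python (no Spark required). It is not taken from the lesson; the function name `make_delay_filter` and the delay values are made up for illustration.

```python
from functools import reduce


def make_delay_filter(threshold):
    """Closure: the returned function remembers `threshold`."""
    def is_delayed(minutes):
        return minutes > threshold
    return is_delayed


# Hypothetical delay values in minutes (illustrative only).
delays = [5, 42, 0, 17, 63]

# Build a predicate that treats anything over 15 minutes as a delay.
is_delayed = make_delay_filter(15)

# filter keeps the elements for which the predicate returns True.
long_delays = list(filter(is_delayed, delays))

# map applies a lambda to every element (minutes converted to hours).
delays_in_hours = list(map(lambda m: m / 60, long_delays))

# reduce folds the list into a single value (total delayed minutes).
total_minutes = reduce(lambda a, b: a + b, long_delays)

print(long_delays, total_minutes)
```

If each step here reads naturally, the functional style used when working with RDDs should feel familiar.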
Materials or downloads needed in advance
This lesson is taken from Data Analytics with Hadoop by Jenny Kim and Benjamin Bengfort.
Product information
- Title: Interactive Spark using PySpark
- Author(s):
- Release date: August 2016
- Publisher(s): O'Reilly Media, Inc.
- ISBN: 9781491966181