Video description
In Video Editions the narrator reads the book while the content, figures, code listings, diagrams, and text appear on the screen. It's like an audiobook that you can also watch as a video.
Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.
In Data Analysis with Python and PySpark you will learn how to:
- Manage your data as it scales across multiple machines
- Scale up your data programs with full confidence
- Read and write data to and from a variety of sources and formats
- Clean up messy data using PySpark’s data manipulation functionality
- Discover new data sets and perform exploratory data analysis
- Build automated data pipelines that transform, summarize, and get insights from data
- Troubleshoot common PySpark errors
- Create reliable long-running jobs
Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned, and rapidly start implementing PySpark into your data systems. No previous knowledge of Spark is required.
About the Technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
About the Book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
What's Inside
- Organizing your PySpark code
- Managing your data, no matter the size
- Scaling up your data programs with full confidence
- Troubleshooting common data pipeline problems
- Creating reliable long-running jobs
About the Reader
Written for data scientists and data engineers comfortable with Python.
About the Author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches the software to data scientists, engineers, and data-savvy business analysts.
Quotes
A clear and in-depth introduction for truly tackling big data with Python.
- Gustavo Patino, Oakland University William Beaumont School of Medicine
The perfect way to learn how to analyze and master huge datasets.
- Gary Bake, Brambles
Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on.
- Philippe Van Bergen, P² Consulting
For beginner to pro, a well-written book to help understand PySpark.
- Raushan Kumar Jha, Microsoft
Table of contents
- Chapter 1. Introduction
- Chapter 1. Your very own factory: How PySpark works
- Chapter 1. What will you learn in this book?
- Chapter 1. What do I need to get started?
- Chapter 1. Summary
- Part 1. Get acquainted: First steps in PySpark
- Chapter 2. Your first data program in PySpark
- Chapter 2. Mapping our program
- Chapter 2. Ingest and explore: Setting the stage for data transformation
- Chapter 2. Simple column transformations: Moving from a sentence to a list of words
- Chapter 2. Filtering rows
- Chapter 2. Summary
- Chapter 3. Submitting and scaling your first PySpark program
- Chapter 3. Ordering the results on the screen using orderBy
- Chapter 3. Writing data from a data frame
- Chapter 3. Putting it all together: Counting
- Chapter 3. Using spark-submit to launch your program in batch mode
- Chapter 3. What didn’t happen in this chapter
- Chapter 3. Scaling up our word frequency program
- Chapter 3. Summary
- Chapter 4. Analyzing tabular data with pyspark.sql
- Chapter 4. PySpark for analyzing and processing tabular data
- Chapter 4. Reading and assessing delimited data in PySpark
- Chapter 4. The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing
- Chapter 4. Summary
- Chapter 5. Data frame gymnastics: Joining and grouping
- Chapter 5. Summarizing the data via groupby and GroupedData
- Chapter 5. Taking care of null values: Drop and fill
- Chapter 5. What was our question again? Our end-to-end program
- Chapter 5. Summary
- Part 2. Get proficient: Translate your ideas into code
- Chapter 6. Multidimensional data frames: Using PySpark with JSON data
- Chapter 6. Breaking the second dimension with complex data types
- Chapter 6. The struct: Nesting columns within columns
- Chapter 6. Building and using the data frame schema
- Chapter 6. Putting it all together: Reducing duplicate data with complex data types
- Chapter 6. Summary
- Chapter 7. Bilingual PySpark: Blending Python and SQL code
- Chapter 7. Preparing a data frame for SQL
- Chapter 7. SQL and PySpark
- Chapter 7. Using SQL-like syntax within data frame methods
- Chapter 7. Simplifying our code: Blending SQL and Python
- Chapter 7. Conclusion
- Chapter 7. Summary
- Chapter 8. Extending PySpark with Python: RDD and UDFs
- Chapter 8. Using Python to extend PySpark via UDFs
- Chapter 8. Summary
- Chapter 9. Big data is just a lot of small data: Using pandas UDFs
- Chapter 9. UDFs on grouped data: Aggregate and apply
- Chapter 9. What to use, when
- Chapter 9. Summary
- Chapter 10. Your data under a different lens: Window functions
- Chapter 10. Beyond summarizing: Using ranking and analytical functions
- Chapter 10. Flex those windows! Using row and range boundaries
- Chapter 10. Going full circle: Using UDFs within windows
- Chapter 10. Look in the window: The main steps to a successful window function
- Chapter 10. Summary
- Chapter 11. Faster PySpark: Understanding Spark’s query planning
- Chapter 11. Thinking about performance: Operations and memory
- Chapter 11. Summary
- Part 3. Get confident: Using machine learning with PySpark
- Chapter 12. Setting the stage: Preparing features for machine learning
- Chapter 12. Feature creation and refinement
- Chapter 12. Feature preparation with transformers and estimators
- Chapter 12. Summary
- Chapter 13. Robust machine learning with ML Pipelines
- Chapter 13. Building a (complete) machine learning pipeline
- Chapter 13. Evaluating and optimizing our model
- Chapter 13. Getting the biggest drivers from our model: Extracting the coefficients
- Chapter 13. Summary
- Chapter 14. Building custom ML transformers and estimators
- Chapter 14. Creating your own estimator
- Chapter 14. Using our transformer and estimator in an ML pipeline
- Chapter 14. Summary
- Appendix C. Some useful Python concepts
- Appendix C. Packing and unpacking arguments (*args and **kwargs)
- Appendix C. Python’s typing and mypy/pyright
- Appendix C. Python closures and the PySpark transform() method
- Appendix C. Python decorators: Wrapping a function to change its behavior
Product information
- Title: Data Analysis with Python and PySpark, Video Edition
- Author(s): Jonathan Rioux
- Release date: March 2022
- Publisher(s): Manning Publications
- ISBN: None