Data Analysis with Python and PySpark, Video Edition

Video description

In Video Editions the narrator reads the book while the content, figures, code listings, diagrams, and text appear on the screen. Like an audiobook that you can also watch as a video.

Think big about your data! PySpark brings the powerful Spark big data processing engine to the Python ecosystem, letting you seamlessly scale up your data tasks and create lightning-fast pipelines.

In Data Analysis with Python and PySpark you will learn how to:

  • Manage your data as it scales across multiple machines
  • Scale up your data programs with full confidence
  • Read and write data to and from a variety of sources and formats
  • Deal with messy data with PySpark’s data manipulation functionality
  • Discover new data sets and perform exploratory data analysis
  • Build automated data pipelines that transform, summarize, and get insights from data
  • Troubleshoot common PySpark errors
  • Create reliable long-running jobs

Data Analysis with Python and PySpark is your guide to delivering successful Python-driven data projects. Packed with relevant examples and essential techniques, this practical book teaches you to build pipelines for reporting, machine learning, and other data-centric tasks. Quick exercises in every chapter help you practice what you’ve learned and quickly start putting PySpark to work in your own data systems. No previous knowledge of Spark is required.

About the Technology
The Spark data processing engine is an amazing analytics factory: raw data comes in, insight comes out. PySpark wraps Spark’s core engine with a Python-based API. It helps simplify Spark’s steep learning curve and makes this powerful tool available to anyone working in the Python data ecosystem.
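To give a rough sense of what that Python-based API looks like, here is a minimal sketch of a typical first PySpark program: create a SparkSession, read a text file into a data frame, and count word frequencies. The file name is a hypothetical placeholder, and this is an illustrative example rather than code taken from the book.

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    # The SparkSession is the entry point that drives every PySpark program.
    spark = SparkSession.builder.appName("word_count_sketch").getOrCreate()

    # Read a plain-text file into a data frame (the path is hypothetical).
    lines = spark.read.text("sample.txt")

    # Split each line into words, then count how often each word appears.
    words = lines.select(F.explode(F.split(F.col("value"), " ")).alias("word"))
    word_counts = words.groupBy("word").count().orderBy("count", ascending=False)

    word_counts.show(10)

The same program runs unchanged on a laptop or on a cluster; Spark decides how the work is distributed.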

About the Book
Data Analysis with Python and PySpark helps you solve the daily challenges of data science with PySpark. You’ll learn how to scale your processing capabilities across multiple machines while ingesting data from any source—whether that’s Hadoop clusters, cloud data storage, or local data files. Once you’ve covered the fundamentals, you’ll explore the full versatility of PySpark by building machine learning pipelines, and blending Python, pandas, and PySpark code.
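For a sense of what that blending looks like in practice, here is a minimal, hypothetical sketch (the file path and column names are assumptions) that reads a local CSV into a PySpark data frame, aggregates it with Spark, and hands the small summary back to pandas for local inspection:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.appName("ingest_sketch").getOrCreate()

    # Read a delimited file; the same reader accepts HDFS or cloud storage URIs.
    sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

    # Summarize in Spark, where the data may be too large for a single machine...
    summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))

    # ...then bring the aggregated result into pandas for local analysis.
    summary_pd = summary.toPandas()
    print(summary_pd.head())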

What's Inside
  • Organizing your PySpark code
  • Managing your data, no matter the size
  • Scaling up your data programs with full confidence
  • Troubleshooting common data pipeline problems
  • Creating reliable long-running jobs


About the Reader
Written for data scientists and data engineers comfortable with Python.

About the Author
As an ML director for a data-driven software company, Jonathan Rioux uses PySpark daily. He teaches PySpark to data scientists, engineers, and data-savvy business analysts.

Quotes
A clear and in-depth introduction for truly tackling big data with Python.
- Gustavo Patino, Oakland University William Beaumont School of Medicine

The perfect way to learn how to analyze and master huge datasets.
- Gary Bake, Brambles

Covers both basic and more advanced topics of PySpark, with a good balance between theory and hands-on.
- Philippe Van Bergen, P² Consulting

For beginner to pro, a well-written book to help understand PySpark.
- Raushan Kumar Jha, Microsoft

Table of contents

  1. Chapter 1. Introduction
  2. Chapter 1. Your very own factory: How PySpark works
  3. Chapter 1. What will you learn in this book?
  4. Chapter 1. What do I need to get started?
  5. Chapter 1. Summary
  6. Part 1. Get acquainted: First steps in PySpark
  7. Chapter 2. Your first data program in PySpark
  8. Chapter 2. Mapping our program
  9. Chapter 2. Ingest and explore: Setting the stage for data transformation
  10. Chapter 2. Simple column transformations: Moving from a sentence to a list of words
  11. Chapter 2. Filtering rows
  12. Chapter 2. Summary
  13. Chapter 3. Submitting and scaling your first PySpark program
  14. Chapter 3. Ordering the results on the screen using orderBy
  15. Chapter 3. Writing data from a data frame
  16. Chapter 3. Putting it all together: Counting
  17. Chapter 3. Using spark-submit to launch your program in batch mode
  18. Chapter 3. What didn’t happen in this chapter
  19. Chapter 3. Scaling up our word frequency program
  20. Chapter 3. Summary
  21. Chapter 4. Analyzing tabular data with pyspark.sql
  22. Chapter 4. PySpark for analyzing and processing tabular data
  23. Chapter 4. Reading and assessing delimited data in PySpark
  24. Chapter 4. The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing
  25. Chapter 4. Summary
  26. Chapter 5. Data frame gymnastics: Joining and grouping
  27. Chapter 5. Summarizing the data via groupby and GroupedData
  28. Chapter 5. Taking care of null values: Drop and fill
  29. Chapter 5. What was our question again? Our end-to-end program
  30. Chapter 5. Summary
  31. Part 2. Get proficient: Translate your ideas into code
  32. Chapter 6. Multidimensional data frames: Using PySpark with JSON data
  33. Chapter 6. Breaking the second dimension with complex data types
  34. Chapter 6. The struct: Nesting columns within columns
  35. Chapter 6. Building and using the data frame schema
  36. Chapter 6. Putting it all together: Reducing duplicate data with complex data types
  37. Chapter 6. Summary
  38. Chapter 7. Bilingual PySpark: Blending Python and SQL code
  39. Chapter 7. Preparing a data frame for SQL
  40. Chapter 7. SQL and PySpark
  41. Chapter 7. Using SQL-like syntax within data frame methods
  42. Chapter 7. Simplifying our code: Blending SQL and Python
  43. Chapter 7. Conclusion
  44. Chapter 7. Summary
  45. Chapter 8. Extending PySpark with Python: RDD and UDFs
  46. Chapter 8. Using Python to extend PySpark via UDFs
  47. Chapter 8. Summary
  48. Chapter 9. Big data is just a lot of small data: Using pandas UDFs
  49. Chapter 9. UDFs on grouped data: Aggregate and apply
  50. Chapter 9. What to use, when
  51. Chapter 9. Summary
  52. Chapter 10. Your data under a different lens: Window functions
  53. Chapter 10. Beyond summarizing: Using ranking and analytical functions
  54. Chapter 10. Flex those windows! Using row and range boundaries
  55. Chapter 10. Going full circle: Using UDFs within windows
  56. Chapter 10. Look in the window: The main steps to a successful window function
  57. Chapter 10. Summary
  58. Chapter 11. Faster PySpark: Understanding Spark’s query planning
  59. Chapter 11. Thinking about performance: Operations and memory
  60. Chapter 11. Summary
  61. Part 3. Get confident: Using machine learning with PySpark
  62. Chapter 12. Setting the stage: Preparing features for machine learning
  63. Chapter 12. Feature creation and refinement
  64. Chapter 12. Feature preparation with transformers and estimators
  65. Chapter 12. Summary
  66. Chapter 13. Robust machine learning with ML Pipelines
  67. Chapter 13. Building a (complete) machine learning pipeline
  68. Chapter 13. Evaluating and optimizing our model
  69. Chapter 13. Getting the biggest drivers from our model: Extracting the coefficients
  70. Chapter 13. Summary
  71. Chapter 14. Building custom ML transformers and estimators
  72. Chapter 14. Creating your own estimator
  73. Chapter 14. Using our transformer and estimator in an ML pipeline
  74. Chapter 14. Summary
  75. Appendix C. Some useful Python concepts
  76. Appendix C. Packing and unpacking arguments (*args and **kwargs)
  77. Appendix C. Python’s typing and mypy/pyright
  78. Appendix C. Python closures and the PySpark transform() method
  79. Appendix C. Python decorators: Wrapping a function to change its behavior

Product information

  • Title: Data Analysis with Python and PySpark, Video Edition
  • Author(s): Jonathan Rioux
  • Release date: March 2022
  • Publisher(s): Manning Publications