9 Big data is just a lot of small data: Using pandas UDFs

This chapter covers

Using pandas Series UDFs to accelerate column transformation compared to Python UDFs
Addressing the cold start of some UDFs using Iterator of Series UDFs
Controlling batch composition in a split-apply-combine programming pattern
Confidently making a decision about the best pandas UDF to use

This chapter approaches the distributed nature of PySpark a little differently. If we take a few seconds to think about it, we read data into a data frame, and Spark distributes the data across partitions on nodes. What if we could directly operate on those partitions as if they were single-node data frames? More interestingly, what if we control how those single-node partitions ...
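
To make the idea of "a lot of small data" concrete, here is a minimal plain-Python sketch of the split-apply-combine pattern mentioned above. It deliberately uses no Spark or pandas: the toy rows, the `station` and `temp` fields, and the helper names `split`, `center_temps`, and `split_apply_combine` are all hypothetical and chosen only for illustration. In PySpark, the split and combine steps would be handled by Spark itself (for example through `groupby(...).applyInPandas(...)`), and the function you write would receive a pandas data frame rather than a list of dictionaries.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data standing in for a large, distributed data frame.
rows = [
    {"station": "A", "temp": 10.0},
    {"station": "B", "temp": 20.0},
    {"station": "A", "temp": 14.0},
    {"station": "B", "temp": 26.0},
    {"station": "A", "temp": 12.0},
]


def split(records, key):
    """Split: group records into small batches that share the same key value."""
    batches = defaultdict(list)
    for record in records:
        batches[record[key]].append(record)
    return dict(batches)


def center_temps(batch):
    """Apply: sees one small batch only and subtracts that batch's mean temperature."""
    batch_mean = mean(r["temp"] for r in batch)
    return [{**r, "centered": r["temp"] - batch_mean} for r in batch]


def split_apply_combine(records, key, func):
    """Combine: concatenate the per-batch results back into a single list."""
    combined = []
    for batch in split(records, key).values():
        combined.extend(func(batch))
    return combined


result = split_apply_combine(rows, "station", center_temps)
for row in result:
    print(row)
```

The point of the sketch is that `center_temps` never needs to know about the full data set: it is ordinary single-batch code, and the choice of `key` controls what ends up in each batch.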

Data Analysis with Python and PySpark by Jonathan Rioux
