9 Big data is just a lot of small data: Using pandas UDFs
This chapter covers
- Using pandas Series UDFs to accelerate column transformation compared to Python UDFs
- Addressing the cold start of some UDFs using the Iterator of Series UDF
- Controlling batch composition in a split-apply-combine programming pattern
- Confidently making a decision about the best pandas UDF to use
This chapter approaches the distributed nature of PySpark a little differently. If we take a few seconds to think about it, we read data into a data frame, and Spark distributes the data across partitions on nodes. What if we could directly operate on those partitions as if they were single-node data frames? More interestingly, what if we control how those single-node partitions ...
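To make the idea of treating each chunk of a distributed column as single-node data more concrete, here is a minimal sketch of a pandas Series UDF. The temperature conversion, the column names `temp_f` and `temp_c`, and the helper `run_on_spark` are hypothetical, chosen only for illustration; the sketch assumes PySpark 3.x with pandas and PyArrow installed. The function body is ordinary pandas code that receives a whole batch of values as a `pd.Series`, and the type hints tell Spark to treat it as a Series to Series UDF.

```python
import pandas as pd


def f_to_c(degrees: pd.Series) -> pd.Series:
    """Convert a batch of Fahrenheit values to Celsius in one vectorized step."""
    return (degrees - 32) * 5 / 9


def run_on_spark() -> None:
    """Hypothetical helper: apply f_to_c to a tiny Spark data frame."""
    # PySpark is imported here so the pandas function above can be
    # exercised on its own, without a Spark installation.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])

    # Wrap the plain pandas function; Spark feeds it batches of the column
    # as pandas Series and reassembles the returned Series into a column.
    f_to_c_udf = F.pandas_udf(f_to_c, T.DoubleType())
    df.withColumn("temp_c", f_to_c_udf(F.col("temp_f"))).show()
```

Because `f_to_c` is plain pandas, it can be checked locally on a small `pd.Series` before it is handed to Spark; calling `run_on_spark()` then applies the same logic batch by batch across the distributed column.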