7 Processing data
This chapter covers
- Accessing large amounts of cloud-based data quickly
- Using Apache Arrow for efficient, in-memory data processing
- Leveraging SQL-based query engines to preprocess data for workflows
- Encoding features for models at scale
The past five chapters covered how to take data science projects from prototype to production. We have learned how to build workflows, use them to run computationally demanding tasks in the cloud, and deploy the workflows to a production scheduler. Now that we have a crisp idea of the prototyping loop and of how prototypes interact with production deployments, we can return to the fundamental question: how should the workflows consume and produce data?
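As a small preview of the Apache Arrow approach covered later in the chapter, the following sketch reads a Parquet file into an in-memory Arrow table and computes a simple aggregate with Arrow's compute kernels, without converting the whole dataset to a dataframe first. The file name transactions.parquet and the column amount are hypothetical placeholders for illustration, not examples from the book.

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Load a (hypothetical) Parquet file into an in-memory Arrow table.
# Arrow keeps the data in a columnar layout, so individual columns
# can be scanned without materializing a full dataframe.
table = pq.read_table("transactions.parquet")
print(table.num_rows, table.column_names)

# Aggregate a single column directly with an Arrow compute kernel.
# The column name "amount" is a placeholder.
total = pc.sum(table["amount"])
print(total.as_py())

The same table can be handed to a SQL query engine or converted to pandas on demand, which is the pattern this chapter builds toward.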
Interfacing with data is a key concern of all data science ...