7 Processing data
This chapter covers
- Accessing large amounts of cloud-based data quickly
- Using Apache Arrow for efficient, in-memory data processing
- Leveraging SQL-based query engines to preprocess data for workflows
- Encoding features for models at scale
The past five chapters covered how to take data science projects from prototype to production. We have learned how to build workflows, use them to run computationally demanding tasks in the cloud, and deploy the workflows to a production scheduler. Now that we have a crisp idea of the prototyping loop and of how prototypes interact with production deployments, we can return to the fundamental question: how should the workflows consume and produce data?
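As a small preview of the Apache Arrow approach covered later in the chapter, the following sketch reads a Parquet file into an in-memory Arrow table and computes a simple aggregate with Arrow's compute kernels, without converting the whole dataset to a dataframe first. The file name transactions.parquet and the column amount are hypothetical placeholders for illustration, not examples from the book.

import pyarrow.parquet as pq
import pyarrow.compute as pc

# Load a (hypothetical) Parquet file into an in-memory Arrow table.
# Arrow keeps the data in a columnar layout, so individual columns
# can be scanned without materializing a full dataframe.
table = pq.read_table("transactions.parquet")
print(table.num_rows, table.column_names)

# Aggregate a single column directly with an Arrow compute kernel.
# The column name "amount" is a placeholder.
total = pc.sum(table["amount"])
print(total.as_py())

The same table can be handed to a SQL query engine or converted to pandas on demand, which is the pattern this chapter builds toward.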
Interfacing with data is a key concern of all data science ...