Chapter 1. An Overview of Ray
One of the reasons we need efficient distributed computing is that we’re collecting ever more data with great variety at increasing speeds. The storage systems, data processing, and analytics engines that have emerged in the past decade are crucial to the success of many companies. Interestingly, most “big data” technologies are built for and operated by (data) engineers who are in charge of data collection and processing tasks. The rationale is to free up data scientists to do what they’re best at. As a data science practitioner, you might want to focus on training complex machine learning models, running efficient hyperparameter selection, building entirely new and custom models or simulations, or serving your models to showcase them.
At the same time, it might be inevitable to scale these workloads to a compute cluster. To do that, the distributed system of your choice needs to support all of these fine-grained “big compute” tasks, potentially on specialized hardware. Ideally, it also fits into the big data tool chain you’re using and is fast enough to meet your latency requirements. In other words, distributed computing has to be powerful and flexible enough for complex data science workloads—and Ray can help you with that.
Python is likely the most popular language for data science today; it’s certainly the one we find the most useful for our daily work. Python is now more than 30 years old, but it still has a growing and active community. The ...
Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Read now
Unlock full access