Chapter 1. Distributed Computing Is Hard but Necessary
Moore’s Law, the observation that the number of transistors on a microchip doubles roughly every two years, has held well enough to give us the amazing compute power that all of us enjoy every day in our phones, in our laptops, and in our servers. Yet it’s never enough.
Many software systems built today require more CPU cycles and memory than even the largest servers can provide. Even when one server is large enough, the need for high availability means we often use multiple machines, sometimes multiple datacenters, so that our services remain available when failures occur. Also, many jobs (including training machine learning and AI models) can be decomposed into parallel tasks, which greatly reduces the total time required to complete them if we spread the work over a cluster.
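To make the idea of decomposing a job into parallel tasks concrete, here is a minimal sketch using only the Python standard library. It is an illustration under stated assumptions, not the approach this book develops: the function names (`sum_of_squares`, `split`, `parallel_sum_of_squares`) are hypothetical, and the workers are threads on a single machine rather than nodes in a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def sum_of_squares(chunk):
    """One independent task: process a single slice of the data."""
    return sum(x * x for x in chunk)


def split(data, n_chunks):
    """Cut data into at most n_chunks contiguous slices of near-equal size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_sum_of_squares(data, n_workers=4):
    """Decompose the job, run the tasks concurrently, then combine results."""
    chunks = split(data, n_workers)
    # Hypothetical stand-in for a cluster: a local pool of worker threads.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum_of_squares, chunks))
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))
```

The sketch shows only the shape of the work: split, map over independent tasks, combine. On one machine, Python threads will not speed up CPU-bound work like this; the payoff comes when the same tasks are scheduled across many cores or machines.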
However, distributed systems programming has always been hard to do, requiring special expertise. Why is that necessary? Isn’t it possible to provide intuitive abstractions that allow developers to express the computing they need to do, while transparently spreading that work across a cluster of machines?
Why Ray?
Researchers in machine learning (ML) and AI at the University of California, Berkeley, faced this problem. They needed an easy way to run workloads at massive scale, requiring distribution of tasks over clusters, yet none of the available tools were right for the job. They didn’t require fine-grained control over how the work was done. They weren’t ...