Chapter 6. Scalability and Distributed Training
The examples we have seen in the previous chapters are relatively simple, toy-scale examples: they fit within the memory and compute constraints of a single machine. Most enterprises have larger datasets and more complex requirements that need to scale beyond one machine. In this chapter, we will look at the architecture and techniques that help enterprises scale Snorkel.
When we think about scaling Snorkel, we are essentially looking to run labeling functions distributed across several machines, typically as part of a cluster.
We will start by exploring Apache Spark as the core technology that allows us to scale Snorkel. Rather than custom-engineering a solution that would have to provide all of the infrastructure and plumbing needed to scale, we can lean on Apache Spark, which handles this out of the box and is already a popular choice for big data and production deployments within enterprises.
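To make this concrete, here is a minimal sketch of what that looks like, assuming a recent Snorkel release (which ships a Spark-backed applier, SparkLFApplier) and PySpark. The labeling function, label values, and sample texts are hypothetical placeholders rather than code from this book; the point is that the labeling function itself stays plain Python, and only the applier and the data container change when we move to a cluster.

    from pyspark import SparkContext
    from pyspark.sql import Row
    from snorkel.labeling import labeling_function
    from snorkel.labeling.apply.spark import SparkLFApplier

    ABSTAIN, FAKE = -1, 1  # hypothetical label scheme for a fake news task

    @labeling_function()
    def lf_clickbait(x):
        # Hypothetical rule: sensational phrasing is weak evidence of fake news.
        return FAKE if "you won't believe" in x.text.lower() else ABSTAIN

    sc = SparkContext.getOrCreate()

    # Each Row is one data point; Spark partitions the RDD across the workers,
    # so the labeling functions run in parallel on every partition.
    rdd = sc.parallelize([
        Row(text="You won't believe what this celebrity said"),
        Row(text="City council approves new transit budget"),
    ])

    applier = SparkLFApplier(lfs=[lf_clickbait])
    L_train = applier.apply(rdd)  # label matrix of shape (n_data_points, n_lfs)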
We will take the NLP-based fake news example from Chapter 3 and see how we can scale it with Apache Spark. Along the way, we will work through the code and design changes needed to distribute the fake news implementation and deploy it at a massive scale.
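As a rough preview of where that journey ends up, the sketch below (continuing from the hypothetical example above) feeds the label matrix produced on the cluster into Snorkel's LabelModel on the driver. The expensive, per-document work of applying labeling functions is what scales out; the label model itself trains on the comparatively tiny label matrix, so it can remain on a single machine.

    from snorkel.labeling.model import LabelModel

    # L_train is the label matrix returned by SparkLFApplier in the sketch above.
    # It holds one small integer per labeling function per data point, so fitting
    # the label model on the driver is rarely the bottleneck.
    label_model = LabelModel(cardinality=2, verbose=True)
    label_model.fit(L_train, n_epochs=500, seed=123)

    # Probabilistic training labels for the downstream fake news classifier.
    probs_train = label_model.predict_proba(L_train)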
Bigger models put pressure on the underlying systems for training and inference; this is where we learn from the experience of other well-established software engineering paradigms, such as high-performance computing ...