Chapter 15. Training and Serving NLP Models Using Spark

Author’s Note

This article describes a framework we built to organize, construct, and serve predictive models. It was used by production systems in a variety of different industries, and while the larger system is no longer operational, the component that this article focuses on is open source and can be found on GitHub.

Identifying critical information in a sea of unstructured data and customizing real-time human interactions are two examples of how clients use our technology at Idibon, a San Francisco startup focused on natural language processing (NLP). The machine learning libraries in Spark ML and MLlib have enabled us to create an adaptive machine intelligence environment that analyzes text in any language, at a scale far surpassing the number of words per second in the Twitter firehose.

Our engineering team has built a platform that trains and serves thousands of NLP models in a distributed environment. This allows us to scale out quickly and provide thousands of predictions per second to many clients simultaneously. In this chapter, we'll explore the types of problems we're working to solve, the processes we follow, and the technology stack we use. This should be helpful for anyone looking to build out or improve their own NLP pipelines.

Constructing Predictive Models with Spark

Our clients are companies that need to automatically classify documents or extract information ...
