Chapter 6. Understanding Wikipedia with LDA and Spark NLP
With the growing amount of unstructured text data in recent years, it has become difficult to find the relevant information we need. Language technology provides powerful methods that can be used to mine text data and fetch the information we are looking for. In this chapter, we will use PySpark and the Spark NLP (natural language processing) library to apply one such technique: topic modeling. Specifically, we will use latent Dirichlet allocation (LDA) to understand a dataset of Wikipedia documents.
Topic modeling, one of the most common tasks in natural language processing, is a statistical approach for discovering the underlying topics present in a collection of documents. Extracting topic distributions from millions of documents can be useful in many ways; for example, it can identify the reasons for complaints about a particular product (or all products), or surface the themes in a corpus of news articles. The most popular algorithm for topic modeling is LDA. It is a generative model that assumes each document is represented by a distribution over topics, and each topic, in turn, by a distribution over words. PySpark MLlib offers an optimized implementation of LDA specifically designed to work in a distributed environment. We will build a simple topic modeling pipeline using Spark NLP to preprocess the data and Spark MLlib's LDA to extract topics from it.
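To build intuition for that generative story before we reach the distributed MLlib version, here is a toy sketch of how LDA assumes a document comes into being. This is purely illustrative NumPy code, not the book's pipeline: the vocabulary, topic count, and document length are made-up assumptions, and real LDA works in the opposite direction, inferring these distributions from observed documents.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up toy vocabulary (assumption for illustration only).
vocab = ["spark", "cluster", "data", "goal", "match", "league"]
n_topics, n_words = 2, len(vocab)

# Each topic is a distribution over words (each row sums to 1),
# drawn here from a symmetric Dirichlet prior.
topic_word = rng.dirichlet(np.ones(n_words), size=n_topics)

def generate_document(doc_len=8, alpha=0.5):
    """Sample one document under LDA's generative assumptions."""
    # Each document gets its own distribution over topics.
    doc_topics = rng.dirichlet(alpha * np.ones(n_topics))
    words = []
    for _ in range(doc_len):
        z = rng.choice(n_topics, p=doc_topics)    # pick a topic for this word
        w = rng.choice(n_words, p=topic_word[z])  # pick a word from that topic
        words.append(vocab[w])
    return words

print(generate_document())
```

Fitting LDA inverts this process: given only the documents, it recovers the per-document topic distributions and per-topic word distributions, which is what Spark MLlib's implementation will do for us at scale later in the chapter.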
In this chapter, ...