R’s tidytext turns messy text into valuable insight

Authors Julia Silge and David Robinson discuss the power of tidy data principles, sentiment lexicons, and what they're up to at Stack Overflow.

By Nicole Tache

July 26, 2017


“Many of us who work in analytical fields are not trained in even simple interpretation of natural language,” write Julia Silge, Ph.D., and David Robinson, Ph.D., in their newly released book Text Mining with R: A Tidy Approach. The applications of text mining are numerous and varied, though: sentiment analysis can assess the emotional content of text, frequency measurements can identify a document’s most important terms, n-gram and correlation analysis can explore relationships and connections between words, and topic modeling can classify and cluster similar documents.

I recently caught up with Silge and Robinson to discuss how they’re using text mining on job postings at Stack Overflow, some of the challenges and best practices they’ve experienced when mining text, and how their tidytext package for R aims to make text analysis both easy and informative.

Let’s start with the basics. Why would an analyst mine text? What insights can be derived from counting words or measuring their sentiment?

Text and other unstructured data is increasingly important for data analysts and data scientists in diverse fields from health care to tech to nonprofits. This data can help us make good decisions, but to capitalize on it, we must have the tools and the skills to get from unstructured text to insights. We can learn a lot by exploring word frequencies or comparing word usage, and we can dig deeper by implementing sentiment analysis to analyze the emotion or opinion content of words, or by fitting a topic model to discover underlying structure in a set of documents.

Why did you create the tidytext text mining package in R? How does it make an R user’s life easier?

We created the tidytext package because we believe in the power of tidy data principles, and we wanted to apply this consistent, opinionated approach for handling data to text mining tasks. Tidy tools like dplyr and ggplot2 are widely used, and integrating natural language processing into these tools allows R users to work with greater fluency.

One feature in tidytext 0.1.3 is the addition of the Loughran and McDonald sentiment lexicon of words specific to financial reporting, where words like “collaborate” and “collaborators” seem to be tagged as positive and words like “collapsed” and “collapsing” seem to be tagged as negative. For someone who is new to text mining, what is the general purpose of a sentiment lexicon? What are some ways this lexicon would be used by an analyst?

Sentiment lexicons are lists of words that have been assigned scores according to how positive or negative they are, or what emotions (such as “anticipation” or “fear”) they might be associated with. We can analyze the emotional content of text by adding up the scores of the words within it, which is a common approach to sentiment analysis. The tidytext package contains several general-purpose English lexicons appropriate for general text, and we are excited to extend these with a context-specific lexicon for finance. A word like “share” has a positive meaning in most contexts, but is neutral in financial contexts, where it usually refers to shares of stock. Applying the Loughran-McDonald lexicon allows us to explore the sentiment content of documents dealing with finance with more confidence.
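To make the "adding up the scores" idea concrete, here is a minimal sketch in R. It is an illustration, not the authors' code: the four-word lexicon `toy_lexicon`, the `reports` table, and the +1/-1 scores are all made up for the example. A real analysis would join against a full lexicon instead, such as the one tidytext returns from `get_sentiments("loughran")`.

```r
library(dplyr)
library(tidytext)

# Hypothetical mini-lexicon, invented for this sketch only.
# Real lexicons contain thousands of scored or labeled words.
toy_lexicon <- tibble(
  word  = c("gain", "strong", "loss", "collapsed"),
  value = c(1, 1, -1, -1)
)

# Two made-up snippets standing in for financial documents.
reports <- tibble(
  report = c("q1", "q2"),
  text   = c("Strong gain this quarter.", "Sales collapsed, a heavy loss.")
)

sentiment_by_report <- reports %>%
  unnest_tokens(word, text) %>%             # one lowercased word per row
  inner_join(toy_lexicon, by = "word") %>%  # keep only words the lexicon scores
  group_by(report) %>%
  summarise(score = sum(value))             # add up the scores per document

sentiment_by_report
```

Words missing from the lexicon are dropped by the join, so the score reflects only the words the lexicon knows about. That is one reason a domain-specific lexicon can matter.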

In your book, you perform text analysis on data sets ranging from classic Jane Austen novels to NASA metadata to Twitter archives. What are some of the ways you’re analyzing text data in your daily work at Stack Overflow?

We are swimming in text data at Stack Overflow! One example we deal with is text in job postings; we use text mining and modeling to match job listings with people who may be interested in them. Another example is text in messages between companies that are hiring and developers they want to hire; we use text mining to see what makes a developer more likely to respond to a company. But we’re certainly not unique in this; many organizations are dealing with increasing amounts of text data that are important to their decision-making.

Text data is messy, and things like abbreviations, “filler” words, or repeated words can present many challenges. What are some common challenges practitioners might confront when wrangling or visualizing text data, as opposed to more traditional data types (e.g., numerical)?

Data scientists and analysts like us are usually trained on numerical data in a rectangular shape like a table (i.e., data frame), so it takes some practice to fluently wrangle raw text data. We find ourselves reaching for regular expressions and the stringr package a lot, to deal with challenges such as stripping out HTML tags or email headers, or extracting subsets of text we are interested in. We often put such tasks into practice using the purrr package; it’s a very useful tool for dealing with iteration.
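As a rough illustration of that kind of cleanup, the sketch below strips HTML tags with a regular expression via stringr and applies the helper to each string with purrr. The sample strings and the helper name `strip_tags` are hypothetical. A simple pattern like this is not a full HTML parser, so it suits quick wrangling rather than arbitrary markup.

```r
library(stringr)
library(purrr)

# Made-up snippets resembling scraped job-posting text.
raw <- c(
  "<p>We are <b>hiring</b> R developers.</p>",
  "<div>Remote   role,<br>apply today</div>"
)

# Hypothetical helper: replace anything that looks like a tag with a space,
# then trim the ends and collapse repeated whitespace.
strip_tags <- function(x) {
  str_squish(str_replace_all(x, "<[^>]+>", " "))
}

# map_chr() calls the helper on each element and returns a character vector.
clean <- map_chr(raw, strip_tags)
clean
```

Because stringr functions are already vectorized, `strip_tags(raw)` would give the same result here. The purrr call shows the iteration pattern that becomes useful when each element needs a more involved, non-vectorized step.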

What are some best practices you can offer to data scientists and analysts looking to overcome text mining problems?

We come from a particular, opinionated perspective on this question; our advice is that adopting tidy data principles is an effective strategy to approach text mining problems. The tidy text format keeps one token (typically a word) in each row, and keeps each variable (such as a document or chapter) in a column. When your data is tidy, you can use a common set of tools for exploring and visualizing it. This frees you from struggling to get your data into the right format for each task and instead lets you focus on the questions you want to ask.
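The one-token-per-row format can be sketched in a few lines of R. This is a minimal example with made-up text: the `docs` table and its contents are assumptions for illustration, while `unnest_tokens()` is the tidytext function that does the tokenizing.

```r
library(dplyr)
library(tidytext)

# Two made-up documents; the text is illustrative only.
docs <- tibble(
  document = c("a", "b"),
  text     = c("Tidy data is tidy.", "Text can be tidy too.")
)

# One token per row. By default unnest_tokens() splits into words,
# lowercases them, strips punctuation, and keeps the document column.
tidy_docs <- docs %>%
  unnest_tokens(word, text)

# Ordinary dplyr verbs now work directly on the tokens,
# for example counting word frequencies across all documents.
word_counts <- tidy_docs %>%
  count(word, sort = TRUE)

word_counts
```

From here the same table can feed ggplot2 for plotting, or be joined against a stop-word list or a sentiment lexicon, without reshaping the data for each task.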

Your book demonstrates how to do text mining in R. Which R tools do you commonly use to support text mining? And why is R your tool of choice?

Our main toolbox for text mining in R focuses on our package tidytext, along with the packages dplyr, tidyr, and ggplot2. These are all tools from the tidyverse collection of packages in R, and the availability and cohesion of these tools are the reasons why we use R for text mining. Using consistent tools designed for handling tidy data gives us a dependable framework for understanding how to represent text data in R, visualize the characteristics of text, model topics, and move smoothly to more complex machine learning applications.

What is the difference between text mining and natural language processing?

In our experience, definitions for these terms are somewhat vague and sometimes interchangeable. When people talk about text mining, they often mean getting insight from text through statistical analysis, perhaps looking at word frequencies or clustering. When people talk about natural language processing, they’re often describing the interaction between language and computers, and sometimes the goal of extracting meaning to enable human-computer conversations. We describe our work as “text mining” because our goal is extracting and visualizing insights, but there is a great deal of overlap.
