Chapter 1. Language and Computation

Applications that leverage natural language processing to understand text and audio data are becoming fixtures of our lives. On our behalf, they curate the myriad of human-generated information on the web, offering new and personalized mechanisms of human-computer interaction. These applications are so prevalent that we have grown accustomed to a wide variety of behind-the-scenes applications, from spam filters that groom our email traffic, to search engines that take us right where we want to go, to virtual assistants who are always listening and ready to respond.

Language-aware features are data products built at the intersection of experimentation, research, and practical software development. The application of text and speech analysis is directly experienced by users whose response provides feedback that tailors both the application and the analysis. This virtuous cycle often starts somewhat naively, but over time can grow into a deep system with rewarding outcomes.

Ironically, while the potential for integrating language-based features into applications continues to multiply, a disproportionate number are being rolled out by the “big guys.” So why aren’t more people doing it? Perhaps it is in part because as these features become increasingly prevalent, they also become increasingly invisible, masking the complexity required to implement them. But it’s also because the rising tide of data science hasn’t yet permeated the prevailing culture of software development.

We believe applications that rely on natural language interfaces are only going to become more common, replacing much of what is currently done with forms and clicks. To develop these future applications, software development must embrace hypothesis-driven data science techniques. To ensure that language-aware data products become more robust, data scientists must employ software engineering practices that create production-grade code. These efforts are integrated by a newly evolving paradigm of data science, which leads to the creation of language-aware data products, the primary focus of this book.

The Data Science Paradigm

Thanks to innovations in machine learning and scalable data processing, the past decade has seen “data science” and “data product” rapidly become household terms. It has also led to a new job description, data scientist—one part statistician, one part computer scientist, and one part domain expert. Data scientists are the pivotal value creators of the information age, and so this new role has become one of the most significant, even sexy, jobs of the 21st century, but also one of the most misunderstood.

Data scientists bridge work traditionally done in an academic context, research and experimentation, to the workflow of a commercial product. This is in part because many data scientists have previously spent time in postgraduate studies (giving them the jack-of-all-trades and creative skills required for data science), but is primarily because the process of data product development is necessarily experimental.

The challenge, which prominent voices in the field have begun to signal, is that the data science workflow is not always compatible with software development practices. Data can be unpredictable, and signal is not always guaranteed. As Hilary Mason says of data product development, data science isn’t always particularly agile.1

Or, said another way:

There is a fundamental difference between delivering production software and actionable insights as artifacts of an agile process. The need for insights to be actionable creates an element of uncertainty around the artifacts of data science—they might be “complete” in a software sense, and yet lack any value because they don’t yield real, actionable insights….agile software methodologies don’t handle this uncertainty well.

Russell Jurney, Agile Data Science 2.0

As a result, data scientists and data science departments often operate autonomously from the development team in a work paradigm described in Figure 1-1. In this context, data science work produces business analytics for senior management, who communicate changes to the technology or product leadership; those changes are eventually passed on to the development team for implementation.

In organizations, data scientists often operate autonomously from the development team.
Figure 1-1. The current data science paradigm

While this structure may be sufficient for some organizations, it is not particularly efficient. If data scientists were integrated with the development team from the start as in Figure 1-2, improvements to the product would be much more immediate and the company much more competitive. There aren’t many companies that can afford to build things twice! More importantly, the efforts of data science practice are directed toward users, requiring an in-the-loop approach alongside frontend development.

While integrating data science directly into development is not straightforward, it presents tremendous potential.
Figure 1-2. Toward a better paradigm for data science development

One of the impediments to a more integrated data science development paradigm is the lack of applications-focused data science content. Most of the published resources on machine learning and natural language processing are written in ways that support research, but do not scale well to application development. For instance, while there are a number of excellent tools for machine learning on text, the available resources, documentation, tutorials, and blog posts tend to lean heavily on toy datasets, data exploration tools, and research code. Few resources exist to explain, for example, how to build a sufficiently large corpus to support an application, how to manage its size and structure as it grows over time, or how to transform raw documents into usable data. In practice, this is unquestionably the majority of the work involved in building scalable language-based data products.

This book is intended to bridge that gap by empowering a development-oriented approach to text analytics. In it, we will demonstrate how to leverage the available open source technologies to create data products that are modular, testable, tunable, and scalable. Together with these tools, we hope the applied techniques presented in this book will enable data scientists to build the next generation of data products.

This chapter serves as the foundation to the more practical, programming-focused chapters of the rest of the book. It begins by framing what we mean by language-aware data products and talking about how to begin spotting them in the wild. Next, we’ll discuss architectural design patterns that are well suited to text analytics applications. Finally, we’ll consider the features of language that can be used to model it computationally.

Language-Aware Data Products

Data scientists build data products. Data products are applications that derive their value from data and generate new data in return.2 In our view, the goal of applied text analytics is to enable the creation of “language-aware data products”—user-facing applications that are not only responsive to human input and adaptive to change, but also impressively accurate and relatively simple to design. At their core, these applications accept text data as input, parse it into composite parts, compute upon those composites, and recombine them in a way that delivers a meaningful and tailored result.

One of our favorite examples of this is “Yelpy Insights,” a review filtering application that leverages a combination of sentiment analysis, significant collocations (words that tend to appear together), and search techniques to determine if a restaurant is suitable for your tastes and dietary restrictions. This application uses a rich, domain-specific corpus and presents results to users in an intuitive way that helps them decide whether to patronize a particular restaurant. Because of the application’s automatic identification of significant sentences in reviews and term highlighting, it allows potential restaurant-goers to digest a large amount of text quickly and make dining decisions more easily. Although language analysis is not Yelp’s core business, the impact this feature has on the experience of their users is undeniable. Since introducing “Yelpy Insights” in 2012, Yelp has steadily rolled out new language-based features, and during that same period, has seen annual revenue rise by a factor of 6.5.3

Another simple example of bolt-on language analysis with oversized effects is the “suggested tag” feature incorporated into the data products of companies like Stack Overflow, Netflix, Amazon, YouTube, and others. Tags are meta information about a piece of content that are essential for search and recommendations, and they play a significant role in determining what content is viewed by specific users. Tags identify properties of the content they describe, which can be used to group similar items together and propose descriptive topic names for a group.

There are many, many more. Reverb offers a personalized news reader trained on the Wordnik lexicon. The Slack chatbot provides contextual automatic interaction. Google Smart Reply can suggest responses based on the text of the email you’re replying to. Textra, iMessage, and other instant messaging tools try to predict what you’ll type next based on the text you just entered, and autocorrect tries to fix our spelling mistakes for us. There are also a host of new voice-activated virtual assistants—Alexa, Siri, Google Assistant, and Cortana—trained on audio data, that are able to parse speech and provide (usually) appropriate responses.


So what about speech data? While this book is focused on text rather than on audio or speech analysis, audio data is typically transcribed into text, after which the analytics described in this book can be applied. Transcription is itself a machine learning process, and one that is becoming increasingly common!

Features like these highlight the basic methodology of language-aware applications: clustering similar text into meaningful groups or classifying text with specific labels, or said another way—unsupervised and supervised machine learning.

In the next section, we’ll explore some architectural design patterns that support the machine learning model lifecycle.

The Data Product Pipeline

The standard data product pipeline, shown in Figure 1-3, is an iterative process consisting of two phases—build and deploy—which mirror the machine learning pipeline.4 During the build phase, data is ingested and wrangled into a form that allows models to be fit and experimented on. During the deploy phase, models are selected and then used to make estimations or predictions that directly engage a user.

The data product pipeline focuses on machine learning models, which are trained from data then generate new data that can be used as feedback to adapt the models.
Figure 1-3. A data product pipeline

Users respond to the output of models, creating feedback, which is in turn reingested and used to adapt models. The four stages—interaction, data, storage, and computation—describe the architectural components required for each phase. For example, during interaction the build phase requires a scraper or utility to ingest data while the user requires some application frontend. The data stage usually refers to internal components that act as glue to the storage stage, which is usually a database. Computation can take many forms, from simple SQL queries and Jupyter notebooks to cluster computing with Spark.

The deploy phase, other than requiring the selection and use of a fitted model, does not significantly differ from more straightforward software development. Often data science work products end at the API, which is consumed by other APIs or a user frontend. The build phase for a data product, however, does require more attention—and even more so in the case of text analytics. When we build language-aware data products, we create additional lexical resources and artifacts (such as dictionaries, translators, regular expressions, etc.) on which our deployed application will depend.

A more detailed view of the build phase is shown in Figure 1-4, a pipeline that supports robust, language-aware machine learning applications. The process of moving from raw data to deployed model is essentially a series of incremental data transformations. First, we transform the data from its original state into an ingested corpus, stored and managed inside a persistent data store. Next, the ingested data is aggregated, cleaned, normalized, and then transformed into vectors so that we can perform meaningful computation. In the final transformation, a model or models are fit on the vectorized corpus and produce a generalized view of the original data, which can be employed from within the application.

Data products that operate on text transform their data into a series of increasingly informed corpora then use machine learning models on vector representations of documents.
Figure 1-4. Language-aware data products

The model selection triple

What differentiates the construction of machine learning products is that the architecture must support and streamline these data transformations so that they are efficiently testable and tunable. As data products have become more successful, there has been increasing interest in generally defining a machine learning workflow for more rapid—or even automated—model building. Unfortunately, because the search space is large, automatic techniques for optimization are not sufficient.

Instead, the process of selecting an optimal model is complex and iterative, involving repeated cycling through feature engineering, model selection, and hyperparameter tuning. Results are evaluated after each iteration in order to arrive at the best combination of features, model, and parameters that will solve the problem at hand. We refer to this as the model selection triple5 workflow. This workflow, shown in Figure 1-5, aims to treat iteration as central to the science of machine learning, something to be facilitated rather than limited.

The model selection triple is a generalization of the machine learning workflow that expresses an instance of a model as its feature engineering, algorithm, and hyperparameter components.
Figure 1-5. The model selection triple workflow

In a 2015 article, Wickham et al.6 neatly disambiguate the overloaded term “model” by describing its three principal uses in statistical machine learning: model family, model form, and fitted model. The model family loosely describes the relationships of variables to the target of interest (e.g., a “linear model” or a “recurrent tensor neural network”). The model form is a specific instantiation of the model selection triple: a set of features, an algorithm, and specific hyperparameters. Finally, the fitted model is a model form that has been fit to a specific set of training data and is available to make predictions. Data products are composed of many fitted models, constructed through the model selection workflow, which creates and evaluates model forms.
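To make the three terms concrete, here is a brief sketch in scikit-learn terms; the data and hyperparameters are invented purely for illustration. The model family is the class of algorithm, the model form is a specific pipeline pairing feature engineering with an algorithm and its hyperparameters, and the fitted model is what results from fitting the form to training data.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Model family: linear models over term-frequency features.
# Model form: one instantiation of the model selection triple --
# features (TF-IDF vectors), algorithm (logistic regression),
# and hyperparameters (ngram_range, C).
form = Pipeline([
    ('features', TfidfVectorizer(ngram_range=(1, 2))),
    ('algorithm', LogisticRegression(C=1.0)),
])

# Fitted model: the form fit to a specific (toy) training set,
# now available to make predictions.
docs = ["the movie was wonderful", "the movie was horrible"]
labels = ["pos", "neg"]
fitted = form.fit(docs, labels)
print(fitted.predict(["a wonderful movie"]))
```

Model selection amounts to constructing and evaluating many such forms, varying each component of the triple in turn.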

Because we are not accustomed to thinking of language as data, the primary challenge of text analysis is interpreting what is happening during each of these transformations. With each successive transformation, the text becomes less and less directly meaningful to us because it becomes less and less like language. In order to be effective in our construction of language-aware data products, we must shift the way we think about language.

Throughout the rest of this chapter, we will frame how to think about language as data that can be computed upon. Along the way, we will build a small vocabulary that will enable us to articulate the kinds of transformations we will be performing on text data in subsequent chapters.

Language as Data

Language is unstructured data that has been produced by people to be understood by other people. By contrast, structured or semistructured data includes fields or markup that enable it to be easily parsed by a computer. However, while it does not feature an easily machine-readable structure, unstructured data is not random. On the contrary, it is governed by linguistic properties that make it very understandable to other people.

Machine learning techniques, particularly supervised learning, are currently the most well-studied and promising means of computing upon language. Machine learning allows us to train (and retrain) statistical models on language as it changes. By building models of language on context-specific corpora, applications can leverage narrow windows of meaning to be accurate without requiring complete interpretation. For example, building an automatic prescription application that reads medical charts requires a very different model than an application that summarizes and personalizes news.

A Computational Model of Language

As data scientists building language-aware data products, our primary task is to create a model that describes language and can make inferences based on that description.

A language model, formally defined, takes an incomplete phrase as input and infers the subsequent words most likely to complete the utterance. This type of language model is hugely influential in text analytics because it demonstrates the basic mechanism of a language application—the use of context to guess meaning. Language models also reveal the basic hypothesis behind applied machine learning on text: text is predictable. In fact, the mechanism used to score language models in an academic context, perplexity, is a measure of how predictable the text is, evaluated via the entropy (the level of uncertainty, or surprisal) of the language model’s probability distribution.

Consider the following partial phrases: “man’s best…” or “the witch flew on a…”. These low entropy phrases mean that language models would guess “friend” and “broomstick,” respectively, with a high likelihood (and in fact, English speakers would be surprised if the phrase wasn’t completed that way). On the other hand, high entropy phrases like “I’m going out to dinner tonight with my…” lend themselves to a lot of possibilities (“friend,” “mother,” and “work colleagues” could all be equally likely). Human listeners can use experience, imagination, and memory as well as situational context to fill in the blank. Computational models do not necessarily have the same context and as a result must be more constrained.
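As a minimal sketch of this completion mechanism, we can count bigram successors in a toy corpus and measure the entropy of each next-word distribution. The sentences here are invented for illustration; a real language model would be estimated from millions of documents.

```python
import math
from collections import Counter, defaultdict

# Toy training sentences, invented for illustration.
corpus = [
    "the witch flew on a broomstick",
    "the wizard flew on a broomstick",
    "the witch flew on a dragon",
]

# For every word, count the distribution of words that follow it.
successors = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for current, following in zip(tokens, tokens[1:]):
        successors[current][following] += 1

def complete(word):
    """Guess the most likely next word given the preceding word."""
    return successors[word].most_common(1)[0][0]

def entropy(word):
    """Entropy (in bits) of the next-word distribution; low entropy
    means the continuation is highly predictable."""
    counts = successors[word]
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())

print(complete("a"))   # 'broomstick'
print(entropy("on"), entropy("a"))
```

In this tiny corpus, “on” is always followed by “a” (zero entropy), while “a” has two possible continuations and therefore a higher-entropy, less predictable distribution.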

Language models demonstrate an ability to infer or define relationships between tokens, the UTF-8 encoded strings of data the model observes that human listeners and readers identify as words with meaning. In the formal definition, the model is taking advantage of context, defining a narrow decision space in which only a few possibilities exist.

This insight gives us the ability to generalize the formal model to other models of language that operate in applications such as machine translation or sentiment analysis. To take advantage of the predictability of text, we need to define a constrained, numeric decision space on which the model can compute. By doing this, we can leverage statistical machine learning techniques, both supervised and unsupervised, to build models of language that expose meaning from data.

The first step in machine learning is the identification of the features of data that predict our target. Text data provides many opportunities to extract features, ranging from shallow ones produced by simply splitting strings to deeper ones produced by parsing text to extract morphological, syntactic, and even semantic representations from the data.

In the following sections we’ll explore some simple ways that language data can expose complex features for modeling purposes. First, we’ll explore how the linguistic properties of a specific language (e.g., gender in English) can quickly enable statistical computation on text. We’ll then take a deeper look at how context modifies interpretation, and how this is usually used to create the traditional “bag-of-words” model. Finally, we’ll explore richer features that are parsed using morphological, syntactic, and semantic natural language processing.

Language Features

Consider a simple model that uses linguistic features to identify the predominant gender in a piece of text. In 2013 Neal Caren, an assistant professor of Sociology at the University of North Carolina Chapel Hill, wrote a blog post7 that investigated the role of gender in news to determine if men and women come up in different contexts. He applied a gender-based analysis of text to New York Times articles and determined that in fact male and female words appeared in starkly different contexts, potentially reinforcing gender biases.

What was particularly interesting about this analysis was the use of gendered words to create a frequency-based score of maleness or femaleness. In order to implement a similar analysis in Python, we can begin by building sets of words that differentiate sentences about men and about women. For simplicity, we’ll say that a sentence can have one of four states—it can be about men, about women, about both men and women, or unknown (since sentences can be about neither men nor women, and also because our MALE_WORDS and FEMALE_WORDS sets are not exhaustive):

MALE = 'male'
FEMALE = 'female'
UNKNOWN = 'unknown'
BOTH = 'both'

# These word sets are abridged for illustration; the lists used in the
# original analysis are considerably longer.
MALE_WORDS = set([
    'guy', 'spokesman', 'chairman', "men's", 'men', 'him', "he's",
    'his', 'boy', 'boyfriend', 'boys', 'brother', 'brothers', 'dad',
    'dads', 'dude', 'father', 'fathers', 'gentleman', 'gentlemen',
    'grandfather', 'grandson', 'he', 'himself', 'husband', 'husbands',
    'king', 'male', 'man', 'mr', 'nephew', 'prince', 'son', 'sons',
    'uncle', 'uncles', 'widower'
])

FEMALE_WORDS = set([
    'spokeswoman', 'chairwoman', "women's", 'women', "she's", 'her',
    'aunt', 'aunts', 'bride', 'daughter', 'daughters', 'female', 'girl',
    'girlfriend', 'girls', 'granddaughter', 'grandmother', 'herself',
    'ladies', 'lady', 'mom', 'moms', 'mother', 'mothers', 'mrs', 'ms',
    'niece', 'princess', 'queen', 'she', 'sister', 'sisters', 'widow',
    'wife', 'wives', 'woman'
])

Now that we have gender word sets, we need a method for assigning gender to a sentence; we’ll create a genderize function that examines the numbers of words from a sentence that appear in our MALE_WORDS list and in our FEMALE_WORDS list. If a sentence has only MALE_WORDS, we’ll call it a male sentence, and if it has only FEMALE_WORDS, we’ll call it a female sentence. If a sentence has nonzero counts for both male and female words, we’ll call it both; and if it has zero male and zero female words, we’ll call it unknown:

def genderize(words):

    mwlen = len(MALE_WORDS.intersection(words))
    fwlen = len(FEMALE_WORDS.intersection(words))

    if mwlen > 0 and fwlen == 0:
        return MALE
    elif mwlen == 0 and fwlen > 0:
        return FEMALE
    elif mwlen > 0 and fwlen > 0:
        return BOTH
    else:
        return UNKNOWN

We need a method for counting the frequency of gendered words and sentences within the complete text of an article, which we can do with the collections.Counter class from the Python standard library. The count_gender function takes a list of sentences and applies the genderize function to evaluate the total number of gendered words and gendered sentences. Each sentence’s gender is counted and all words in the sentence are also considered as belonging to that gender:

from collections import Counter

def count_gender(sentences):

    sents = Counter()
    words = Counter()

    for sentence in sentences:
        gender = genderize(sentence)
        sents[gender] += 1
        words[gender] += len(sentence)

    return sents, words

Finally, in order to engage our gender counters, we require some mechanism for parsing the raw text of the articles into component sentences and words, and for this we will use the NLTK library (which we’ll discuss further later in this chapter and in the next) to break our paragraphs into sentences. With the sentences isolated, we can then tokenize them to identify individual words and punctuation and pass the tokenized text to our gender counters to print the document’s percentages of male, female, both, and unknown words:

import nltk

def parse_gender(text):

    sentences = [
        [word.lower() for word in nltk.word_tokenize(sentence)]
        for sentence in nltk.sent_tokenize(text)
    ]

    sents, words = count_gender(sentences)
    total = sum(words.values())

    for gender, count in words.items():
        pcent = (count / total) * 100
        nsents = sents[gender]

        print(
            "{:0.3f}% {} ({} sentences)".format(pcent, gender, nsents)
        )

Running our parse_gender function on an article from the New York Times entitled “Rehearse, Ice Feet, Repeat: The Life of a New York City Ballet Corps Dancer” yields the following, unsurprising results:

50.288% female (37 sentences)
42.016% unknown (49 sentences)
4.403% both (2 sentences)
3.292% male (3 sentences)

The scoring function here takes into account the length of the sentence in terms of the number of words it contains. Therefore even though there are fewer total female sentences, over 50% of the article is female. Extensions of this technique can analyze words that are in female sentences versus in male sentences to see if there are any auxiliary terms that are by default associated with male and female genders. We can see that this analysis is relatively easy to implement in Python, and Caren found his results very striking:

If your knowledge of men’s and women’s roles in society came just from reading last week’s New York Times, you would think that men play sports and run the government. Women do feminine and domestic things. To be honest, I was a little shocked at how stereotypical the words used in the women subject sentences were.

Neal Caren

So what exactly is happening here? This mechanism, while deterministic, is a very good example of how words contribute to predictability in context (stereotypical though it may be). However, this mechanism works specifically because gender is a feature that is encoded directly into language. In other languages (like French, for example), gender is even more pronounced: ideas, inanimate objects, and even body parts can have genders (even if at times they are counter-intuitive). Language features do not necessarily convey definitional meaning, but often convey other information; for example, plurality and tense are other features we can extract from a language—we could potentially apply a similar analysis to detect past, present, or future language. However, language features are only part of the equation when it comes to predicting meaning in text.

Contextual Features

Sentiment analysis, which we will discuss in greater depth in Chapter 12, is an extremely popular text classification technique because the tone of text can convey a lot of information about the subject’s perspective and lead to aggregate analyses of reviews, message polarity, or reactions. One might assume that sentiment analysis can be conducted with a technique similar to the gender analysis of the previous section: gather lists of positive words (“awesome,” “good,” “stupendous”) and negative words (“horrible,” “tasteless,” “bland”) and compute the relative frequencies of these tokens in their context. Unfortunately, this technique is naive and often produces highly inaccurate results.
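To see why, consider a minimal sketch of this naive frequency approach, with small invented word lists:

```python
# Tiny, invented lexicons; real sentiment lexicons are much larger.
POSITIVE = {'awesome', 'good', 'stupendous', 'wonderful'}
NEGATIVE = {'horrible', 'tasteless', 'bland', 'disappointed'}

def naive_sentiment(text):
    # Score by relative frequency of positive versus negative tokens.
    tokens = text.lower().split()
    score = (sum(token in POSITIVE for token in tokens)
             - sum(token in NEGATIVE for token in tokens))
    if score > 0:
        return 'positive'
    if score < 0:
        return 'negative'
    return 'neutral'

print(naive_sentiment("the chowder was horrible"))  # 'negative'
print(naive_sentiment("not good"))                  # 'positive' -- wrong!
```

The second example is scored as positive even though “not good” means bad; simple token counts cannot see negation, word sense, or context, which is exactly the problem explored in the rest of this section.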

Sentiment analysis is fundamentally different from gender classification because sentiment is not a language feature, but instead dependent on word sense; for example, “that kick flip was sick” is positive whereas “the chowder made me sick” is negative, and “I have a sick pet iguana” is somewhat ambiguous—the definition of the word “sick” in these examples is changing. Moreover, sentiment is dependent on context even when definitions remain constant; “bland” may be negative when talking about hot peppers, but can be a positive term when describing cough syrup. Finally, unlike gender or tense, sentiment can be negated: “not good” means bad. Negation can flip the meaning of large amounts of positive text; “I had high hopes and great expectations for the movie dubbed wonderful and exhilarating by critics, but was hugely disappointed.” Here, though words typically indicating positive sentiment such as “high hopes,” “great,” “wonderful and exhilarating,” and even “hugely” outnumber the sole negative sentiment of “disappointed,” the positive words not only do not lessen the negative sentiment, they actually enhance it.

However, all of these examples are predictable; a positive or negative sentiment is clearly communicated, and it seems that a machine learning model should be able to detect sentiment and perhaps even highlight noisy or ambiguous utterances. An a priori deterministic or structural approach loses the flexibility of context and sense—so instead, most language models take into account the localization of words in their context, utilizing machine learning methods to create predictions.

Figure 1-6 shows the primary method of developing simple language models, often called the “bag-of-words” model. This model evaluates the frequency with which words co-occur with each other in a specific, limited context. Co-occurrences show which words are likely to precede and succeed each other, and by making inferences on limited pieces of text, large amounts of meaning can be captured. We can then use statistical inference methods to make predictions about word ordering.

A simple statistical view of language that counts the frequency of words occurring together in a simple context.
Figure 1-6. A word co-occurrence matrix
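A minimal sketch of such a co-occurrence count follows, using a fixed window of neighboring words; the window size and sentences are invented for illustration.

```python
from collections import Counter

def cooccurrence(sentences, window=2):
    # Count how often each pair of words appears within `window`
    # positions of each other; pairs are sorted so that (a, b) and
    # (b, a) count as the same cell of the matrix.
    pairs = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            for other in tokens[i + 1:i + 1 + window]:
                pairs[tuple(sorted((token, other)))] += 1
    return pairs

matrix = cooccurrence([
    "the witch flew on a broomstick",
    "man's best friend",
])
print(matrix[('flew', 'witch')])  # 1
```

The resulting sparse “matrix” records, for each word pair, how often the words occurred near each other, which is precisely the raw material for the statistical inferences described above.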

Extensions of the “bag-of-words” model consider not only single word co-occurrences, but also phrases that are highly correlated to indicate meaning. If “withdraw money at the bank” contributes a lot of information to the sense of “bank,” so does “fishing by the river bank.” This is called n-gram analysis, where n specifies an ordered sequence of either characters or words to scan on (e.g., a 3-gram is ('withdraw', 'money', 'at') as opposed to the 5-gram ('withdraw', 'money', 'at', 'the', 'bank')). n-grams introduce an interesting opportunity because the vast majority of possible n-grams are nonsensical (e.g., ('bucket', 'jumps', 'fireworks')), though the evolving nature of language means that even that 3-gram could eventually become sensical! Language models that take advantage of context in this way therefore require some ability to learn the relationship of text to some target variable.
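Extracting n-grams from a token sequence amounts to sliding a window of length n across it; a minimal sketch:

```python
def ngrams(tokens, n):
    # Slide a window of length n across the token sequence.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['withdraw', 'money', 'at', 'the', 'bank']
print(ngrams(tokens, 3))
# [('withdraw', 'money', 'at'), ('money', 'at', 'the'),
#  ('at', 'the', 'bank')]
```

The same function works for character n-grams by passing a string instead of a list of words.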

Both language features and contextual ones contribute to the overall predictability of language for analytical purposes. But identifying these features requires the ability to parse language into its constituent units. In the next section we will discuss, from the linguistic perspective, how language features and context coordinate to produce meaning.

Structural Features

Finally, language models and text analytics have benefited from advances in computational linguistics. Whether we are building models with contextual or linguistic features (or both), it is necessary to consider the high-level units of language used by linguists, which will give us a vocabulary for the operations we’ll perform on our text corpus in subsequent chapters. Different units of language are used to compute at a variety of levels, and understanding the linguistic context is essential to understanding the language processing techniques used in machine learning.

Semantics refer to meaning; they are deeply encoded in language and difficult to extract. If we think of an utterance (a simple phrase instead of a whole paragraph, such as “She borrowed a book from the library.”) in the abstract, we can see there is a template: a subject, the head verb, an object, and an instrument that relates back to the object (subject - predicate - object). Using such templates, ontologies can be constructed that specifically define the relationships between entities, but such work requires significant knowledge of the context and domain, and does not tend to scale well. Nonetheless, there is promising recent work on extracting ontologies from sources such as Wikipedia or DBPedia (e.g., DBPedia’s entry on libraries begins “A library is a collection of sources of information and similar resources, made accessible to a defined community for reference or borrowing.”).

Semantic analysis is not simply about understanding the meaning of text, but about generating data structures to which logical reasoning can be applied. Text meaning representations (or thematic meaning representations, TMRs) can be used to encode sentences as predicate structures to which first-order logic or lambda calculus can be applied. Other structures such as networks can be used to encode predicate interactions of interesting features in the text. Traversal can then be used to analyze the centrality of terms or subjects and reason about the relationships between items. Although not necessarily a complete semantic analysis, graph analysis can produce important insights.

Syntax refers to the rules for forming sentences, usually defined by a grammar. Sentences are what we use to build meaning, and they encode much more information than individual words; for this reason we will treat them as the smallest logical unit of language. Syntactic analysis is designed to show the meaningful relationships between words, usually by carving the sentence into chunks or by showing the relationships of tokens in a tree structure (similar to the sentence diagramming you probably did in grammar school). Syntax is a necessary prerequisite to reasoning about discourse or semantics because it is a vital tool for understanding how words modify each other in the formation of phrases. For example, syntactic analysis should identify the prepositional phrase “from the library” and the noun phrase “a book from the library” as subcomponents of the verb phrase “borrowed a book from the library.”
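The nested phrase structure just described can be sketched directly. Below is a hand-built parse of the example sentence, written as (label, children) tuples rather than produced by a real parser, with a small traversal that collects subtrees by phrase label:

```python
# A simplified, hand-built parse of "She borrowed a book from the library."
# S = sentence, NP = noun phrase, VP = verb phrase, PP = prepositional phrase.
tree = ("S",
        [("NP", ["She"]),
         ("VP", ["borrowed",
                 ("NP", ["a", "book",
                         ("PP", ["from", ("NP", ["the", "library"])])])])])

def phrases(node, label):
    """Recursively collect all subtrees carrying the given phrase label."""
    found = []
    if isinstance(node, tuple):
        tag, children = node
        if tag == label:
            found.append(node)
        for child in children:
            found.extend(phrases(child, label))
    return found

# The PP "from the library" nests inside the NP, which nests inside the VP.
print(len(phrases(tree, "PP")))  # 1
print(len(phrases(tree, "NP")))  # 3
```

A real parser (such as those shipped with NLTK) would produce a comparable tree automatically; the point here is only the containment relationships between phrases.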

Morphology refers to the form of things, and in text analysis, to the form of individual words or tokens. The structure of words can help us identify plurality (wife versus wives), gender (fiancé versus fiancée), tense (ran versus run), conjugation (to run versus he runs), etc. Morphology is challenging because most languages have many exceptions and special cases. English pluralization, for instance, has both orthographic rules, which merely adjust the ending of a word (puppy - puppies), and morphological rules that are complete transformations (goose - geese). English is an affixal language, which means that we modify words simply by adding characters to their beginnings or endings. Other languages have different morphological modes: Hebrew uses templates of consonants that are filled in with vowels to create meaning, whereas Chinese uses pictographic symbols that are not necessarily modified directly.
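The contrast between orthographic rules and complete transformations can be made concrete with a sketch of an English pluralizer: a few suffix rules plus an exception table for irregular forms. This is deliberately incomplete (real morphological analyzers handle far more cases), and the rule set is our own:

```python
# Irregular plurals require a complete transformation, not a suffix change.
IRREGULAR = {"goose": "geese", "wife": "wives", "child": "children"}

def pluralize(noun):
    if noun in IRREGULAR:
        return IRREGULAR[noun]                    # goose -> geese
    if noun.endswith("y") and noun[-2] not in "aeiou":
        return noun[:-1] + "ies"                  # puppy -> puppies
    if noun.endswith(("s", "sh", "ch", "x", "z")):
        return noun + "es"                        # box -> boxes
    return noun + "s"                             # book -> books

print(pluralize("puppy"), pluralize("goose"), pluralize("book"))
```

Note how quickly the exception table becomes the interesting part; this is exactly why morphology is hard, and why affixal rules alone do not suffice even in English.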

The primary goal of morphology is to understand the parts of words so that we can assign them to classes, often called part-of-speech tags. For example, we want to know if a word is a singular noun, a plural noun, or a proper noun. We might also want to know if a verb is infinitive, past tense, or a gerund. These parts of speech are then used to build up larger structures such as chunks or phrases, or even complete trees, that can then in turn be used to build up semantic reasoning data structures.
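A toy tagger makes the idea of part-of-speech classes concrete. Here we simply look each token up in a tiny hand-made lexicon; the Penn Treebank tags (PRP, VBD, DT, NN, IN) are a real convention, but the lexicon and the fallback heuristic are our own sketch, where a real tagger would use a trained model:

```python
# A minimal lexicon mapping lowercased tokens to Penn Treebank POS tags.
LEXICON = {
    "she": "PRP", "borrowed": "VBD", "a": "DT",
    "book": "NN", "from": "IN", "the": "DT", "library": "NN",
}

def tag(tokens):
    # Default unknown tokens to NN (singular noun), a common baseline heuristic.
    return [(token, LEXICON.get(token.lower(), "NN")) for token in tokens]

tagged = tag("She borrowed a book from the library".split())
print(tagged[1])  # ('borrowed', 'VBD') -- a past-tense verb
```

These (token, tag) pairs are precisely the input that chunkers and parsers consume to build the phrase structures discussed above.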

Semantics, syntax, and morphology allow us to annotate simple text strings with linguistic meaning. In Chapter 3 we will explore how to carve up text, using tokenization and segmentation to break it into its units of logic and meaning, as well as how to assign part-of-speech tags. In Chapter 4 we will apply vectorization to these structures to create numeric feature spaces, for example, normalizing text with stemming and lemmatization to reduce the number of features. Finally, in Chapter 7, we will use these structures directly to encode information into our machine learning models, improving performance and targeting more specific types of analytics.


Natural language is one of the most untapped forms of data available today. It has the ability to make data products even more useful and integral to our lives than they already are. Data scientists are uniquely poised to build these types of language-aware data products, and by combining text data with machine learning, they have the potential to build powerful applications in a world where information often equates to value and a competitive advantage. From email to maps to search, our modern life is powered by natural language data sources, and language-aware data products are what make their value accessible.

In the next few chapters, we will discuss the necessary precursors to machine learning on text, namely corpus management (Chapter 2), preprocessing (Chapter 3), and vectorization (Chapter 4). We will then experiment with formulating machine learning problems as ones of classification (Chapter 5) and clustering (Chapter 6). In Chapter 7 we’ll implement feature extraction to maximize the effectiveness of our models, and in Chapter 8 we’ll see how to employ text visualization to surface results and diagnose modeling errors. In Chapter 9, we will explore a different approach to modeling language, using the graph data structure to represent words and their relationships. We’ll then explore more specialized methods of retrieval, extraction, and generation for chatbots in Chapter 10. Finally, in Chapters 11 and 12 we will investigate techniques for scaling processing power with Spark and scaling model complexity with artificial neural networks.

As we will see in the next chapter, in order to perform scalable analytics and machine learning on text, we will first need both domain knowledge and a domain-specific corpus. For example, if you are working in the financial domain, your application should be able to recognize stock symbols, financial terms, and company names, which means that the documents in the corpus you construct need to contain these entities. In other words, developing a language-aware data product begins with acquiring the right kind of text data and building a custom corpus that contains the structural and contextual features from the domain in which you are working.

