Chapter 1. Introduction to NLP

What do you think your computer can do? Show you emails? Edit some files? Spin up an Excel sheet maybe?

But what if we told you your computer could read?

from transformers import pipeline
classifier = pipeline('sentiment-analysis')
classifier('I am reading the greatest NLP book ever!')
[{'label': 'POSITIVE', 'score': 0.9996862411499023}]

And write:

text_generator = pipeline("text-generation")
text_generator("Welcome to the ", max_length=5, do_sample=False)

And, most impressively, understand:

nlp = pipeline("question-answering")
context = """
Natural language processing (NLP) is a subfield of linguistics,
computer science, and artificial intelligence concerned with the
interactions between computers and human language, in particular
how to program computers to process and analyze large amounts of
natural language data. The result is a computer capable of
"understanding" the contents of documents, including the contextual
nuances of the language within them. The technology can then accurately
extract information and insights contained in the documents as well
as categorize and organize the documents themselves.
"""
nlp(question="What is NLP?", context=context)
{'score': 0.9869255423545837,
 'start': 1,
 'end': 28,
 'answer': 'Natural language processing'}

What was once the fantasy of a distant future is not only here but is accessible to anyone with a computer and an internet connection. The ability to understand and communicate in natural language, one of the most valuable assets that humanity has developed over the course of our existence, is now practical to do on machines.

“Of course!” you proclaim. “Technology always gets better, and we’ve had speech recognition and Google Translate for ages!”

But even just five years ago, “NLP” was something better suited to TechCrunch articles than actual production codebases. In the last three years, progress in the field has grown exponentially; the models being deployed in production today are vastly superior to those that topped research leaderboards only a few years ago.

But we’re getting ahead of ourselves. Before we delve deeper, let’s start with a high-level overview of the field. Once we cover the basics, we will introduce more advanced topics. Our goal is to help you build intuition and experience working with NLP, chapter by chapter, so that by the end of the book, you’ll be able to build real applications that add real value to the world.

In the first half of this chapter, we will define NLP, explore some commercial applications of the technology, and walk through how the field has evolved since its origins in the 1950s.

In the second half of the chapter, we will introduce a very performant NLP library that is popular in the enterprise and use it to perform basic NLP tasks. While these tasks are elementary, when combined together, they allow computers to process and analyze natural language data in complex ways that make amazing commercial applications such as chatbots and voicebots possible.

In some ways, the process of machines learning how to process language is similar to how toddlers begin to learn language by mumbling and fumbling over words, only to later speak in full sentences and paragraphs. As we move through the book, we will build on the basic NLP tasks covered in this chapter.

What Is NLP?

Let’s begin by defining what natural language processing is. Here is how NLP is defined on Wikipedia (accessed March 2021):

Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data.

Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation.

Let’s unpack this definition. When we say “natural language,” we mean “human language” as opposed to programming languages. Natural language refers to not only textual data, but also to speech and audio data.

Great, but so what if computers can now work with large amounts of text, speech, and audio data? Why is this so important?

Imagine for a second the world without language. How would we communicate via text or speech? How would we read books, listen to music, or comprehend movies and TV shows? Life as we know it would cease to exist; we would be stuck in caveman days, able to process information visually but unable to share our knowledge with each other or communicate in any meaningful way.1

Likewise, if machines can work with only numerical and visual data but cannot process natural language, they would be limited in the number and variety of applications they would have in the real world. Without the ability to handle natural language, machines will never be able to approach general artificial intelligence or anything that resembles human intelligence today.

Fortunately, machines can now finally process natural language data reasonably well. Let’s explore what commercial applications are possible because of this relatively newfound ability of computers to work with natural language data.

Popular Applications

Because of the advances in NLP, machines are able to handle a broad array of natural language tasks, at least in a rudimentary way. Here are some common applications of NLP today:

Machine translation

Machine translation is the process of using machines to translate from one language to another without any human intervention. By far the most popular example of this is Google Translate, which supports over 100 languages and serves over 500 million people daily. When it was first launched in 2006, the performance of Google Translate was notably worse than what it is today. Performance today is fast approaching human expert level.2

Speech recognition

It may sound shocking, but voice recognition technology has been around for over 50 years. Still, no voice recognition software performed well or went mainstream until very recently, driven by the rise of deep learning. Today, Amazon Alexa, Apple Siri, Google Assistant, Microsoft Cortana, digital voice assistants in your car, and other software are able to recognize speech with such a high level of accuracy that they can process the information in real time and answer in a mostly reasonable way. Even as little as 15 years ago, the ability of such machines to recognize speech and respond in a coherent manner was abysmal.

Question answering

For these digital assistants to deliver a delightful experience to humans asking questions, speech recognition is only the first half of the job. The software needs to (a) recognize the speech and (b), given the speech recognized, retrieve an appropriate response. This second half is known as question answering (QA).

Text summarization

One of the most common tasks humans do every day, especially in white collar desk jobs, is read long-form documents and summarize the contents. Machines are now able to perform this summarization, creating a shorter summary of a longer text document. Text summarization reduces the reading time for humans. People who analyze lots of text daily (e.g., lawyers, paralegals, business analysts, and students) are able to sift through the machine-generated summaries of long-form documents and then, based on the summaries, choose the relevant documents to read more thoroughly.

Chatbots

If you have spent some time perusing websites recently, you may have realized that more and more sites now have a chatbot that automatically chimes in to engage the human user. The chatbot usually greets the human in a friendly, nonthreatening manner and then asks the user questions to gauge the purpose and intent of the visit to the site. The chatbot then tries to automatically respond to any questions the user has without human intervention. Such chatbots are now automating digital customer engagement.

Text-to-speech and speech-to-text

Software is now able to convert text to high-fidelity audio very easily. For example, Google Cloud Text-to-Speech is able to convert text into human-like speech in more than 180 voices across over 30 languages. Likewise, Google Cloud Speech-to-Text is able to convert audio to text for over 120 languages, delivering a truly global offering.

Voicebots

Ten years ago, automated voice agents were clunky. Unless humans responded in a fairly constrained manner (e.g., with yes or no type responses), the voice agents on the phone could not process the information. Now, AI voicebots like those provided by VOIQ are able to help augment and automate calls for sales, marketing, and customer success teams.

Text and audio generation

Years ago, text generation relied on templates and rules-based systems. This limited the scope of application. Now, software is able to generate text and audio using machine learning, broadening the scope of application considerably. For example, Gmail is now able to suggest entire sentences based on previous sentences you’ve drafted, and it’s able to do this on the fly as you type. While natural language generation is best at short blurbs of text (partial sentences), soon such systems may be able to produce reasonably good long-form content. A popular commercial application of natural language generation is data-to-text software, which generates textual summaries of databases and datasets. Data-to-text software includes data analysis as well as text generation. Firms in this space include Narrative Science and Automated Insights.

Sentiment analysis

With the explosion of social media content, there is an ever-growing need to automate customer sentiment analysis, dissecting tweets, posts, and comments for sentiment such as positive versus negative versus neutral or angry versus sad versus happy. Such software is also known as emotion AI.

Information extraction

One major challenge in NLP is creating structured data from unstructured and/or semi-structured documents. For example, named entity recognition software is able to extract people, organizations, locations, dates, and currencies from long-form texts such as mainstream news. Information extraction also involves relationship extraction, identifying the relations between entities, if any.

The number of NLP applications in the enterprise has exploded over the past decade, ranging from speech recognition and question answering to voicebots and chatbots that are able to generate natural language on their own. This is quite astounding given where the field was a few decades ago.

To put the current progress in NLP into perspective, let’s walk through how NLP has progressed, starting from its origins in 1950.

History

The field of natural language processing has been around for nearly 70 years. Perhaps most famously, Alan Turing laid the foundation for the field by developing the Turing test in 1950. The Turing test is a test of a machine’s ability to demonstrate intelligence that is indistinguishable from that of a human. For the machine to pass the Turing test, it must generate human-like responses such that a human evaluator would not be able to tell whether the responses were generated by a human or a machine (i.e., the machine’s responses are of human quality).3

The Turing test launched significant debate in the then-nascent artificial intelligence field and spurred researchers to develop natural language processing models that would serve as building blocks for a machine that someday may pass the Turing test, a search that continues to this day.

Like the broader field of artificial intelligence, NLP has had many booms and busts, lurching from hype cycles to AI winters. In 1954, Georgetown University and IBM successfully built a system that could automatically translate more than 60 Russian sentences to English. At the time, researchers at Georgetown University thought machine translation would be a solved problem within three to five years. The success in the US also spurred the Soviet Union to launch similar efforts. The Georgetown-IBM success coupled with the Cold War mentality led to increased funding for NLP in these early years.

However, by 1966, progress had stalled, and the Automatic Language Processing Advisory Committee (known as ALPAC)—a US government agency set up to evaluate the progress in computational linguistics—released a sobering report. The report stated that machine translation was more expensive, less accurate, and slower than human translation and unlikely to reach human-level performance in the near future. The report led to a reduction in funding for machine translation research. Following the report, research in the field nearly died for almost a decade.

Despite these setbacks, the field of NLP reemerged in the 1970s. By the 1980s, computational power had increased significantly and costs had come down sufficiently, opening up the field to many more researchers around the world.

In the late 1980s, NLP rose in prominence again with the release of the first statistical machine translation systems, led by researchers at IBM’s Thomas J. Watson Research Center. Prior to the rise of statistical machine translation, machine translation relied on human handcrafted rules for language. These systems were called rules-based machine translation. The rules would help correct and control mistakes that the machine translation systems would typically make, but crafting such rules was a laborious and painstaking process. The machine translation systems were also brittle as a result; if the machine translation systems encountered edge-case scenarios for which rules had not been developed, they would fail, sometimes egregiously.

Statistical machine translation helped reduce the need for human handcrafted rules, and it relied much more heavily on learning from data. Using a bilingual corpus with parallel texts as data (i.e., two texts that are identical except for the language they are written in), such systems would carve sentences into small subsets and translate the subsets segment-by-segment from the source language to the target language. The more data (i.e., bilingual text corpuses) the system had, the better the translation. Statistical machine translation would remain the most widely studied and used machine translation method until the rise of neural machine translation in the mid-2010s.

By the 1990s, such successes led researchers to expand beyond text into speech recognition. Speech recognition, like machine translation, had been around since the early 1950s, spurred by early successes by the likes of Bell Labs and IBM. But speech recognition systems had severe limitations. In the 1960s, for example, such systems could take voice commands for playing chess but not do much else.

By the mid-1980s, IBM applied a statistical approach to speech recognition and launched a voice-activated typewriter called Tangora, which could handle a 20,000-word vocabulary.

DARPA, Bell Labs, and Carnegie Mellon University also had similar successes by the late 1980s. Speech recognition software systems by then had larger vocabularies than the average human and could handle continuous speech recognition, a milestone in the history of speech recognition.

In the 1990s, several researchers in the space left research labs and universities to work in industry, which led to more commercial applications of speech recognition and machine translation.

Today’s NLP heavyweights, such as Google, hired their first speech recognition employees in 2007. The US government also got involved then; the National Security Agency began tagging large volumes of recorded conversations for specific keywords, facilitating the search process for NSA analysts.

By the early 2010s, NLP researchers, both in academia and industry, began experimenting with deep neural networks for NLP tasks. Early deep learning–led successes came from a deep learning method called long short-term memory (LSTM). In 2015, Google used such a method to revamp Google Voice.

Deep learning methods led to dramatic performance improvements in NLP tasks, spurring more dollars into the space. These successes have led to a much deeper integration of NLP software in our everyday lives.

For example, cars in the early 2010s had voice recognition software that could handle a limited set of voice commands. Cars now have tech that can handle a much broader set of natural language commands, inferring context and intent much more clearly.

Looking back today, progress in NLP was slow but steady, moving from rules-based systems in the early days to statistical machine translation by the 1980s and to neural network–based systems by the 2010s. While academic research in the space has been fierce for quite some time, NLP has become a mainstream topic only recently. Let’s examine the main inflection points over the past several years that have helped NLP become one of the hottest topics in AI today.

Inflection Points

NLP and computer vision are both subfields of artificial intelligence, but computer vision has had more commercial successes to date. Computer vision had its inflection point in 2012 (the so-called “ImageNet” moment) when the deep learning–based model AlexNet dramatically cut the error rate of previous computer vision systems.

In the years since 2012, computer vision has powered applications such as auto-tagging of photos and videos, self-driving cars, cashier-less stores, facial recognition–powered authentication of devices, radiology diagnoses, and more.

NLP has been a relatively late bloomer by comparison. NLP made waves from 2014 onward with the release of Amazon Alexa, a revamped Apple Siri, Google Assistant, and Microsoft Cortana. Google also launched a much-improved version of Google Translate in 2016, and now chatbots and voicebots are much more commonplace.

That being said, it wasn’t until 2018 that NLP had its very own ImageNet moment with the release of large pretrained language models trained using the Transformer architecture; the most notable of these was Google’s BERT, which was launched in November 2018.

In 2019, generative models such as OpenAI’s GPT-2 made splashes, generating new content on the fly based on previous content, a previously insurmountable feat. In 2020, OpenAI released an even larger and more impressive version, GPT-3, building on its previous successes.

Heading into 2021 and beyond, NLP is now no longer an experimental subfield of AI. Along with computer vision, NLP is now poised to have many broad-based applications in the enterprise. With this book, we hope to share some concepts and tools that will help you build some of these applications at your company.

A Final Word

There is not one single approach to solving NLP tasks. The three dominant approaches today are rule-based, traditional machine learning (statistical-based), and neural network–based.

Let’s explore each approach:

Rule-based NLP

Traditional NLP software relies heavily on human-crafted rules of languages; domain experts, typically linguists, curate these rules using things like regular expressions and pattern matching. Rule-based NLP performs well in narrowly scoped use cases but typically does not generalize well. More and more rules are necessary to generalize such a system, and this makes rule-based NLP a labor-intensive and brittle solution compared to the other NLP approaches. Here are examples of rules in a rule-based system: words ending in -ing are verbs, words ending in -er or -est are adjectives, words ending in ’s are possessives, etc. Think of how many rules we would need to create by hand to make a system that could analyze and process a large volume of natural language data. Not only would the creation of rules be a mind-bogglingly difficult and tedious process, but we would also have to deal with the many errors that would occur from using such rules. We would have to create rules for rules to address all the corner cases for each and every rule.
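To make this concrete, here is a minimal sketch of what a couple of handcrafted rules might look like using regular expressions. The rules and the example sentence are our own toy illustrations, not taken from any production system:

# Two toy handcrafted rules: words ending in "-ing" are tagged as verbs,
# and words ending in "'s" are tagged as possessives
import re

rules = [
    (re.compile(r"\w+ing"), "VERB (by rule)"),
    (re.compile(r"\w+'s"), "POSSESSIVE (by rule)"),
]

sentence = "Galileo was espousing this man's theory every morning"
for word in sentence.split():
    for pattern, tag in rules:
        if pattern.fullmatch(word):
            print(word, "->", tag)

Even here, the first rule misfires on “morning,” which ends in -ing but is a noun; that is exactly the kind of corner case that forces rule-based systems to pile rules on top of rules.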

Traditional (or classical) machine learning

Traditional machine learning relies less on rules and more on data. It uses a statistical approach, drawing probability distributions of words based on a large annotated corpus. Humans still play a meaningful role; domain experts need to perform feature engineering to improve the machine learning model’s performance. Features include capitalization, singular versus plural, surrounding words, etc. After creating these features, you would have to train a traditional ML model to perform NLP tasks; e.g., text classification. Since traditional ML uses a statistical approach to determine when to apply certain features or rules to process language, traditional ML-based NLP is easier to build and maintain than a rule-based system. It also generalizes better than rule-based NLP.
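As a rough sketch of this recipe, the following toy example hand-engineers a few features (capitalization, suffix, and the previous word) and trains a statistical classifier to guess whether a token is a proper noun. We use scikit-learn purely for illustration; the library choice, the feature set, and the tiny training set are our own assumptions, not something prescribed here:

# Hand-engineer token features, then train a statistical classifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def token_features(token, prev_token):
    # Classic hand-crafted features: capitalization, suffix, previous word
    return {
        "is_capitalized": token[0].isupper(),
        "suffix3": token[-3:].lower(),
        "prev_word": prev_token.lower(),
    }

# Tiny toy training set: (token, previous token, 1 if proper noun else 0)
train = [
    ("Paris", "in", 1), ("London", "in", 1), ("Galileo", ",", 1),
    ("city", "the", 0), ("live", "we", 0), ("theory", "this", 0),
]

vectorizer = DictVectorizer()
X = vectorizer.fit_transform([token_features(t, p) for t, p, _ in train])
y = [label for _, _, label in train]
model = LogisticRegression().fit(X, y)

# Score an unseen token; with these features, "Berlin" after "in"
# should look like a proper noun to the model
X_new = vectorizer.transform([token_features("Berlin", "in")])
print(model.predict(X_new))

The point is not this tiny model but the division of labor: humans design the features, and the statistical model learns from data when to apply them.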

Neural networks

Neural networks address the shortcomings of traditional machine learning. Instead of requiring humans to perform feature engineering, neural networks will “learn” the important features via representation learning. To perform well, these neural networks just need copious amounts of data. The amount of data required for these neural nets to perform well is substantial, but, in today’s internet age, data is not too hard to acquire. You can think of neural networks as very powerful function approximators or “rule” creators; these rules and features are several degrees more nuanced and complex than the rules created by humans, allowing for more automated learning and more generalization of the system in processing natural language data.

Of these three, the neural network–based branch of NLP, fueled by the rise of very deep neural networks (i.e., deep learning), is the most powerful and the one that has led to many of the mainstream commercial applications of NLP in recent years.

In this book, we will focus mostly on neural network–based approaches to NLP, but we will also explore traditional machine learning approaches, too. The former has state-of-the-art performance in many NLP tasks, but traditional machine learning is still actively used in commercial applications.

We won’t focus much on rule-based NLP, but, since it has been around for decades, you will not have difficulty finding other resources on that topic. Rule-based NLP does have a role alongside the other two approaches, but usually only to deal with edge cases.

Basic NLP

Now that we’ve defined NLP, explored applications in vogue today, covered its history and inflection points, and clarified the different approaches to solve NLP tasks, let’s start our journey by performing the most basic tasks in NLP.

To perform these tasks, we will leverage spacy, one of the most popular open source libraries for commercial applications of NLP.

Before we use spacy, let’s discuss these most basic NLP tasks. As we said in the chapter introduction, they are pretty elementary, akin to teaching a child the basics of language. But, these basic NLP tasks, once combined, help us accomplish more complex tasks, which ultimately power the major NLP applications today.

Machines, like us, must walk before they run.

Defining NLP Tasks

Earlier in the chapter, we explored several NLP applications in vogue today, including the following:

  • Machine translation

  • Speech recognition

  • Question answering

  • Text summarization

  • Chatbots

  • Text-to-speech and speech-to-text conversion

  • Voicebots

  • Text and audio generation

  • Sentiment analysis

  • Information extraction

For machines to perform these complex applications, they need to perform several smaller, more bite-sized NLP tasks. In other words, to build successful commercial NLP applications, we must master the NLP tasks that serve as building blocks for those applications.

It is important to note that modern neural network–based NLP models perform these “tasks” automatically through training the neural network; that is, the neural network learns on its own how to perform some of these tasks. We, the operators, do not need to perform these tasks explicitly.

These tasks are a bit outdated for this reason, but they are still relevant today both for building greater intuition around how machines learn to work with natural language and for working with non-neural network–based NLP models. Classical, non-neural network–based NLP is still commonplace in the enterprise even if it is out of favor in state-of-the-art research today. For these reasons, it is worthwhile to learn these tasks.

Without further ado, here are some of these NLP tasks:

Tokenization

Tokenization is the process of splitting text into minimal meaningful units such as words, punctuation marks, symbols, etc. For example, the sentence “We live in Paris” could be tokenized into four tokens: We, live, in, Paris. Tokenization is typically the first step of every NLP process. Tokenization is a necessary step because the machine needs to break down natural language data into the most basic elements (or tokens) so that it can analyze each element in context of the other elements. Otherwise, it would have to analyze a long piece of text or audio as if it were one singular element, making the problem intractable for the machine. Just like a beginner student of a language breaks down a sentence into smaller bits to learn and process the information word by word, a machine needs to do the same. Even with complex numerical calculations, machines break down the problem into basic elements, performing tasks such as addition, subtraction, multiplication, and division of two sets of numbers. The major advantage that the machine has is that it can do this at a pace and scale that no human can. After tokenization breaks down the text into minimal meaningful units, the machine needs to assign metadata to each unit, providing it more information on how to process each unit in the context of other units.
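To see why this first step is not entirely trivial, consider what happens if we simply split on whitespace (a toy illustration of our own):

# Naive whitespace splitting leaves punctuation glued to words
print("We live in Paris.".split())
# ['We', 'live', 'in', 'Paris.']

The period stays attached to “Paris,” so the machine would treat “Paris.” and “Paris” as two different units; a proper tokenizer, like the one we use later in this chapter, separates the punctuation into its own token.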

Part-of-speech tagging

Part-of-speech (POS) tagging is the process of assigning word types to tokens, such as noun, pronoun, verb, adverb, adjective, conjunction, preposition, interjection, etc. For “We live in Paris,” the parts of speech are: pronoun, verb, preposition, and noun. This part-of-speech tagging gives each token a bit more metadata, making it easier for the machine to assign relationships between each token and every other token. In the sentence, “I kick the ball,” “I” and “ball” are both nouns and “kick” is a verb. Using this metadata, we can infer that “kick” somehow connects “I” and the “ball,” allowing us to form a relationship among the words. This is why the parts of speech are so important. Without knowing that some words are nouns and others are verbs, etc., the machine would not be able to map the relationships among the tokens.

Dependency parsing

Dependency parsing involves labeling the relationships between individual tokens, assigning a syntactic structure to the sentence. Once the relationships are labeled, the entire sentence can be structured as a series of relationships among sets of tokens. It is easier for the machine to process text once it has identified the inherent structure among the text. Think how difficult it would be for you to understand a sentence if you had all the words in the sentence presented to you out of order and you had no prior knowledge of the rules of grammar. In much the same way, until the machine performs dependency parsing, it has little to no knowledge of the structure of the text that it has converted into tokens. Once the structure is apparent, processing the text becomes a little bit easier.

Dependency parsing can get tricky, so the best way to understand it is to visualize the relationships using a parse tree. AllenNLP has a great dependency parsing demo, which we used to generate the dependency graph in Figure 1-1. This dependency graph allows us to visualize the relationships among the tokens. As you can see from the figure, “We” is the personal pronoun (PRP) and the nominal subject (NSUBJ) of “live,” which is the non-third person singular present verb (VBP). “Live” is connected to the prepositional phrase (PREP) “in Paris.” “In” is the preposition (IN), and “Paris” is the object of the preposition (POBJ) and is itself a singular proper noun (NNP). These relationships are very complex to model and are one reason why it is so difficult to become truly fluent in any language. Most of us apply the rules of grammar on the fly, having learned language through years of experience. A machine does the same type of analysis, but to perform natural language processing it has to crunch these operations one after the other at blazingly fast speeds.

Figure 1-1. Dependency parsing

Chunking

Chunking involves combining related tokens into a single token, creating related noun groups, related verb groups, etc. For example, “New York City” could be treated as a single token/chunk instead of as three separate tokens. Chunking is the process that makes this possible. Chunking is important to perform once the machine has broken the original text into tokens, identified the parts of speech, and tagged how each token is related to other tokens in the text. Chunking combines similar tokens together, making the overall process of analyzing the text a bit easier to perform. For example, instead of treating “New,” “York,” and “City” as three separate tokens, we can infer that they are related and group them together into a single group (or chunk). Then, we can relate the chunk to other chunks in the text. Once we’ve done this for the entire set of tokens, we will have a much smaller set of tokens and chunks to work with.

Lemmatization

Lemmatization is the process of converting words into their base forms. For example, lemmatization converts “horses” to “horse,” “slept” to “sleep,” and “biggest” to “big.” It allows the machine to simplify the text processing work it has to perform. Instead of working with a variant of the base word, it can work directly with the base word after it has performed lemmatization.

Stemming

Stemming is a process related to lemmatization, but simpler. Stemming reduces words to their word stems. Stemming algorithms are typically rule-based. For example, the word “biggest” would be reduced to “big,” but the word “slept” would not be reduced at all. Stemming sometimes results in nonsensical subwords, and we prefer lemmatization to stemming for this reason. Lemmatization returns a word to its base or canonical form, per the dictionary. But, it is a more expensive process compared to stemming, because it requires knowing the part of speech of the word to perform well.
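Since spacy, which we use later in this chapter, supports only lemmatization, here is a quick side-by-side sketch using NLTK instead. Assuming NLTK and its WordNet data are installed, this illustrates the trade-off described above:

# Compare stemming (rule-based, fast, sometimes nonsensical) with
# lemmatization (dictionary-based, needs the part of speech)
import nltk
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Exact stems depend on the stemming algorithm; some stems are
# non-words and some words are left unchanged
for word in ["horses", "slept", "biggest"]:
    print(word, "-> stem:", stemmer.stem(word))

# Lemmatization needs the part of speech to do its job well
print(lemmatizer.lemmatize("horses", pos="n"))   # horse
print(lemmatizer.lemmatize("slept", pos="v"))    # sleep
print(lemmatizer.lemmatize("biggest", pos="a"))  # big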

Note

Tokenization, part-of-speech tagging, dependency parsing, chunking, and lemmatization and stemming are tasks to process natural language for downstream NLP applications; in other words, these tasks are means to an end. Technically, the next two “tasks”—named entity recognition and entity linking—are not natural language tasks but rather are closer to NLP applications. Named entity recognition and entity linking can be ends themselves, rather than just means to an end. But, since they are also used for downstream NLP applications, we will include them in the “tasks” section here.

Named entity recognition

Named entity recognition (NER) is the process of assigning labels to known objects (or entities) such as person, organization, location, date, currency, etc. In “We live in Paris,” “Paris” would be marked as the location. NER is very powerful. It allows machines to tag the most important tokens with named entity tags, and this is very important for information retrieval applications of NLP. For example, if we want to search for former US President George W. Bush in a set of documents, we would want the machine to tag all persons in all the documents using named entity recognition, and then we would search within this list of persons to find the relevant set of documents for us to investigate further.

Entity linking

Entity linking is the process of disambiguating entities to an external database, linking text in one form to another. This is important both for entity resolution applications (e.g., deduping datasets) and information retrieval applications. In the George W. Bush example, we would want to resolve all instances of “George W. Bush” to “George W. Bush,” but not to “George H. W. Bush,” George W. Bush’s father and also a former US President. This resolution and linking to the correct version of President Bush is a tricky, thorny process, but one that a machine is capable of performing given all the textual context it has. Once a machine has performed entity recognition and linking, information retrieval becomes a cinch, which is one of the most commercially relevant applications of NLP today.

This is just a quick-and-dirty overview of the most basic NLP tasks. You will want to research these tasks further; there are ample resources available online. But, for now, this is plenty of information for us to get started.

Now that you know the basic NLP tasks that serve as building blocks for more ambitious NLP applications, let’s use the open source NLP library spacy to perform some of these basic NLP tasks.

Set Up the Programming Environment

To perform the basic NLP tasks, we first will need to set up our programming environment.

In this book, we will use one of the easiest-to-use programming environments available to data scientists today: Google’s Colaboratory. Google Colab is a free Jupyter Notebook environment that runs entirely in the cloud. In Chapter 2, we will discuss Google Colab and alternative programming environments in more detail.

We will use GitHub as our coding repository.4

If you prefer to run the code locally on your machine, we have instructions for setting up your local environment on our GitHub repo.

With that, let’s get started with coding the basic NLP tasks.

spaCy, fast.ai, and Hugging Face

In this book, we will use open source software libraries from three major players in the space (spacy, fast.ai, and Hugging Face) to perform NLP. These libraries are high-level, abstracting away a lot of the low-level work that we would otherwise have to do. Think of these libraries as beautiful wrappers for us to quickly apply NLP. All three libraries are performant and commercially viable, and you can pick any of the three to do your own applied work; you do not have to choose all three. That being said, it is wise to be well-versed in all three because they do have their respective strengths and weaknesses, and sometimes one will be quicker at adopting the latest advances in NLP than the others. Let us quickly introduce each of the three before we move forward with spacy in this chapter. In Chapter 2, we will work with fast.ai and Hugging Face.

spaCy

First released in 2015, spacy is an open source library for NLP with blazing fast performance, leveraging both Python and Cython. Prior to spacy, the Natural Language Toolkit (NLTK) was the leading NLP library among researchers, but NLTK was dated (it was initially released in 2001) and scaled poorly. spacy was the first modern NLP library intended for commercial audiences; it was built with scaling in production in mind. Now one of the go-to libraries for NLP applications in the enterprise, it supports more than 64 languages and both TensorFlow and PyTorch.

Prior to 2021, spacy 2.x relied on recurrent neural networks (RNNs), which we will cover later in the book, rather than the industry-leading transformer-based models. But, as of January 2021, spacy now supports state-of-the-art transformer-based pipelines, too, solidifying its positioning among the major NLP libraries in use today.

spacy’s creator and parent company, Explosion AI, also offers an excellent annotation platform called Prodigy, which we will use in Chapter 3. Among the three libraries, spacy is the most mature and most extensible given all the integrations its creators have created and supported over the past six-plus years. It is the one best suited for production usage today.

fast.ai

fast.ai (the company) released its open source library fastai in 2018, built on top of PyTorch. fast.ai, the company, built its reputation by offering massive open online courses (MOOCs) to coders that want a more practical introduction to machine learning, and the fastai library reflects this ethos. It has high-level components that allow coders to quickly and easily produce state-of-the-art results. At the same time, fastai has low-level components for researchers to mix and match to solve custom problems. The creators of fastai also created ULMFiT, one of the first transfer learning methods in NLP, which we will use in Chapter 2. For those who would like course work and videos alongside a fast and easy-to-use library, fastai is a great option. However, it is less mature and less suited to production work than both spacy and Hugging Face.

Hugging Face

Founded in 2016, Hugging Face is the newcomer of the three but likely the best funded and the fastest growing today; the company raised a $40 million Series B in March 2021. Hugging Face focuses exclusively on NLP and is built to help practitioners build NLP applications using state-of-the-art transformers. Its library, called transformers, is built for PyTorch and TensorFlow and supports over 100 languages. In fact, it is possible to move between PyTorch and TensorFlow for development and deployment pretty seamlessly. Hugging Face also has a pipeline API for productionizing NLP models. We are most excited for the future of Hugging Face among the three libraries and highly recommend you spend sufficient time familiarizing yourself with it.

Perform NLP Tasks Using spaCy

Let’s now use spacy for our NLP tasks.

First, we’ll install spacy. For more on installation, visit the official spaCy website. If you haven’t installed spacy already, these commands will give you everything you need (if you’re running them in a notebook, prefix each line with a ! character):

pip install -U spacy[cuda110,transformers,lookups]==3.0.3
pip install -U spacy-lookups-data==1.0.0
pip install cupy-cuda110==8.5.0
python -m spacy download en_core_web_trf

Download pretrained language models

spacy has pretrained language models for out-of-the-box use. Pretrained models are models that have been trained on lots of data already and are ready for us to perform inference with.

These pretrained language models will help us solve the basic NLP tasks, but more advanced users are welcome to fine-tune them on more specific data of your choosing. This will deliver even better performance for your specific tasks at hand.

Fine-tuning is the process of taking a pretrained model and training it some more (i.e., fine-tuning the model) on a more specific corpus of text that is relevant to the domain of the user.5 For example, if we worked in finance, we may decide to fine-tune a generic pretrained language model on financial documents to generate a finance-specific language model. This finance-specific language model would have even better performance on finance-related NLP tasks versus the generic pretrained language model.

spacy breaks out its pretrained language models into two groups: core models and starter models. The core models are general-purpose models and will help us solve the basic NLP tasks. The starter models are base models useful for transfer learning; these models have pretrained weights, which you could use to initialize and fine-tune for your own models. Think of the core models as ready-to-go models and the base models as do-it-yourself starter kits.

We will use the ready-to-go core models to perform the basic NLP tasks. Let’s first import the core model:6

# Import spacy and load the language model
import spacy
nlp = spacy.load("en_core_web_trf")

Now, let’s perform the first of the NLP tasks: tokenization.

Tokenization

Tokenization is where all NLP work begins; before the machine can process any of the text it sees, it must break the text into bite-sized tokens. Tokenization will segment text into words, punctuation marks, etc.

spacy automatically runs the entire NLP pipeline when you run a language model on the data (i.e., nlp(SENTENCE)), but to isolate just the tokenizer, we will invoke just the tokenizer using nlp.tokenizer(SENTENCE).

Then, we will print the length of the tokens and the individual tokens:

# Tokenization
sentence = nlp.tokenizer("We live in Paris.")

# Length of sentence
print("The number of tokens: ", len(sentence))

# Print individual words (i.e., tokens)
print("The tokens: ")
for words in sentence:
    print(words)
The number of tokens:  5
The tokens:
We
live
in
Paris
.

The length of tokens is 5, and the individual tokens are "We, live, in, Paris, .". The period at the end of the sentence is its own token.

Note that the spacy tokenizer will treat new lines (\n), tabs (\t), and whitespace characters beyond a single space ("  ") as tokens.
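Here is a quick way to see that behavior for yourself (the string below is a toy example of our own; printing the repr makes the whitespace tokens visible):

# Inspect how the tokenizer handles extra whitespace
for token in nlp.tokenizer("We live \t in\nParis."):
    print(repr(token.text))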

Let’s try the tokenizer on a slightly more complex example.

We will load in publicly available Jeopardy questions and then run the entire spacy language model on a few of the questions:

import pandas as pd
import os
cwd = os.getcwd()

# Import Jeopardy Questions
data = pd.read_csv(cwd+'/data/jeopardy_questions/jeopardy_questions.csv')
data = pd.DataFrame(data=data)

# Lowercase, strip whitespace, and view column names
data.columns = map(lambda x: x.lower().strip(), data.columns)

# Reduce size of data
data = data[0:1000]

# Tokenize Jeopardy Questions
data["question_tokens"] = data["question"].apply(lambda x: nlp(x))

We have now created tokens for each of the 1,000 Jeopardy questions.

To make sure this worked right, let’s view the first question and the tokens created:

# View first question
example_question = data.question[0]
example_question_tokens = data.question_tokens[0]
print("The first questions is:")
print(example_question)
The first questions is:
For the last 8 years of his life, Galileo was under house arrest for espousing
 > this man's theory
# Print individual tokens of first question
print("The tokens from the first question are:")
for tokens in example_question_tokens:
    print(tokens)
The tokens from the first question are:
For
the
last
8
years
of
his
life
,
Galileo
was
under
house
arrest
for
espousing
this
man
's
theory

This is the first basic NLP task that machines perform; now we can move on to the other NLP tasks. Well done!

Part-of-speech tagging

After tokenization, machines need to tag each token with relevant metadata, such as the part-of-speech of each token. This is what we will perform now.

Since we applied the entire spacy language model to the Jeopardy questions, the tokens generated already have a lot of the meaningful attributes/metadata we care about.

spacy uses pretrained models to predict the part of speech of each token. We loaded the English language model earlier using this code: spacy.load("en_core_web_trf").

Let’s take a look at the POS tagging attributes for the tokens in the first question:

# Print Part-of-speech tags for tokens in the first question
print("Here are the Part-of-speech tags for each token in the first question:")
for token in example_question_tokens:
    print(token.text,token.pos_, spacy.explain(token.pos_))
Here are the Part-of-speech tags for each token in the first question:
For ADP adposition
the DET determiner
last ADJ adjective
8 NUM numeral
years NOUN noun
of ADP adposition
his PRON pronoun
life NOUN noun
, PUNCT punctuation
Galileo PROPN proper noun
was AUX auxiliary
under ADP adposition
house NOUN noun
arrest NOUN noun
for ADP adposition
espousing VERB verb
this DET determiner
man NOUN noun
's PART particle
theory NOUN noun

The first token “For” is marked as an adposition (e.g., in, to, during); the second token “the” is a determiner (e.g., a, an, the); the third token “last” is an adjective, the fourth token “8” is a numeral; the fifth token “years” is a noun; and so on.

Table 1-1 displays the full list of all possible POS tags, including descriptions and examples of each.7

Table 1-1. Universal part-of-speech tags
POS     Description                 Example
ADJ     Adjective                   Big, old, green, incomprehensible, first
ADP     Adposition                  In, to, during
ADV     Adverb                      Very, tomorrow, down, where, there
AUX     Auxiliary                   Is, has (done), will (do), should (do)
CONJ    Conjunction                 And, or, but
CCONJ   Coordinating conjunction    And, or, but
DET     Determiner                  A, an, the
INTJ    Interjection                Psst, ouch, bravo, hello
NOUN    Noun                        Girl, cat, tree, air, beauty
NUM     Numeral                     1, 2017, one, seventy-seven, IV, MMXIV
PART    Particle                    ’s, not
PRON    Pronoun                     I, you, he, she, myself, themselves, somebody
PROPN   Proper noun                 Mary, John, London, NATO, HBO
PUNCT   Punctuation                 ., (, ), ?
SCONJ   Subordinating conjunction   If, while, that
SYM     Symbol                      ×, %, §, ©, +, -, ÷, =, :)
VERB    Verb                        Run, runs, running, eat, ate, eating
X       Other                       Sfpksdpsxmsa
SPACE   Space

Now that we have used the tokenizer to create tokens for each sentence and part-of-speech tagging to tag each token with meaningful attributes, let’s label each token’s relationship with other tokens in the sentence. In other words, let’s find the inherent structure among the tokens given the part-of-speech metadata we have generated.

Dependency parsing

Dependency parsing is the process of finding these relationships among the tokens. Once we have performed this step, we will be able to visualize the relationships using a dependency parsing graph.

First, let’s view the dependency parsing tags for each of the tokens in the first question:

# Print Dependency Parsing tags for tokens in the first question
for token in example_question_tokens:
    print(token.text,token.dep_, spacy.explain(token.dep_))
For prep prepositional modifier
the det determiner
last amod adjectival modifier
8 nummod numeric modifier
years pobj object of preposition
of prep prepositional modifier
his poss possession modifier
life pobj object of preposition
, punct punctuation
Galileo nsubj nominal subject
was ROOT None
under prep prepositional modifier
house compound compound
arrest pobj object of preposition
for prep prepositional modifier
espousing pcomp complement of preposition
this det determiner
man poss possession modifier
's case case marking
theory dobj direct object

The first token “For” is marked as a prepositional modifier; the second token “the” is a determiner; the third token “last” is an adjectival modifier; the fourth token “8” is a numeric modifier; the fifth token “years” is the object of preposition; and so on.

Table 1-2 lists all the possible syntactic dependency tags, including descriptions and examples of each.8

Table 1-2. Universal dependency labels
Label        Description
acl          Clausal modifier of noun (adjectival clause)
advcl        Adverbial clause modifier
advmod       Adverbial modifier
amod         Adjectival modifier
appos        Appositional modifier
aux          Auxiliary
case         Case marking
cc           Coordinating conjunction
ccomp        Clausal complement
clf          Classifier
compound     Compound
conj         Conjunction
cop          Copula
csubj        Clausal subject
dep          Unspecified dependency
det          Determiner
discourse    Discourse element
dislocated   Dislocated element
expl         Expletive
fixed        Fixed multiword expression
flat         Flat multiword expression
goeswith     Goes with
iobj         Indirect object
list         List
mark         Marker
nmod         Nominal modifier
nsubj        Nominal subject
nummod       Numeric modifier
obj          Object
obl          Oblique nominal
orphan       Orphan
parataxis    Parataxis
punct        Punctuation
reparandum   Overridden disfluency
root         Root
vocative     Vocative
xcomp        Open clausal complement

These tags help define the relationships among the tokens; using these tags, we can understand the relationship structure among the tokens that make up the sentence.

Dependency parsing is hard to unpack, so let’s use spacy’s built-in visualizer to get a better sense of the dependencies across the tokens:

# Visualize the dependency parse
from spacy import displacy

displacy.render(example_question_tokens, style='dep',
                jupyter=True, options={'distance': 120})

Figure 1-2 displays the first part of the sentence parsed.

Figure 1-2. Dependency parsing example, part 1

Notice the importance of “For” and “years” in the prepositional phrase—multiple tokens map to these two.

Figure 1-3 displays the second part of the sentence parsed.

Figure 1-3. Dependency parsing example, part 2

The token “was” connects to the nominal subject “Galileo” and two prepositional phrases: “under house arrest” and “for espousing this man’s theory.”

These figures show how certain tokens can be grouped together and how the groups of tokens are related to one another. This is an essential step in NLP. First, the machine breaks the sentence apart into tokens. Then it assigns metadata to each token (e.g., part of speech), and then it connects the tokens based on their relationship to one another.

Let’s move on to chunking, which is another form of grouping of related tokens.

Chunking

Let’s perform chunking on the sentence “My parents live in New York City”:

# Print tokens for example sentence without chunking
for token in nlp("My parents live in New York City."):
    print(token.text)
My
parents
live
in
New
York
City
.

Chunking combines related tokens into a single token.

With chunking, the spacy language model will identify “My parents” and “New York City” as noun chunks, much like a human would when parsing a sentence:

# Print chunks for example sentence
for chunk in nlp("My parents live in New York City.").noun_chunks:
      print(chunk.text)
My parents
New York City

By grouping related tokens into chunks, the machine will have an easier time processing the sentence. Instead of viewing each token in isolation, the machine now recognizes that certain tokens are related to others, a necessary step in NLP.

Lemmatization

Now, let’s go a step further and perform lemmatization. If you recall, lemmatization is the process of converting words into their base (or canonical) forms; for example, “horses” to “horse,” “slept” to “sleep,” and “biggest” to “big.” Just like part-of-speech tagging, dependency parsing, and chunking, lemmatization helps the machine “process” the tokens. With lemmatization, the machine is able to simplify the tokens by converting some of them into their most basic forms.

Stemming is a related concept, but stemming is simpler. Stemming reduces words to their word stems, often using a rule-based approach.

Lemmatization is a more difficult process but generally results in better outputs; stemming sometimes creates outputs that are nonsensical (nonwords). In fact, spacy does not even support stemming; it supports only lemmatization.

We will create a DataFrame to store and view the original and lemmatized versions of tokens side-by-side:

# Print lemmatization for tokens in the first question
lemmatization = pd.DataFrame(data=[], \
  columns=["original","lemmatized"])
i = 0
for token in example_question_tokens:
    lemmatization.loc[i,"original"] = token.text
    lemmatization.loc[i,"lemmatized"] = token.lemma_
    i = i+1

lemmatization
Original Lemmatized
0 For for
1 the the
2 last last
3 8 8
4 years year
5 of of
6 his his
7 life life
8 , ,
9 Galileo Galileo
10 was be
11 under under
12 house house
13 arrest arrest
14 for for
15 espousing espouse
16 this this
17 man man
18 ’s ’s
19 theory theory

As you can see, words such as “years,” “was,” and “espousing” are lemmatized to their base forms. The other tokens are already their base forms, so the lemmatized output is the same as the original. Lemmatization simplifies tokens into their simplest forms, where possible, to simplify the process for the machine to parse sentences.

Named entity recognition

When combined together, everything we’ve done so far—tokenization, part-of-speech tagging, dependency parsing, chunking, and lemmatization—makes it possible for machines to perform more complex NLP tasks. One example of a complex NLP task is named entity recognition (also known as “NER”), which parses notable entities in natural language and labels them with their appropriate class label. For example, NER labels names of people with the label “Person” and names of cities with the label “Location.”

NER is possible only because the machine is able to perform text classification using the metadata generated by the earlier NLP tasks we’ve covered. Without the metadata from the earlier NLP tasks, the machine would have a very difficult time performing NER because it would not have enough features to classify names of people as “Person,” names of cities as “Location,” etc.

NER is a valuable NLP task because many organizations need to process large volumes of documents, and the simple act of labeling notable entities with the appropriate class label is a meaningful first step in analyzing the textual information, particularly for information retrieval tasks (e.g., finding the information you need as quickly as possible).

These documents include contracts, leases, real estate purchase agreements, financial reports, news articles, etc. Before named entity recognition, humans would have had to label such entities by hand (at many companies, they still do). Now, named entity recognition provides an algorithmic way to perform this task.

spacy’s NER model is able to label many types of notable entities (“real-world objects”). Table 1-3 displays the current set of entity types the spacy model is able to recognize.

Table 1-3. spaCy NER entity types
Type         Description
PERSON       People, including fictional
NORP         Nationalities or religious or political groups
FAC          Buildings, airports, highways, bridges, etc.
ORG          Companies, agencies, institutions, etc.
GPE          Countries, cities, states
LOC          Non-GPE locations, mountain ranges, bodies of water
PRODUCT      Objects, vehicles, foods, etc. (not services)
EVENT        Named hurricanes, battles, wars, sports events, etc.
WORK_OF_ART  Titles of books, songs, etc.
LAW          Named documents made into laws
LANGUAGE     Any named language
DATE         Absolute or relative dates or periods
TIME         Times smaller than a day
PERCENT      Percentage, including “%”
MONEY        Monetary values, including unit
QUANTITY     Measurements, as of weight or distance
ORDINAL      “First,” “second,” etc.
CARDINAL     Numerals that do not fall under another type

It’s very important to note that NER is, at its very core, a classification model. Using the context around the token of interest, the NER model predicts the entity type of the token of interest. NER is a statistical model, and the corpus of data the model has trained on matters a lot. For better performance, developers in the enterprise fine-tune the base NER model on their particular corpus of documents.

Let’s try the spacy NER model. We will perform NER on the first sentence of the Wikipedia article (accessed March 2021) describing George Washington, the first president of the United States. Here’s the sentence:

George Washington was an American political leader, military general, statesman, and Founding Father who served as the first president of the United States from 1789 to 1797.

As you can see, there are several real-world objects to recognize here, including “George Washington” and “the United States”:

# Print NER results
example_sentence = "George Washington was an American political leader, \
military general, statesman, and Founding Father who served as the \
first president of the United States from 1789 to 1797.\n"

print(example_sentence)

print("Text Start End Label")
doc = nlp(example_sentence)
for token in doc.ents:
    print(token.text, token.start_char, token.end_char, token.label_)
George Washington was an American political leader, military general, statesman,
 > and Founding Father who served as the first president of the United States
 > from 1789 to 1797.

Text Start End Label
George Washington 0 17 PERSON
American 25 33 NORP
first 119 124 ORDINAL
the United States 138 155 GPE
1789 to 1797 161 173 DATE

There are four elements to the output. First, the text that comprises the entity; note that the text could be a single token or a set of tokens that makes up the entire entity. Second, the start position of the text in the sentence. Third, the end position of the text in the sentence. Fourth, the label of the entity.

To make the value of NER even more apparent, let’s use spacy’s built-in visualizer to visualize this sentence with the relevant entity labels:

# Visualize NER results
displacy.render(doc, style='ent', jupyter=True, options={'distance': 120})

As you can see in Figure 1-4, the spacy NER model does a great job labeling the entities. “George Washington” is a person, and the text starts at index 0 and ends at index 17. His nationality is “American.” “First” is labeled as an ordinal number, “the United States” is a geopolitical entity, and “1789 to 1797” is a date.

Figure 1-4. Visualize NER results

The sentence is beautifully rendered with color-coded labels based on the entity type. This is a powerful and meaningful NLP task; you can see how doing this machine-driven labeling at scale without humans could add a lot of value to enterprises that work with a lot of textual data. Of course, to train such a model in the first place, you do need to have a lot of humans that annotate textual data. And you may need humans in the loop to deal with edge cases in production. You are never really human-free, but perhaps you could ultimately get to a mostly human-free process.

Named entity linking

Another complex yet very useful NLP task in the enterprise is named entity linking (NEL). NEL resolves a textual entity to a unique identifier in a knowledge base. In other words, NEL resolves the entity in your source text to a canonical version in a knowledge database. Let’s try to link all entities that are named persons to Google’s Knowledge Graph. We will make a Google Knowledge Graph API call to perform this named entity linking.9

Here is the function to perform this API call:

# Import libraries
import requests

# Define Google Knowledge Graph API result function
# (key is your Google Knowledge Graph API key; see note 9)
def returnGraphResult(query, key, entityType):
    if entityType == "PERSON":
        google = ("https://kgsearch.googleapis.com/v1/entities:search"
                  f"?query={query}&key={key}")
        resp = requests.get(google)
        result = resp.json()['itemListElement'][0]['result']
        url = result['detailedDescription']['url']
        description = result['detailedDescription']['articleBody']
        return url, description
    else:
        return "no_match", "no_match"

Let’s perform entity linking on our George Washington example:

# Print Wikipedia descriptions and URLs for entities
for token in doc.ents:
    url, description = returnGraphResult(token.text, key, token.label_)
    print(token.text, token.label_, url, description)

Here is the output:

George Washington

PERSON https://en.wikipedia.org/wiki/George_Washington George Washington was an American political leader, military general, statesman, and Founding Father, who also served as the first President of the United States from 1789 to 1797.

American

NORP no_match no_match

first

ORDINAL no_match no_match

the United States

GPE no_match no_match

1789 to 1797

DATE no_match no_match

As you can see, George Washington is a PERSON and is linked successfully to the “George Washington” Wikipedia URL and description. The rest are not of entity type PERSON and are not linked. If desired, we could link the other named entities, such as the United States, to relevant Wikipedia articles, too.

NEL has many use cases in the enterprise, especially since the need to link information to a taxonomy comes up over and over again (e.g., linking stock tickers, pharmaceutical drugs, publicly traded companies, consumer products, etc., to canonical versions in a taxonomy or knowledge base).

Conclusion

In this chapter, we defined NLP and covered its origins, including some of the commercial applications that are popular in the enterprise today. Then, we defined some basic NLP tasks and performed them using the very performant NLP library known as spacy. You should spend more time using spacy, including reviewing documentation that is available online, to hone what you have learned in this chapter.

While the tasks we performed are very basic, when combined, NLP tasks such as tokenization, part-of-speech tagging, dependency parsing, chunking, and lemmatization make it possible for machines to perform even more complex NLP tasks such as NER and entity linking. We hope our walkthrough of these tasks helped you build some intuition on just how machines are able to unpack and process natural language, demystifying some of the space.

Today, most complex NLP applications do not require practitioners to perform these tasks manually; rather, neural networks learn to perform these tasks on their own. In the next chapter, we will dive into some of the state-of-the-art approaches using the Transformer architecture and large, pretrained language models from fast.ai and Hugging Face to show just how easy it is to get up and running with NLP today. Later in the book, we will return to the basics (which we just teased you with briefly in this chapter) and help you build more of your foundational knowledge of NLP.

1 One of the major leaps in human history was the formation of a human (aka “natural”) language, which allowed humans to communicate with one another, form groups, and operate as collective units of people instead of as solo individuals.

2 For more, read The New York Times Magazine article from 2016 on Google’s neural machine translation.

3 For more, refer to the Wikipedia article about the Turing test.

4 For more on GitHub, visit the GitHub website and Google Colab’s instructions on integrating with GitHub.

5 The operation of taking a model developed for one task and using it as a starting point for a model on a second task is known as transfer learning.

6 A spacy language model is not the same thing as what we generally refer to in the NLP literature as a language model. For more information on language modeling, see Chapter 2.

7 Visit the spacy POS documentation for more.

8 Visit the spacy documentation for more.

9 You’ll need your own Google Knowledge Graph API key to perform this API call on your machine. We will perform this using our own API key for illustrative purposes.
