Chapter 1. The Basics

It seems as though every day there are new and exciting problems that people have taught computers to solve, from how to win at chess or Jeopardy to determining shortest-path driving directions. But there are still many tasks that computers cannot perform, particularly in the realm of understanding human language. Statistical methods have proven to be an effective way to approach these problems, but machine learning (ML) techniques often work better when the algorithms are provided with pointers to what is relevant about a dataset, rather than just massive amounts of data. When discussing natural language, these pointers often come in the form of annotations—metadata that provides additional information about the text. However, in order to teach a computer effectively, it’s important to give it the right data, and for it to have enough data to learn from. The purpose of this book is to provide you with the tools to create good data for your own ML task. In this chapter we will cover:

  • Why annotation is an important tool for linguists and computer scientists alike

  • How corpus linguistics became the field that it is today

  • The different areas of linguistics and how they relate to annotation and ML tasks

  • What a corpus is, and what makes a corpus balanced

  • How some classic ML problems are represented with annotations

  • The basics of the annotation development cycle

The Importance of Language Annotation

Everyone knows that the Internet is an amazing resource for all sorts of information that can teach you just about anything: juggling, programming, playing an instrument, and so on. However, there is another layer of information that the Internet contains, and that is how all those lessons (and blogs, forums, tweets, etc.) are being communicated. The Web contains information in all forms of media—including texts, images, movies, and sounds—and language is the communication medium that allows people to understand the content, and to link the content to other media. However, while computers are excellent at delivering this information to interested users, they are much less adept at understanding language itself.

Theoretical and computational linguistics are focused on unraveling the deeper nature of language and capturing the computational properties of linguistic structures. Human language technologies (HLTs) attempt to adopt these insights and algorithms and turn them into functioning, high-performance programs that can impact the ways we interact with computers using language. With more and more people using the Internet every day, the amount of linguistic data available to researchers has increased significantly, allowing linguistic modeling problems to be viewed as ML tasks, rather than limited to the relatively small amounts of data that humans are able to process on their own.

However, it is not enough to simply provide a computer with a large amount of data and expect it to learn to speak—the data has to be prepared in such a way that the computer can more easily find patterns and inferences. This is usually done by adding relevant metadata to a dataset. Any metadata tag used to mark up elements of the dataset is called an annotation over the input. However, in order for the algorithms to learn efficiently and effectively, the annotation done on the data must be accurate, and relevant to the task the machine is being asked to perform. For this reason, the discipline of language annotation is a critical link in developing intelligent human language technologies.


Giving an ML algorithm too much information can slow it down and lead to inaccurate results, or result in the algorithm being so molded to the training data that it becomes “overfit” and provides less accurate results than it might otherwise on new data. It’s important to think carefully about what you are trying to accomplish, and what information is most relevant to that goal. Later in the book we will give examples of how to find that information, and how to determine how well your algorithm is performing at the task you’ve set for it.

Datasets of natural language are referred to as corpora, and a single set of data annotated with the same specification is called an annotated corpus. Annotated corpora can be used to train ML algorithms. In this chapter we will define what a corpus is, explain what is meant by an annotation, and describe the methodology used for enriching a linguistic data collection with annotations for machine learning.

The Layers of Linguistic Description

While it is not necessary to have formal linguistic training in order to create an annotated corpus, we will be drawing on examples of many different types of annotation tasks, and you will find this book more helpful if you have a basic understanding of the different aspects of language that are studied and used for annotations. Grammar is the name typically given to the mechanisms responsible for creating well-formed structures in language. Most linguists view grammar as itself consisting of distinct modules or systems, either by cognitive design or for descriptive convenience. These areas usually include syntax, semantics, morphology, phonology (and phonetics), and the lexicon. Areas beyond grammar that relate to how language is embedded in human activity include discourse, pragmatics, and text theory. The following list provides more detailed descriptions of these areas:


Syntax

The study of how words are combined to form sentences. This includes examining parts of speech and how they combine to make larger constructions.


Semantics

The study of meaning in language. Semantics examines the relations between words and what they are being used to represent.


Morphology

The study of units of meaning in a language. A morpheme is the smallest unit of language that has meaning or function, a definition that includes words, prefixes, suffixes, and other word structures that impart meaning.


Phonology

The study of the sound patterns of a particular language. Aspects of study include determining which phones are significant and have meaning (i.e., the phonemes); how syllables are structured and combined; and what features are needed to describe the discrete units (segments) in the language, and how they are interpreted.


Phonetics

The study of the sounds of human speech, and how they are made and perceived. A phone is the term for an individual sound, and is essentially the smallest unit of human speech.


Lexicon

The study of the words and phrases used in a language, that is, a language’s vocabulary.

Discourse analysis

The study of exchanges of information, usually in the form of conversations, and particularly the flow of information across sentence boundaries.


Pragmatics

The study of how the context of text affects the meaning of an expression, and what information is necessary to infer a hidden or presupposed meaning.

Text structure analysis

The study of how narratives and other textual styles are constructed to make larger textual compositions.

Throughout this book we will present examples of annotation projects that make use of various combinations of the different concepts outlined in the preceding list.

What Is Natural Language Processing?

Natural Language Processing (NLP) is a field of computer science and engineering that has developed from the study of language and computational linguistics within the field of Artificial Intelligence. The goals of NLP are to design and build applications that facilitate human interaction with machines and other devices through the use of natural language. Some of the major areas of NLP include:

Question Answering Systems (QAS)

Imagine being able to actually ask your computer or your phone what time your favorite restaurant in New York stops serving dinner on Friday nights. Rather than typing in the (still) clumsy set of keywords into a search browser window, you could simply ask in plain, natural language—your own, whether it’s English, Mandarin, or Spanish. (While systems such as Siri for the iPhone are a good start to this process, it’s clear that Siri doesn’t fully understand all of natural language, just a subset of key phrases.)


Summarization

This area includes applications that can take a collection of documents or emails and produce a coherent summary of their content. Such programs also aim to provide snap “elevator summaries” of longer documents, and possibly even turn them into slide presentations.

Machine Translation

The holy grail of NLP applications, this was the first major area of research and engineering in the field. Programs such as Google Translate are getting better and better, but the real killer app will be the BabelFish that translates in real time when you’re looking for the right train to catch in Beijing.

Speech Recognition

This is one of the most difficult problems in NLP. There has been great progress in building models that can be used on your phone or computer to recognize spoken language utterances that are questions and commands. Unfortunately, while these Automatic Speech Recognition (ASR) systems are ubiquitous, they work best in narrowly defined domains and don’t allow the speaker to stray from the expected scripted input (“Please say or type your card number now”).

Document classification

This is one of the most successful areas of NLP, wherein the task is to identify in which category (or bin) a document should be placed. This has proved to be enormously useful for applications such as spam filtering, news article classification, and movie reviews, among others. One reason this has had such a big impact is the relative simplicity of the learning models needed for training the algorithms that do the classification.

As we mentioned in the Preface, the Natural Language Toolkit (NLTK), described in the O’Reilly book Natural Language Processing with Python, is a wonderful introduction to the techniques necessary to build many of the applications described in the preceding list. One of the goals of this book is to give you the knowledge to build specialized language corpora (i.e., training and test datasets) that are necessary for developing such applications.

A Brief History of Corpus Linguistics

In the mid-20th century, linguistics was practiced primarily as a descriptive field, used to study structural properties within a language and typological variations between languages. This work resulted in fairly sophisticated models of the different informational components comprising linguistic utterances. As in the other social sciences, the collection and analysis of data was also being subjected to quantitative techniques from statistics. In the 1940s, linguists such as Bloomfield were starting to think that language could be explained in probabilistic and behaviorist terms. Empirical and statistical methods became popular in the 1950s, and Shannon’s information-theoretic view of language analysis appeared to provide a solid quantitative approach for modeling qualitative descriptions of linguistic structure.

Unfortunately, the development of statistical and quantitative methods for linguistic analysis hit a brick wall in the 1950s. This was due primarily to two factors. First, there was the problem of data availability. One of the problems with applying statistical methods to the language data at the time was that the datasets were generally so small that it was not possible to make interesting statistical generalizations over large numbers of linguistic phenomena. Second, and perhaps more important, there was a general shift in the social sciences from data-oriented descriptions of human behavior to introspective modeling of cognitive functions.

As part of this new attitude toward human activity, the linguist Noam Chomsky focused on both a formal methodology and a theory of linguistics that not only ignored quantitative language data, but also claimed that it was misleading for formulating models of language behavior (Chomsky 1957).

This view was very influential throughout the 1960s and 1970s, largely because the formal approach was able to develop extremely sophisticated rule-based language models using mostly introspective (or self-generated) data. This was a very attractive alternative to trying to create statistical language models on the basis of still relatively small datasets of linguistic utterances from the existing corpora in the field. Formal modeling and rule-based generalizations, in fact, have always been an integral step in theory formation, and in this respect, Chomsky’s approach on how to do linguistics has yielded rich and elaborate models of language.

Theory construction, however, also involves testing and evaluating your hypotheses against observed phenomena. As more linguistic data has gradually become available, something significant has changed in the way linguists look at data. The phenomena are now observable in millions of texts and billions of sentences over the Web, and this has left little doubt that quantitative techniques can be meaningfully applied to both test and create the language models correlated with the datasets. This has given rise to the modern age of corpus linguistics. As a result, the corpus is the entry point from which all linguistic analysis will be done in the future.


You gotta have data! As philosopher of science Thomas Kuhn said: “When measurement departs from theory, it is likely to yield mere numbers, and their very neutrality makes them particularly sterile as a source of remedial suggestions. But numbers register the departure from theory with an authority and finesse that no qualitative technique can duplicate, and that departure is often enough to start a search” (Kuhn 1961).

The assembly and collection of texts into more coherent datasets that we can call corpora started in the 1960s.

Some of the most important corpora are listed in Table 1-1.

Table 1-1. A sampling of important corpora
Name of corpus | Year published | Size | Collection contents
British National Corpus (BNC) | 1991–1994 | 100 million words | Cross section of British English, spoken and written
American National Corpus (ANC) | 2003 | 22 million words | Spoken and written texts
Corpus of Contemporary American English (COCA) | 2008 | 425 million words | Spoken, fiction, popular magazine, and academic texts

What Is a Corpus?

A corpus is a collection of machine-readable texts that have been produced in a natural communicative setting. Corpora are typically sampled to be representative and balanced with respect to particular factors, such as genre: newspaper articles, literary fiction, speech, blogs and diaries, and legal documents. A corpus is said to be “representative of a language variety” if the content of the corpus can be generalized to that variety (Leech 1991).

This is not as circular as it may sound. Basically, if the content of the corpus, defined by specifications of linguistic phenomena examined or studied, reflects that of the larger population from which it is taken, then we can say that it “represents that language variety.”

The notion of a corpus being balanced is an idea that has been around since the 1980s, but it is still a rather fuzzy notion and difficult to define strictly. Atkins and Ostler (1992) propose a formulation of attributes that can be used to define the types of text, and thereby contribute to creating a balanced corpus.

Two well-known corpora can be compared for their effort to balance the content of the texts. The Penn TreeBank (Marcus et al. 1993) is a 4.5-million-word corpus that contains texts from four sources: the Wall Street Journal, the Brown Corpus, ATIS, and the Switchboard Corpus. By contrast, the BNC is a 100-million-word corpus that contains texts from a broad range of genres, domains, and media.

The most diverse subcorpus within the Penn TreeBank is the Brown Corpus, which is a 1-million-word corpus consisting of 500 English text samples, each one approximately 2,000 words. It was collected and compiled by Henry Kucera and W. Nelson Francis of Brown University (hence its name) from a broad range of contemporary American English in 1961. In 1967, they released a fairly extensive statistical analysis of the word frequencies and behavior within the corpus, the first of its kind in print, as well as the Brown Corpus Manual (Francis and Kucera 1964).


There has never been any doubt that all linguistic analysis must be grounded on specific datasets. What has recently emerged is the realization that all linguistics will be bound to corpus-oriented techniques, one way or the other. Corpora are becoming the standard data exchange format for discussing linguistic observations and theoretical generalizations, and certainly for evaluation of systems, both statistical and rule-based.

Table 1-2 shows how the Brown Corpus compares to other corpora that are also still in use.

Table 1-2. Comparing the Brown Corpus to other corpora
Brown Corpus | 500 English text samples; 1 million words | Part-of-speech tagged data; 80 different tags used
Child Language Data Exchange System (CHILDES) | 20 languages represented; thousands of texts | Phonetic transcriptions of conversations with children from around the world
Lancaster-Oslo-Bergen Corpus | 500 British English text samples, around 2,000 words each | Part-of-speech tagged data; a British version of the Brown Corpus

Looking at the way the files of the Brown Corpus can be categorized gives us an idea of what sorts of data were used to represent the English language. The top two general data categories are informative, with 374 samples, and imaginative, with 126 samples.

These two domains are further distinguished into the following topic areas:


Informative

Press: reportage (44), Press: editorial (27), Press: reviews (17), Religion (17), Skills and Hobbies (36), Popular Lore (48), Belles Lettres, Biography, Memoirs (75), Miscellaneous (30), Natural Sciences (12), Medicine (5), Mathematics (4), Social and Behavioral Sciences (14), Political Science, Law, Education (15), Humanities (18), Technology and Engineering (12)


Imaginative

General Fiction (29), Mystery and Detective Fiction (24), Science Fiction (6), Adventure and Western Fiction (29), Romance and Love Story (29), Humor (9)
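As a quick check on the figures just listed, the following minimal sketch adds up the per-topic sample counts and reports each domain's share of the corpus. The dictionary names are our own for illustration; the numbers are copied from the lists above.

```python
# Sample counts per topic area, copied from the lists above.
informative = {
    "Press: reportage": 44, "Press: editorial": 27, "Press: reviews": 17,
    "Religion": 17, "Skills and Hobbies": 36, "Popular Lore": 48,
    "Belles Lettres, Biography, Memoirs": 75, "Miscellaneous": 30,
    "Natural Sciences": 12, "Medicine": 5, "Mathematics": 4,
    "Social and Behavioral Sciences": 14,
    "Political Science, Law, Education": 15, "Humanities": 18,
    "Technology and Engineering": 12,
}
imaginative = {
    "General Fiction": 29, "Mystery and Detective Fiction": 24,
    "Science Fiction": 6, "Adventure and Western Fiction": 29,
    "Romance and Love Story": 29, "Humor": 9,
}

total = sum(informative.values()) + sum(imaginative.values())
for name, domain in (("informative", informative), ("imaginative", imaginative)):
    count = sum(domain.values())
    # Report each domain's sample count and its share of the whole corpus.
    print(f"{name}: {count} samples ({count / total:.1%} of {total})")
```

The totals agree with the 374 informative and 126 imaginative samples mentioned earlier, and the shares make the uneven coverage of the two domains easy to see.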

Similarly, the BNC can be categorized into informative and imaginative prose, and further into subdomains such as educational, public, business, and so on. A further discussion of how the BNC can be categorized can be found in Distributions Within Corpora.

As you can see from the numbers given for the Brown Corpus, not every category is equally represented, which seems to be a violation of the rule of “representative and balanced” that we discussed before. However, these corpora were not assembled with a specific task in mind; rather, they were meant to represent written and spoken language as a whole. Because of this, they attempt to embody a large cross section of existing texts, though whether they succeed in representing percentages of texts in the world is debatable (but also not terribly important).

For your own corpus, you may find yourself wanting to cover a wide variety of text, but it is likely that you will have a more specific task domain, and so your potential corpus will not need to include the full range of human expression. The Switchboard Corpus is an example of a corpus that was collected for a very specific purpose—Speech Recognition for phone operation—and so was balanced and representative of the different sexes and all different dialects in the United States.

Early Use of Corpora

One of the most common uses of corpora from the early days was the construction of concordances. These are alphabetical listings of the words in an article or text collection with references given to the passages in which they occur. Concordances position a word within its context, and thereby make it much easier to study how it is used in a language, both syntactically and semantically. In the 1950s and 1960s, programs were written to automatically create concordances for the contents of a collection, and the results of these automatically created indexes were called “Key Word in Context” indexes, or KWIC indexes. A KWIC index is an index created by sorting the words in an article or a larger collection such as a corpus, and aligning them in a format so that they can be searched alphabetically in the index. This was a relatively efficient means for searching a collection before full-text document search became available.

The way a KWIC index works is as follows. The input to a KWIC system is a file or collection structured as a sequence of lines. The output is a sequence of lines, circularly shifted and presented in alphabetical order of the first word. For an example, consider a short article of two sentences, shown in Figure 1-1 with the KWIC index output that is generated.

Figure 1-1. Example of a KWIC index
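The circular-shift procedure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not a reconstruction of any historical KWIC program: the function name kwic_index and the two-line sample "article" are our own, and we assume that sorting should ignore capitalization.

```python
def kwic_index(lines):
    """Return every circular shift of every line, in alphabetical order."""
    shifts = []
    for line in lines:
        words = line.split()
        for i in range(len(words)):
            # Rotate the line so that word i comes first; the words that
            # preceded it wrap around to the end.
            shifts.append(" ".join(words[i:] + words[:i]))
    # Sort case-insensitively so that capitalization does not affect the order.
    return sorted(shifts, key=str.lower)


# A made-up two-line "article" used only for illustration.
article = ["The cat sat", "A dog barked"]
for entry in kwic_index(article):
    print(entry)
```

Each word of the input ends up at the start of exactly one output line, so looking up a word alphabetically leads directly to the line in which it occurred.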

Another benefit of concordancing is that, by displaying the keyword in its context, you can visually inspect how the word is being used in a given sentence. To take a specific example, consider the different meanings of the English verb treat. Specifically, let’s look at the first two senses within sense (1) from the dictionary entry shown in Figure 1-2.

Figure 1-2. Senses of the word “treat”

Now let’s look at the concordances compiled for this verb from the BNC, as differentiated by these two senses.


These concordances were compiled using the Word Sketch Engine, by the lexicographer Patrick Hanks, and are part of a large resource of sentence patterns using a technique called Corpus Pattern Analysis (Pustejovsky et al. 2004; Hanks and Pustejovsky 2005).

What is striking when one examines the concordance entries for each of these senses is the fact that the contexts are so distinct. These are presented in Figures 1-3 and 1-4.

Figure 1-3. Sense (1a) for the verb “treat”
Figure 1-4. Sense (1b) for the verb “treat”


The NLTK provides functionality for creating concordances. The easiest way to make a concordance is to simply load the preprocessed texts into the NLTK and then use the concordance function, like this:

>>> import nltk
>>> from nltk.book import *
>>> text6.concordance("Ni")

If you have your own set of data for which you would like to create a concordance, then the process is a little more involved: you will need to read in your files and use the NLTK functions to process them before you can create your own concordance. Here is some sample code for a corpus of text files (replace the directory location with your own folder of text files):

>>> import nltk
>>> corpus_loc = '/home/me/corpus/'
>>> docs = nltk.corpus.PlaintextCorpusReader(corpus_loc, r'.*\.txt')

You can see if the files were read by checking what file IDs are present:

>>> print(docs.fileids())

Next, process the words in the files and then use the concordance function to examine the data:

>>> docs_processed = nltk.Text(docs.words()) 
>>> docs_processed.concordance("treat")

Corpora Today

When did researchers start to actually use corpora for modeling language phenomena and training algorithms? Beginning in the 1980s, researchers in Speech Recognition began to compile enough spoken language data to create language models (from transcriptions using n-grams and Hidden Markov Models [HMMs]) that worked well enough to recognize a limited vocabulary of words in a very narrow domain. In the 1990s, work in Machine Translation began to see the influence of larger and larger datasets, and with this, the rise of statistical language modeling for translation.

Eventually, both memory and computer hardware became sophisticated enough to collect and analyze increasingly larger datasets of language fragments. This entailed being able to create statistical language models that actually performed with some reasonable accuracy for different natural language tasks.

As one example of the increasing availability of data, Google has recently released the Google Ngram Corpus. The Google Ngram dataset allows users to search for single words (unigrams) or collocations of up to five words (5-grams). The dataset is available for download from the Linguistic Data Consortium, and directly from Google. It is also viewable online through the Google Ngram Viewer. The Ngram dataset consists of more than one trillion tokens (words, numbers, etc.) taken from publicly available websites and sorted by year, making it easy to view trends in language use. In addition to English, Google provides n-grams for Chinese, French, German, Hebrew, Russian, and Spanish, as well as subsets of the English corpus such as American English and English Fiction.


N-grams are sets of items (often words, but they can be letters, phonemes, etc.) that are part of a sequence. By examining how often the items occur together we can learn about their usage in a language, and predict what would likely follow a given sequence (using n-grams for this purpose is called n-gram modeling).

N-grams are applied in a variety of ways every day, such as in websites that provide search suggestions once a few letters are typed in, and for determining likely substitutions for spelling errors. They are also used in speech disambiguation—if a person speaks unclearly but utters a sequence that does not commonly (or ever) occur in the language being spoken, an n-gram model can help recognize that problem and find the words that the speaker probably intended to say.
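To make the idea of n-gram modeling concrete, here is a minimal sketch that extracts n-grams from a token list and uses bigram counts to guess the most likely next word. The function names and the toy sentence are our own inventions for illustration; a realistic model would be trained on a large corpus and would need smoothing to handle unseen sequences.

```python
from collections import Counter, defaultdict


def ngrams(tokens, n):
    """Return all contiguous n-item sequences in tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_bigram_model(tokens):
    """Count, for each word, which words follow it and how often."""
    following = defaultdict(Counter)
    for first, second in ngrams(tokens, 2):
        following[first][second] += 1
    return following


def predict_next(model, word):
    """Return the most frequent follower of word, or None if word was never followed."""
    candidates = model.get(word)
    if not candidates:
        return None
    return candidates.most_common(1)[0][0]


# A toy "corpus" used only for illustration.
tokens = "the cat sat on the mat and the cat slept".split()
model = build_bigram_model(tokens)

print(ngrams(tokens, 3)[:2])
print(predict_next(model, "the"))
```

In this toy data, "the" is followed by "cat" twice and by "mat" once, so the model proposes "cat". Search suggestions and spelling correction rely on the same kind of counting, only over far more data.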

Another modern corpus is ClueWeb09, a dataset “created to support research on information retrieval and related human language technologies. It consists of about 1 billion web pages in ten languages that were collected in January and February 2009.” This corpus is too large to use for an annotation project (it’s about 25 terabytes uncompressed), but some projects have taken parts of the dataset (such as a subset of the English websites) and used them for research (Pomikálek et al. 2012). Data collection from the Internet is an increasingly common way to create corpora, as new and varied content is always being created.

Kinds of Annotation

Consider the different parts of a language’s syntax that can be annotated. These include part of speech (POS), phrase structure, and dependency structure. Table 1-3 shows examples of each of these. There are many different tagsets for the parts of speech of a language that you can choose from.

Table 1-3. Number of POS tags in different corpora
London-Lund Corpus | 197 | 1982

The tagset in Figure 1-5 is taken from the Penn TreeBank, and is the basis for all subsequent annotation over that corpus.

Figure 1-5. The Penn TreeBank tagset

The POS tagging process involves assigning the right lexical class marker(s) to all the words in a sentence (or corpus). This is illustrated in a simple example, “The waiter cleared the plates from the table.” (See Figure 1-6.)

Figure 1-6. POS tagging sample

POS tagging is a critical step in many NLP applications, since it is important to know what category a word is assigned to in order to perform subsequent analysis on it, such as the following:

Speech Synthesis

Is the word a noun or a verb? Examples include object, overflow, insult, and suspect. Without context, each of these words could be either a noun or a verb.


Parsing

You need POS tags in order to make larger syntactic units. For example, in the following sentences, is “clean dishes” a noun phrase or an imperative verb phrase?

Clean dishes are in the cabinet.
Clean dishes before going to work!

Machine Translation

Getting the POS tags and the subsequent parse right makes all the difference when translating the expressions in the preceding list item into another language, such as French: “Des assiettes propres” (Clean dishes) versus “Fais la vaisselle!” (Clean the dishes!).

Consider how these tags are used in the following sentence, from the Penn TreeBank (Marcus et al. 1993):

“From the beginning, it took a man with extraordinary qualities to succeed in Mexico,” says Kimihide Takimura, president of Mitsui group’s Kensetsu Engineering Inc. unit.
“/`` From/IN the/DT beginning/NN ,/, it/PRP took/VBD a/DT man/NN with/IN extraordinary/JJ qualities/NNS to/TO succeed/VB in/IN Mexico/NNP ,/, ”/'' says/VBZ Kimihide/NNP Takimura/NNP ,/, president/NN of/IN Mitsui/NNP group/NN ’s/POS Kensetsu/NNP Engineering/NNP Inc./NNP unit/NN ./.

Identifying the correct parts of speech in a sentence is a necessary step in building many natural language applications, such as parsers, Named Entity Recognizers, QAS, and Machine Translation systems. It is also an important step toward identifying larger structural units such as phrase structure.


Use the NLTK tagger to assign POS tags to the example sentence shown here, and then with other sentences that might be more ambiguous:

>>> from nltk import pos_tag, word_tokenize
>>> pos_tag(word_tokenize("This is a test."))

Look for places where the tagger doesn’t work, and think about what rules might be causing these errors. For example, what happens when you try “Clean dishes are in the cabinet.” and “Clean dishes before going to work!”?

While words have labels associated with them (the POS tags mentioned earlier), specific sequences of words also have labels that can be associated with them. This is called syntactic bracketing (or labeling) and is the structure that organizes all the words we hear into coherent phrases. As mentioned earlier, syntax is the name given to the structure associated with a sentence. The Penn TreeBank is an annotated corpus with syntactic bracketing explicitly marked over the text. An example annotation is shown in Figure 1-7.

Figure 1-7. Syntactic bracketing

This is a bracketed representation of the syntactic tree structure, which is shown in Figure 1-8.

Figure 1-8. Syntactic tree structure

Notice that syntactic bracketing introduces two relations between the words in a sentence: order (precedence) and hierarchy (dominance). For example, the tree structure in Figure 1-8 encodes these relations by the very nature of a tree as a directed acyclic graph (DAG). In a very compact form, the tree captures the precedence and dominance relations given in the following list:

{Dom(NNP1,John), Dom(VBZ,loves), Dom(NNP2,Mary), Dom(NP1,NNP1), Dom(NP2,NNP2), Dom(S,NP1), Dom(VP,VBZ), Dom(VP,NP2), Dom(S,VP),
Prec(NP1,VP), Prec(VBZ,NP2)}
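The way a tree implicitly encodes these relations can be illustrated with a short sketch. We assume here that the tree for “John loves Mary” has the shape implied by the list above, and we write it as nested Python tuples; the verb's preterminal node is labeled VBZ, the Penn TreeBank tag for a third-person singular present-tense verb. The function name relations and the tuple encoding are our own choices for illustration, not a standard representation.

```python
def relations(tree):
    """Collect immediate dominance and sibling precedence pairs from a tree.

    A leaf is a plain string (a word); an internal node is a tuple of the
    form (label, child, child, ...).
    """
    dom, prec = [], []

    def label(node):
        return node if isinstance(node, str) else node[0]

    def walk(node):
        if isinstance(node, str):
            return
        parent, children = node[0], node[1:]
        for child in children:
            # A node immediately dominates each of its children.
            dom.append((parent, label(child)))
            walk(child)
        # Each child immediately precedes its right-hand sibling.
        for left, right in zip(children, children[1:]):
            prec.append((label(left), label(right)))

    walk(tree)
    return dom, prec


# Assumed structure for "John loves Mary"; numbered labels keep the two
# noun phrases distinct, as in the list above.
tree = ("S",
        ("NP1", ("NNP1", "John")),
        ("VP", ("VBZ", "loves"), ("NP2", ("NNP2", "Mary"))))

dom, prec = relations(tree)
for parent, child in dom:
    print(f"Dom({parent},{child})")
for left, right in prec:
    print(f"Prec({left},{right})")
```

Nothing in the tuple representation states a relation explicitly; nesting gives dominance, and left-to-right order among siblings gives precedence. This is what it means for the tree to capture the relations in compact form.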

Any sophisticated natural language application requires some level of syntactic analysis, including Machine Translation. If the resources for full parsing (such as that shown earlier) are not available, then some sort of shallow parsing can be used. This is when partial syntactic bracketing is applied to sequences of words, without worrying about the details of the structure inside a phrase. We will return to this idea in later chapters.

In addition to POS tagging and syntactic bracketing, it is useful to annotate texts in a corpus for their semantic value, that is, what the words mean in the sentence. We can distinguish two kinds of annotation for semantic content within a sentence: what something is, and what role something plays. Here is a more detailed explanation of each:

Semantic typing

A word or phrase in the sentence is labeled with a type identifier (from a reserved vocabulary or ontology), indicating what it denotes.

Semantic role labeling

A word or phrase in the sentence is identified as playing a specific semantic role relative to a role assigner, such as a verb.

Let’s consider what annotation using these two strategies would look like, starting with semantic types. Types are commonly defined using an ontology, such as that shown in Figure 1-9.


The word ontology has its roots in philosophy, but ontologies also have a place in computational linguistics, where they are used to create categorized hierarchies that group similar concepts and objects. By assigning words semantic types in an ontology, we can create relationships between different branches of the ontology, and determine whether linguistic rules hold true when applied to all the words in a category.

Figure 1-9. A simple ontology

The ontology in Figure 1-9 is rather simple, with a small set of categories. However, even this small ontology can be used to illustrate some interesting features of language. Consider the following example, with semantic types marked:

[Ms. Ramirez]Person of [QBC Productions]Organization visited [Boston]Place on [Saturday]Time, where she had lunch with [Mr. Harris]Person of [STU Enterprises]Organization at [1:15 pm]Time.

From this small example, we can start to make observations about how these objects interact with one another. People can visit places, people have “of” relationships with organizations, and lunch can happen on Saturday at 1:15 p.m. Given a large enough corpus of similarly labeled sentences, we can start to detect patterns in usage that will tell us more about how these labels do and do not interact.

A corpus of these examples can also tell us where our categories might need to be expanded. There are two “times” in this sentence: Saturday and 1:15 p.m. We can see that events can occur “on” Saturday, but “at” 1:15 p.m. A larger corpus would show that this pattern remains true with other days of the week and hour designations—there is a difference in usage here that cannot be inferred from the semantic types. However, not all ontologies will capture all information—the applications of the ontology will determine whether it is important to capture the difference between Saturday and 1:15 p.m.
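To make the “on” versus “at” observation concrete, here is a minimal sketch that reads the bracketed example sentence and lists the word immediately preceding each Time span. The inline `[text]Type` notation is taken from the example above; the regular expressions and variable names are assumptions made for this illustration, not an established annotation format.

```python
import re

# The example sentence, with semantic types marked inline as [text]Type.
sentence = ("[Ms. Ramirez]Person of [QBC Productions]Organization visited "
            "[Boston]Place on [Saturday]Time, where she had lunch with "
            "[Mr. Harris]Person of [STU Enterprises]Organization at [1:15 pm]Time.")

# Every typed span: the bracketed text followed by one of the four ontology labels.
TYPED_SPAN = re.compile(r"\[([^\]]+)\](Person|Organization|Place|Time)")
spans = [(m.group(1), m.group(2)) for m in TYPED_SPAN.finditer(sentence)]

# The single word directly before each Time span (here, a preposition).
WORD_BEFORE_TIME = re.compile(r"(\w+) \[([^\]]+)\]Time")
time_contexts = [(m.group(1), m.group(2)) for m in WORD_BEFORE_TIME.finditer(sentence)]

for text, semantic_type in spans:
    print(f"{semantic_type:12} {text}")
for preposition, time_text in time_contexts:
    print(f"{preposition!r} precedes the Time expression {time_text!r}")
```

Run over a larger corpus, counts of such (preposition, Time) pairs would be one simple way to see whether the single Time category hides a usage distinction worth adding to the ontology.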

The annotation strategy we just described marks up what a linguistic expression refers to. But let’s say we want to know the basics for Question Answering, namely, the who, what, where, and when of a sentence. This involves identifying what are called the semantic role labels associated with a verb. What are semantic roles? Although there is no complete agreement on what roles exist in language (there rarely is with linguists), the following list is a fair representation of the kinds of semantic labels associated with different verbs:


Agent

The event participant that is doing or causing the event to occur

Theme/figure

The event participant who undergoes a change in position or state

Experiencer

The event participant who experiences or perceives something

Source

The location or place from which the motion begins; the person from whom the theme is given

Goal

The location or place to which the motion is directed or terminates

Recipient

The person who comes into possession of the theme

Patient

The event participant who is affected by the event

Instrument

The event participant used by the agent to do or cause the event

Location

The location or place associated with the event itself

The annotated data that results explicitly identifies entity extents and the target relations between the entities:

  • [The man]agent painted [the wall]patient with [a paint brush]instrument.

  • [Mary]figure walked to [the cafe]goal from [her house]source.

  • [John]agent gave [his mother]recipient [a necklace]theme.

  • [My brother]theme lives in [Milwaukee]location.
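As a sketch of how such role-labeled sentences could feed a Question Answering step, the snippet below turns each annotated sentence into a role-to-filler mapping. The inline `[text]role` notation comes from the examples above; the function name roles and the regular expression are hypothetical choices for this illustration.

```python
import re

# A bracketed extent followed by a lowercase role label, e.g. "[John]agent".
ROLE_SPAN = re.compile(r"\[([^\]]+)\]([a-z]+)")


def roles(annotated_sentence):
    """Map each semantic role label to the text of the extent that fills it."""
    return {role: text for text, role in ROLE_SPAN.findall(annotated_sentence)}


examples = [
    "[The man]agent painted [the wall]patient with [a paint brush]instrument.",
    "[Mary]figure walked to [the cafe]goal from [her house]source.",
    "[John]agent gave [his mother]recipient [a necklace]theme.",
    "[My brother]theme lives in [Milwaukee]location.",
]

for sentence in examples:
    print(roles(sentence))

# "Who gave something?" and "What was given?" become simple lookups.
giving = roles(examples[2])
print("who:", giving["agent"], "| what:", giving["theme"])
```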

Language Data and Machine Learning

Now that we have reviewed the methodology of language annotation along with some examples of annotation formats over linguistic data, we will describe the computational framework within which such annotated corpora are used, namely, that of machine learning. Machine learning is the name given to the area of Artificial Intelligence concerned with the development of algorithms that learn or improve their performance from experience or previous encounters with data. These algorithms are said to learn (or generate) a function that maps particular input data to the desired output. For our purposes, the “data” that an ML algorithm encounters is natural language, most often in the form of text, and typically annotated with tags that highlight the specific features that are relevant to the learning task. As we will see, annotation schemas such as those discussed earlier provide rich starting points as the input data source for the ML process (the training phase).

When working with annotated datasets in NLP, three major types of ML algorithms are typically used:

Supervised learning

Any technique that generates a function mapping from inputs to a fixed set of labels (the desired output). The labels are typically metadata tags provided by humans who annotate the corpus for training purposes.

Unsupervised learning

Any technique that tries to find structure from an input set of unlabeled data.

Semi-supervised learning

Any technique that generates a function mapping from inputs of both labeled data and unlabeled data; a combination of both supervised and unsupervised learning.

Table 1-4 shows a general overview of ML algorithms and some of the annotation tasks they are frequently used to emulate. We’ll talk more about why these algorithms are used for these different tasks in Chapter 7.

Table 1-4. Annotation tasks and their accompanying ML algorithms
Algorithm | Annotation tasks
Clustering | Genre classification, spam labeling
Decision trees | Semantic type or ontological class assignment, coreference resolution
Naïve Bayes | Sentiment classification, semantic type or ontological class assignment
Maximum Entropy (MaxEnt) | Sentiment classification, semantic type or ontological class assignment
Structured pattern induction (HMMs, CRFs, etc.) | POS tagging, sentiment classification, word sense disambiguation

You’ll notice that some of the tasks appear with more than one algorithm. That’s because different approaches have been tried successfully for different types of annotation tasks, and depending on the most relevant features of your own corpus, different algorithms may prove to be more or less effective. Just to give you an idea of what the algorithms listed in that table mean, the rest of this section gives an overview of the main types of ML algorithms.


Classification

Classification is the task of identifying the labeling for a single entity from a set of data. For example, in order to distinguish spam from not-spam in your email inbox, an algorithm called a classifier is trained on a set of labeled data, where individual emails have been assigned the label [+spam] or [-spam]. It is the presence of certain (known) words or phrases in an email that helps to identify an email as spam. These words are essentially treated as features that the classifier will use to model the positive instances of spam as compared to not-spam. Another example of a classification problem is patient diagnosis, from the presence of known symptoms and other attributes. Here we would identify a patient as having a particular disease, A, and label the patient record as [+disease-A] or [-disease-A], based on specific features from the record or text. This might include blood pressure, weight, gender, age, existence of symptoms, and so forth. The most common algorithms used in classification tasks are Maximum Entropy (MaxEnt), Naïve Bayes, decision trees, and Support Vector Machines (SVMs).
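To give a feel for how words act as features, here is a minimal sketch of a Naïve Bayes spam classifier with add-one smoothing, written with the standard library only. The four training emails are invented for illustration, and the function names train_nb and classify are hypothetical; a real system would train on a far larger annotated corpus.

```python
import math
from collections import Counter

# Toy labeled data (invented for this sketch): (email text, label).
TRAINING = [
    ("win cash prize now", "+spam"),
    ("cheap prize offer click now", "+spam"),
    ("meeting agenda for monday", "-spam"),
    ("lunch on monday with the team", "-spam"),
]


def train_nb(examples):
    """Count labels and per-label word frequencies."""
    label_counts = Counter()
    word_counts = {}
    vocabulary = set()
    for text, label in examples:
        label_counts[label] += 1
        counts = word_counts.setdefault(label, Counter())
        for word in text.split():
            counts[word] += 1
            vocabulary.add(word)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log posterior under Naive Bayes."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word in vocabulary:  # words never seen in training are skipped
                # Add-one (Laplace) smoothing avoids zero probabilities.
                likelihood = (word_counts[label][word] + 1) / (total_words + len(vocabulary))
                score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train_nb(TRAINING)
print(classify("cash prize offer", model))
print(classify("monday team lunch", model))
```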


Clustering

Clustering is the name given to ML algorithms that find natural groupings and patterns in the input data, without any labeling or training at all. The problem is generally viewed as an unsupervised learning task, where either the dataset is unlabeled or the labels are ignored in the process of making clusters. The objects within a cluster are “similar in some respect,” and “dissimilar to the objects” in other clusters. Some of the more common algorithms used for this task include k-means, hierarchical clustering, Kernel Principal Component Analysis, and Fuzzy C-Means (FCM).
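The following is a small sketch of k-means, one of the algorithms named above, run on six invented two-dimensional points. The starting centroids are passed in explicitly so the run is deterministic; in practice they are usually chosen at random, and documents would first have to be turned into numeric feature vectors. The function names are hypothetical.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, max_iterations=10):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(p, centroids[k]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[k])  # keep an empty cluster where it was
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


# Invented data: no labels are supplied, only coordinates.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```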

Structured Pattern Induction

Structured pattern induction involves learning more than the label or category of a single entity; it involves learning a sequence of labels, or other structural dependencies between the labeled items. For example, a sequence of labels might be a stream of phonemes in a speech signal (in Speech Recognition); a sequence of POS tags in a sentence corresponding to a syntactic unit (phrase); a sequence of dialog moves in a phone conversation; or steps in a task such as parsing, coreference resolution, or grammar induction. Algorithms used for such problems include Hidden Markov Models (HMMs), Conditional Random Fields (CRFs), and Maximum Entropy Markov Models (MEMMs).
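As one illustration of labeling a whole sequence rather than a single item, here is a compact sketch of Viterbi decoding for an HMM tagger. All probabilities below are made up by hand for a two-tag toy model; in a real setting they would be estimated from a POS-annotated corpus. The function name viterbi and the unseen parameter are assumptions of this sketch.

```python
def viterbi(words, states, start_p, trans_p, emit_p, unseen=1e-6):
    """Return the most probable tag sequence for `words` under the given HMM."""
    # best[t][s] = (probability of the best path ending in state s at step t, previous state)
    best = [{s: (start_p[s] * emit_p[s].get(words[0], unseen), None) for s in states}]
    for word in words[1:]:
        column = {}
        for s in states:
            prev, prob = max(((r, best[-1][r][0] * trans_p[r][s]) for r in states),
                             key=lambda pair: pair[1])
            column[s] = (prob * emit_p[s].get(word, unseen), prev)
        best.append(column)
    # Follow the back-pointers from the best final state.
    state = max(states, key=lambda s: best[-1][s][0])
    path = [state]
    for column in reversed(best[1:]):
        state = column[state][1]
        path.append(state)
    return list(reversed(path))


# Hand-set toy parameters (assumptions, not learned values).
STATES = ["NNP", "VBZ"]
START = {"NNP": 0.8, "VBZ": 0.2}
TRANS = {"NNP": {"NNP": 0.3, "VBZ": 0.7},
         "VBZ": {"NNP": 0.8, "VBZ": 0.2}}
EMIT = {"NNP": {"John": 0.5, "Mary": 0.5},
        "VBZ": {"loves": 1.0}}

tags = viterbi(["John", "loves", "Mary"], STATES, START, TRANS, EMIT)
print(tags)
```

Multiplying raw probabilities is fine for a three-word sentence; for longer inputs, summing log probabilities is the usual way to avoid numerical underflow.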

We will return to these approaches in more detail when we discuss machine learning in greater depth in Chapter 7.

The Annotation Development Cycle

The features we use for encoding a specific linguistic phenomenon must be rich enough to capture the desired behavior in the algorithm that we are training. These linguistic descriptions are typically distilled from extensive theoretical modeling of the phenomenon. The descriptions in turn form the basis for the annotation values of the specification language, which are themselves the features used in a development cycle for training and testing an identification or labeling algorithm over text. Finally, based on an analysis and evaluation of the performance of a system, the model of the phenomenon may be revised for retraining and testing.

We call this particular cycle of development the MATTER methodology, as detailed here and shown in Figure 1-10 (Pustejovsky 2006):


Model

Structural descriptions provide theoretically informed attributes derived from empirical observations over the data.

Annotate

An annotation scheme assumes a feature set that encodes specific structural descriptions and properties of the input data.

Train

The algorithm is trained over a corpus annotated with the target feature set.

Test

The algorithm is tested against held-out data.

Evaluate

A standardized evaluation of results is conducted.

Revise

The model and the annotation specification are revisited in order to make the annotation more robust and reliable with use in the algorithm.

Figure 1-10. The MATTER cycle

We assume some particular problem or phenomenon has sparked your interest, for which you will need to label natural language data in order to train an ML algorithm. Consider three kinds of problems. First imagine a direct text classification task. It might be that you are interested in classifying your email according to its content or with a particular interest in filtering out spam. Or perhaps you are interested in rating your incoming mail according to the emotional content expressed in each message.

Now let’s consider a more involved task, performed over this same email corpus: identifying what are known as Named Entities (NEs). These are references to everyday things in our world that have proper names associated with them; for example, people, countries, products, holidays, companies, sports, religions, and so on.

Finally, imagine an even more complicated task, that of identifying all the different events that have been mentioned in your mail (birthdays, parties, concerts, classes, airline reservations, upcoming meetings, etc.). Once this has been done, you will need to “timestamp” them and order them, that is, identify when they happened, if in fact they did happen. This is called the temporal awareness problem, and is one of the most difficult in the field.

We will use these different tasks throughout this section to help us clarify what is involved with the different steps in the annotation development cycle.

Model the Phenomenon

The first step in the MATTER development cycle is “Model the Phenomenon.” The steps involved in modeling, however, vary greatly, depending on the nature of the task you have defined for yourself. In this section, we will look at what modeling entails and how you know when you have an adequate first approximation of a model for your task.

The parameters associated with creating a model are quite diverse, and it is difficult to get different communities to agree on just what a model is. In this section we will be pragmatic and discuss a number of approaches to modeling and show how they provide the basis from which to create annotated datasets. Briefly, a model is a characterization of a certain phenomenon in terms that are more abstract than the elements in the domain being modeled. For the following discussion, we will define a model as consisting of a vocabulary of terms, T, the relations between these terms, R, and their interpretation, I. So, a model, M, can be seen as a triple, M = <T,R,I>. To better understand this notion of a model, let us consider the scenarios introduced earlier. For spam detection, we can treat it as a binary text classification task, requiring the simplest model with the categories (terms) spam and not-spam associated with the entire email document. Hence, our model is simply:

  • T = {Document_type, Spam, Not-Spam}

  • R = {Document_type ::= Spam | Not-Spam}

  • I = {Spam = “something we don’t want!”, Not-Spam = “something we do want!”}

The document itself is labeled as being a member of one of these categories. This is called document annotation and is the simplest (and most coarse-grained) annotation possible. Now, when we say that the model contains only the label names for the categories (e.g., sports, finance, news, editorials, fashion, etc.), this means there is no other annotation involved. This does not mean the content of the files is not subject to further scrutiny, however. A document that is labeled as a category, A, for example, is actually analyzed as a large feature vector containing at least the words in the document. A more fine-grained annotation for the same task would be to identify specific words or phrases in the document and label them as also being associated with the category directly. We’ll return to this strategy in Chapter 4. Essentially, a good model of the phenomenon (task) is where you start when designing the features that go into your learning algorithm. The better the features, the better the performance of the ML algorithm!

Preparing a corpus with annotations of NEs, as mentioned earlier, involves a richer model than the spam-filter application just discussed. We introduced a four-category ontology for NEs in the previous section, and this will be the basis for our model to identify NEs in text. The model is illustrated as follows:

  • T = {Named_Entity, Organization, Person, Place, Time}

  • R = {Named_Entity ::= Organization | Person | Place | Time}

  • I = {Organization = “list of organizations in a database”, Person = “list of people in a database”, Place = “list of countries, geographic locations, etc.”, Time = “all possible dates on the calendar”}
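One way to picture the triple M = <T,R,I> is as a small data structure. The sketch below encodes the NE model just given; the class name Model, its field names, and the is_consistent check are hypothetical conveniences for this illustration, not part of any annotation standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    """A model M = <T, R, I>: terms, relations between terms, and interpretations."""
    terms: frozenset        # T
    relations: dict         # R: parent term -> tuple of alternative subterms
    interpretation: dict    # I: term -> informal description

    def allowed_labels(self, parent):
        """Labels an annotator may choose for something of type `parent`."""
        return self.relations.get(parent, ())

    def is_consistent(self):
        """Every term mentioned in R or I must be declared in T."""
        mentioned = set(self.relations) | set(self.interpretation)
        for alternatives in self.relations.values():
            mentioned.update(alternatives)
        return mentioned <= self.terms


ne_model = Model(
    terms=frozenset({"Named_Entity", "Organization", "Person", "Place", "Time"}),
    relations={"Named_Entity": ("Organization", "Person", "Place", "Time")},
    interpretation={
        "Organization": "list of organizations in a database",
        "Person": "list of people in a database",
        "Place": "list of countries, geographic locations, etc.",
        "Time": "all possible dates on the calendar",
    },
)

print(ne_model.allowed_labels("Named_Entity"))
print(ne_model.is_consistent())
```

The spam model fits the same structure, with Document_type as the only parent term and Spam and Not-Spam as its alternatives.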

This model is necessarily more detailed, because we are actually annotating spans of natural language text, rather than simply labeling documents (e.g., emails) as spam or not-spam. That is, within the document, we are recognizing mentions of companies, actors, countries, and dates.

Finally, what about an even more involved task, that of recognizing all temporal information in a document? That is, questions such as the following:

  • When did that meeting take place?

  • How long was John on vacation?

  • Did Jill get promoted before or after she went on maternity leave?

We won’t go into the full model for this domain, but let’s see what is minimally necessary in order to create annotation features to understand such questions. First we need to distinguish between Time expressions (“yesterday,” “January 27,” “Monday”), Events (“promoted,” “meeting,” “vacation”), and Temporal relations (“before,” “after,” “during”). Because our model is so much more detailed, let’s divide the descriptive content by domain:

  • Time_Expression ::= TIME | DATE | DURATION | SET

    • TIME: 10:15 a.m., 3 o’clock, etc.

    • DATE: Monday, April 2011

    • DURATION: 30 minutes, two years, four days

    • SET: every hour, every other month

  • Event: Meeting, vacation, promotion, maternity leave, etc.

  • Temporal_Relations ::= BEFORE | AFTER | DURING | EQUAL | OVERLAP | ...

We will come back to this problem in a later chapter, when we discuss the impact of the initial model on the subsequent performance of the algorithms you are trying to train over your labeled data.


In later chapters, we’ll see that there are actually several models that might be appropriate for describing a phenomenon, each providing a different view of the data. We’ll call this multimodel annotation of the phenomenon. A common scenario for multimodel annotation involves annotators who have domain expertise in an area (such as biomedical knowledge). They are told to identify specific entities, events, attributes, or facts from documents, given their knowledge and interpretation of a specific area. From this annotation, nonexperts can be used to mark up the structural (syntactic) aspects of these same phenomena, making it possible to gain domain expert understanding without forcing the domain experts to learn linguistic theory as well.

Once you have an initial model for the phenomena associated with the problem task you are trying to solve, you effectively have the first tag specification, or spec, for the annotation. This is the document from which you will create the blueprint for how to annotate the corpus with the features in the model. That blueprint is called the annotation guideline, and we talk about it in the next section.

Annotate with the Specification

Now that you have a model of the phenomenon encoded as a specification document, you will need to train human annotators to mark up the dataset according to the tags that are important to you. This is easier said than done, and in fact often requires multiple iterations of modeling and annotating, as shown in Figure 1-11. This process is called the MAMA (Model-Annotate-Model-Annotate) cycle, or the “babeling” phase of MATTER. The annotation guideline helps direct the annotators in the task of identifying the elements and then associating the appropriate features with them, when they are identified.

Two kinds of tags will concern us when annotating natural language data: consuming tags and nonconsuming tags. A consuming tag refers to a metadata tag that has real content from the dataset associated with it (e.g., it “consumes” some text); a nonconsuming tag, on the other hand, is a metadata tag that is inserted into the file but is not associated with any actual part of the text. An example will help make this distinction clear. Say that we want to annotate text for temporal information, as discussed earlier. Namely, we want to annotate for three kinds of tags: times (called Timex tags), temporal relations (TempRels), and Events. In the first sentence in the following example, each tag is expressed directly as real text. That is, they are all consuming tags (“promoted” is marked as an Event, “before” is marked as a TempRel, and “the summer” is marked as a Timex). Notice, however, that in the second sentence, there is no explicit temporal relation in the text, even though we know that it’s something like “on”. So, we actually insert a TempRel with the value of “on” in our corpus, but the tag is flagged as a “nonconsuming” tag.

  • John was [promoted]Event [before]TempRel [the summer]Timex.

  • John was [promoted]Event [Monday]Timex.
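The difference between the two kinds of tags is easy to see in a stand-off representation, where each tag records character offsets into the text. The sketch below is one possible encoding, assumed for illustration: the Tag class, its fields, and the convention of leaving offsets empty for a nonconsuming tag are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tag:
    kind: str
    start: Optional[int] = None   # character offsets; None when no text is consumed
    end: Optional[int] = None
    value: Optional[str] = None   # extra information, e.g. an implicit relation

    @property
    def consuming(self):
        """A consuming tag is anchored to an actual extent of text."""
        return self.start is not None and self.end is not None


def consuming_tag(kind, text, phrase):
    """Build a consuming tag over the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Tag(kind, start, start + len(phrase))


text = "John was promoted Monday."
tags = [
    consuming_tag("Event", text, "promoted"),
    Tag("TempRel", value="on"),            # nonconsuming: no "on" appears in the text
    consuming_tag("Timex", text, "Monday"),
]

for tag in tags:
    extent = text[tag.start:tag.end] if tag.consuming else None
    print(tag.kind, "consuming" if tag.consuming else "nonconsuming", extent, tag.value)
```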

An important factor when creating an annotated corpus of your text is, of course, consistency in the way the annotators mark up the text with the different tags. One seemingly trivial issue turns out to be among the most problematic when comparing annotations: the extent, or span, of the tag. Compare the three annotations that follow. In the first, the Organization tag spans “QBC Productions,” leaving out the company identifier “Inc.” and the location “of East Anglia,” while these are included in varying spans in the next two annotations.

  • [QBC Productions]Organization Inc. of East Anglia

  • [QBC Productions Inc.]Organization of East Anglia

  • [QBC Productions Inc. of East Anglia]Organization

Each of these might look correct to an annotator, but only one actually corresponds to the correct markup in the annotation guideline. How are these compared and resolved?
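As a first step toward comparing such annotations, the three extents can be stored as character offsets and checked pairwise. The sketch below contrasts two possible matching criteria, exact match and overlap; which criterion an annotation task should adopt is a design decision this example does not settle, and the annotator names and helper functions are hypothetical.

```python
from itertools import combinations

text = "QBC Productions Inc. of East Anglia"


def span_of(phrase):
    """Character offsets (start, end) of the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return (start, start + len(phrase))


# The three Organization extents from the examples above.
annotations = {
    "annotator_1": span_of("QBC Productions"),
    "annotator_2": span_of("QBC Productions Inc."),
    "annotator_3": span_of("QBC Productions Inc. of East Anglia"),
}


def exact_match(a, b):
    """Strict criterion: both boundaries must agree."""
    return a == b


def overlaps(a, b):
    """Lenient criterion: the two extents share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


for (name_a, span_a), (name_b, span_b) in combinations(annotations.items(), 2):
    print(name_a, name_b,
          "exact" if exact_match(span_a, span_b) else "not exact",
          "overlap" if overlaps(span_a, span_b) else "no overlap")
```

Under the strict criterion the three annotators never agree, while under the lenient one they always do, which is why the guideline has to state where an extent begins and ends.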

Figure 1-11. The inner workings of the MAMA portion of the MATTER cycle


In order to assess how well an annotation task is defined, we use Inter-Annotator Agreement (IAA) scores to show how individual annotators compare to one another. If an IAA score is high, that is an indication that the task is well defined and other annotators will be able to continue the work. IAA is typically measured using a statistic called a Kappa statistic. For comparing two annotations against each other, Cohen’s Kappa is usually used, while when comparing more than two annotations, Fleiss’s Kappa is used. These will be defined in Chapter 8.
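As a preview, here is a minimal sketch of Cohen's Kappa for two annotators, using the standard formula kappa = (observed agreement - expected agreement) / (1 - expected agreement). The ten labels per annotator are invented for illustration, and the function name cohen_kappa is an assumption of this sketch.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: chance that both pick the same label independently.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[lab] * counts_b[lab]
                   for lab in set(labels_a) | set(labels_b)) / (n * n)
    # Note: undefined (division by zero) if both annotators use a single identical label throughout.
    return (observed - expected) / (1 - expected)


# Invented document-level labels from two annotators over ten emails.
annotator_a = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
annotator_b = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham", "spam"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Here the annotators agree on 8 of 10 items, but because chance agreement is already 0.52, the resulting Kappa is noticeably lower than the raw 0.8 agreement rate.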

Note that having a high IAA score doesn’t necessarily mean the annotations are correct; it simply means the annotators are all interpreting your instructions in the same way. Your task may still need to be revised even if your IAA scores are high. This will be discussed further in Chapter 9.

Once you have your corpus annotated by at least two people (more is preferable, but not always practical), it’s time to create the gold standard corpus. The gold standard is the final version of your annotated data. It uses the most up-to-date specification that you created during the annotation process, and it has everything tagged correctly according to the most recent guidelines. This is the corpus that you will use for machine learning, and it is created through the process of adjudication. At this point in the process, you (or someone equally familiar with all the tasks) will compare the annotations and determine which tags in the annotations are correct and should be included in the gold standard.

Train and Test the Algorithms over the Corpus

Now that you have adjudicated your corpus, you can use your newly created gold standard for machine learning. The most common way to do this is to divide your corpus into two parts: the development corpus and the test corpus. The development corpus is then further divided into two parts: the training set and the development-test set. Figure 1-12 shows a standard breakdown of a corpus, though different distributions might be used for different tasks. The files are usually assigned to the different sets at random.

Figure 1-12. Corpus divisions for machine learning

The training set is used to train the algorithm that you will use for your task. The development-test (dev-test) set is used for error analysis. Once the algorithm is trained, it is run on the dev-test set and a list of errors can be generated to find where the algorithm is failing to correctly label the corpus. Once sources of error are found, the algorithm can be adjusted and retrained, then tested against the dev-test set again. This procedure can be repeated until satisfactory results are obtained.
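The corpus divisions described above can be sketched as a random split of file names. The 80/10/10 proportions, the fixed seed, and the function name split_corpus are assumptions made for this illustration; the right proportions depend on the task and the size of the corpus.

```python
import random


def split_corpus(files, dev_test_fraction=0.1, test_fraction=0.1, seed=0):
    """Randomly divide files into training, dev-test, and held-out test sets."""
    shuffled = sorted(files)               # sort first so the split is reproducible
    random.Random(seed).shuffle(shuffled)  # seeded shuffle assigns files at random
    n_test = round(len(shuffled) * test_fraction)
    n_dev_test = round(len(shuffled) * dev_test_fraction)
    test = shuffled[:n_test]
    dev_test = shuffled[n_test:n_test + n_dev_test]
    training = shuffled[n_test + n_dev_test:]
    return training, dev_test, test


# Hypothetical file names standing in for the adjudicated gold standard.
gold_files = [f"doc{i:03d}.txt" for i in range(100)]
training, dev_test, test = split_corpus(gold_files)
print(len(training), len(dev_test), len(test))
```

The test set is set aside once and not consulted again until the train and dev-test loop has finished.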

Once the training portion is completed, the algorithm is run against the held-out test corpus, which until this point has not been involved in training or dev-testing. By holding out the data, we can show how well the algorithm will perform on new data, which gives an expectation of how it would perform on data that someone else creates as well. Figure 1-13 shows the “TTER” portion of the MATTER cycle, with the different corpus divisions and steps.

Figure 1-13. The Training–Evaluation cycle

Evaluate the Results

The most common method for evaluating the performance of your algorithm is to calculate how accurately it labels your dataset. This can be done by measuring the fraction of the results from the dataset that are labeled correctly using a standard technique of “relevance judgment” called the Precision and Recall metric.

Here’s how it works. For each label you are using to identify elements in the data, the dataset is divided into two subsets: one that is labeled “relevant” to the label, and one that is not relevant. Precision is a metric that is computed as the fraction of the correct instances from those that the algorithm labeled as being in the relevant subset. Recall is computed as the fraction of correct items among those that actually belong to the relevant subset. The following confusion matrix helps illustrate how this works:

Gold Labeling | Predicted Labeling: positive | Predicted Labeling: negative
positive | true positive (tp) | false negative (fn)
negative | false positive (fp) | true negative (tn)

Given this matrix, we can define both precision and recall as shown in Figure 1-14, along with a conventional definition of accuracy.

$$P = \frac{tp}{tp + fp} \qquad R = \frac{tp}{tp + fn} \qquad \text{Accuracy} = \frac{tp + tn}{tp + fp + tn + fn}$$
Figure 1-14. Precision and recall equations

The values of P and R are typically combined into a single metric called the F-measure, which is the harmonic mean of the two.

$$F = 2 \times \frac{P \times R}{P + R}$$

This creates an overall score used for evaluation where precision and recall are measured equally, though depending on the purpose of your corpus and algorithm, a variation of this measure, such as one that rates precision higher than recall, may be more useful to you. We will give more detail about how these equations are used for evaluation in Chapter 8.
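The following sketch computes these scores from a list of gold labels and a list of predicted labels, following the confusion matrix given earlier. The ten example labels are invented, and the function name evaluate is hypothetical; the F-measure here is the balanced harmonic mean of precision and recall.

```python
def evaluate(gold, predicted, positive):
    """Return precision, recall, F-measure, and accuracy for one label."""
    pairs = list(zip(gold, predicted))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    tn = sum(g != positive and p != positive for g, p in pairs)
    # Guard against empty denominators, e.g. when nothing is predicted positive.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return precision, recall, f_measure, accuracy


# Invented gold-standard and system labels for ten emails.
gold = ["spam", "spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "ham"]
predicted = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "ham", "spam", "spam"]

precision, recall, f_measure, accuracy = evaluate(gold, predicted, positive="spam")
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f} accuracy={accuracy:.2f}")
```

In this toy run there are 3 true positives, 2 false positives, and 1 false negative, so precision (0.60) is lower than recall (0.75) even though accuracy is 0.70.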

Revise the Model and Algorithms

Once you have evaluated the results of training and testing your algorithm on the data, you will want to do an error analysis to see where it performed well and where it made mistakes. This can be done with various packages and formulas, which we will discuss in Chapter 8, including the creation of what are called confusion matrices. These will help you go back to the design of the model, in order to create better tags and features that will subsequently improve your gold standard, and consequently result in better performance of your learning algorithm.

A brief example of model revision will help make this point. Recall the model for NE extraction from the previous section, where we distinguished between four types of entities: Organization, Place, Time, and Person. Depending on the corpus you have assembled, it might be the case that you are missing a major category, or that you would be better off making some subclassifications within one of the existing tags. For example, you may find that the annotators are having a hard time knowing what to do with named occurrences or events, such as Easter, 9-11, or Thanksgiving. These denote more than simply Times, and suggest that perhaps a new category should be added to the model: Event. Additionally, it might be the case that there is reason to distinguish geopolitical Places from nongeopolitical Places. As with the “Model-Annotate” and “Train-Test” cycles, once such additions and modifications are made to the model, the MATTER cycle begins all over again, and revisions will typically bring improved performance.


Summary

In this chapter, we have provided an overview of the history of corpus and computational linguistics, and the general methodology for creating an annotated corpus. Specifically, we have covered the following points:

  • Natural language annotation is an important step in the process of training computers to understand human speech for tasks such as Question Answering, Machine Translation, and summarization.

  • All of the layers of linguistic research, from phonetics to semantics to discourse analysis, are used in different combinations for different ML tasks.

  • In order for annotation to provide statistically useful results, it must be done on a sufficiently large dataset, called a corpus. The study of language using corpora is corpus linguistics.

  • Corpus linguistics began in the 1940s, but did not become a feasible way to study language until decades later, when the technology caught up to the demands of the theory.

  • A corpus is a collection of machine-readable texts that are representative of natural human language. Good corpora are representative and balanced with respect to the genre or language that they seek to represent.

  • The uses of computers with corpora have developed over the years from simple key-word-in-context (KWIC) indexes and concordances that allowed full-text documents to be searched easily, to modern, statistically based ML techniques.

  • Annotation is the process of augmenting a corpus with higher-level information, such as part-of-speech tagging, syntactic bracketing, anaphora resolution, and word senses. Adding this information to a corpus allows the computer to find features that can make a defined task easier and more accurate.

  • Once a corpus is annotated, the data can be used in conjunction with ML algorithms that perform classification, clustering, and pattern induction tasks.

  • Having a good annotation scheme and accurate annotations is critical for machine learning that relies on data outside of the text itself. The process of developing the annotated corpus is often cyclical, with changes made to the tagsets and tasks as the data is studied further.

  • Here we refer to the annotation development cycle as the MATTER cycle—Model, Annotate, Train, Test, Evaluate, Revise.

  • Often before reaching the Test step of the process, the annotation scheme has already gone through several revisions of the Model and Annotate stages.

  • This book will show you how to create an accurate and effective annotation scheme for a task of your choosing, apply the scheme to your corpus, and then use ML techniques to train a computer to perform the task you designed.
