Chapter 3. Corpus Analytics

Now that you have created a corpus for your defined goal, it is important to know what it contains. This chapter equips you with the techniques and tools you will need to analyze the linguistic content of that corpus and to perform a variety of statistical analyses over it.

To this end, we will cover the aspects of statistics and probability that you need in order to understand, from a linguistic perspective, just what is in your corpus. This area is called corpus analytics. Topics will include the following:

  • How to measure basic frequencies of word occurrence, by lemma and by token

  • How to normalize the data you want to analyze

  • How to measure the relatedness between words and phrases in a corpus (i.e., distributions)
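As a rough illustration of the first two topics, the following minimal sketch counts word frequencies by token and by lemma and then normalizes the token counts by corpus size. The two review snippets and the small `LEMMAS` table are invented for this example and are not taken from the IMDb corpus; a real project would likely use a proper tokenizer and lemmatizer instead of the simple regular expression and lookup table assumed here.

```python
import re
from collections import Counter

# Hypothetical toy data: two made-up review snippets, not taken from IMDb.
reviews = [
    "The movie was great and the actors were great.",
    "The actor was good but the movies were better.",
]

# Hypothetical hand-written lemma table; a real project would use a lemmatizer.
LEMMAS = {"was": "be", "were": "be", "actors": "actor", "movies": "movie"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokens = [tok for review in reviews for tok in tokenize(review)]

# Frequency by token: every distinct surface form is counted separately.
token_counts = Counter(tokens)

# Frequency by lemma: inflected forms are first mapped to a shared base form.
lemma_counts = Counter(LEMMAS.get(tok, tok) for tok in tokens)

# Normalization: divide by corpus size so that counts from corpora of
# different sizes can be compared as relative frequencies.
total = len(tokens)
relative_freq = {word: count / total for word, count in token_counts.items()}

print(token_counts.most_common(3))
print(lemma_counts.most_common(3))
```

Note how the two counts differ: "actor" and "actors" are separate entries in `token_counts` but a single entry in `lemma_counts`, which is the distinction the first topic refers to.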

Knowing what is in your corpus will help you build your model for automatically identifying the tags you will be creating in the next chapter. We will introduce these concepts using linguistic examples whenever possible. Throughout the chapter, we will reference a corpus of movie reviews, assembled from IMDb.com (IMDb). This will prove to be a useful platform from which we can introduce these concepts.

Statistics is important for several reasons, but mostly it gives us two important abilities:

Data analysis

Discovering latent properties in the dataset

Significance for inferential statistics

Allowing us to make judgments and derive information ...
