Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs
October 2012
Beginner to intermediate
342 pages
9h 55m
English
O'Reilly Media, Inc.

Chapter 3. Corpus Analytics

Now that you have created a corpus for your defined goal, it is important to know what it contains. This chapter equips you with the techniques and tools you need to analyze the linguistic content of that corpus and to run a variety of statistical analyses over it.

To this end, we will cover the aspects of statistics and probability that you need in order to understand, from a linguistic perspective, just what is in the corpus you are building. This area is called corpus analytics. Topics include the following:

  • How to measure basic frequencies of word occurrence, by lemma and by token

  • How to normalize the data you want to analyze

  • How to measure the relatedness between words and phrases in a corpus (i.e., distributions)
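As a rough illustration of the first two topics, the sketch below counts token frequencies and normalizes them to relative frequencies. It is a minimal example under stated assumptions, not the book's own method: the sample text is invented (it is not drawn from the IMDb reviews), the regular-expression tokenizer is a simplification, and the function names are hypothetical. Counting by lemma would also need a lemmatizer, which is not shown here.

```python
import re
from collections import Counter

# Invented sample text standing in for a corpus (an assumption for illustration).
SAMPLE = "The movie was good. The acting was good, but the plot was thin."


def tokenize(text):
    """Lowercase the text and extract runs of letters/apostrophes as tokens.

    This is a simplifying assumption; real corpora usually need a proper tokenizer.
    """
    return re.findall(r"[a-z']+", text.lower())


def token_frequencies(tokens):
    """Return raw occurrence counts for each distinct token."""
    return Counter(tokens)


def relative_frequencies(counts):
    """Normalize raw counts by the total number of tokens."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


tokens = tokenize(SAMPLE)
counts = token_frequencies(tokens)
rel = relative_frequencies(counts)

# Show the three most frequent tokens with raw and normalized frequency.
for word, n in counts.most_common(3):
    print(f"{word}\t{n}\t{rel[word]:.3f}")
```

Dividing by the total token count makes frequencies comparable across corpora of different sizes, which is the usual motivation for normalizing before any comparison.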

Knowing what is in your corpus will help you build your model for automatically identifying the tags you will be creating in the next chapter. We will introduce these concepts using linguistic examples whenever possible. Throughout the chapter, we will refer to a corpus of movie reviews assembled from IMDb.com (IMDb), which provides a useful basis for introducing these concepts.

Statistics is important for several reasons, but mostly it gives us two important abilities:

  • Data analysis: Discovering latent properties in the dataset

  • Significance for inferential statistics: Allowing us to make judgments and derive information ...

ISBN: 9781449332693