Skip to Content
Natural Language Processing with Python
book

Natural Language Processing with Python

by Steven Bird, Ewan Klein, Edward Loper
June 2009
Beginner to intermediate
504 pages
16h 27m
English
O'Reilly Media, Inc.
Content preview from Natural Language Processing with Python

Chapter 11. Managing Linguistic Data

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them. The goal of this chapter is to answer the following questions:

  1. How do we design a new language resource and ensure that its coverage, balance, and documentation support a wide range of uses?

  2. When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format?

  3. What is a good way to document the existence of a resource we have created so that others can easily find it?

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the life cycle of a corpus. As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.

Corpus Structure: A Case Study

The TIMIT Corpus was the first annotated speech database to be widely distributed, and it has an especially clear organization. TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name. It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.

The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced ...

Become an O’Reilly member and get unlimited access to this title plus top books and audiobooks from O’Reilly and nearly 200 top publishers, thousands of courses curated by job role, 150+ live events each month,
and much more.
Start your free trial

You might also like

Natural Language Processing with Python and spaCy

Natural Language Processing with Python and spaCy

Yuli Vasiliev
Hands-On Natural Language Processing with Python

Hands-On Natural Language Processing with Python

Rajesh Arumugam, Rajalingappaa Shanmugamani

Publisher Resources

ISBN: 9780596803346Errata Page