Natural Language Processing with Python

by Steven Bird, Ewan Klein, Edward Loper

June 2009

Beginner to intermediate

504 pages

16h 27m

English

O'Reilly Media, Inc.

Natural Language Processing with Python
Preface
Audience
Emphasis
What You Will Learn
Organization
Why Python?
Software Requirements
Natural Language Toolkit (NLTK)
For Instructors

Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Royalties
1. Language Processing and Python
Computing with Language: Texts and Words
    Getting Started with Python
    Getting Started with NLTK
    Searching Text
    Counting Vocabulary
A Closer Look at Python: Texts as Lists of Words
    Lists
    Indexing Lists
    Variables
    Strings
Computing with Language: Simple Statistics
    Frequency Distributions
    Fine-Grained Selection of Words
    Collocations and Bigrams
    Counting Other Things
Back to Python: Making Decisions and Taking Control
    Conditionals
    Operating on Every Element
    Nested Code Blocks
    Looping with Conditions
Automatic Natural Language Understanding
    Word Sense Disambiguation
    Pronoun Resolution
    Generating Language Output
    Machine Translation
    Spoken Dialogue Systems
    Textual Entailment
    Limitations of NLP
Summary
Further Reading
Exercises
2. Accessing Text Corpora and Lexical Resources
Accessing Text Corpora
    Gutenberg Corpus
    Web and Chat Text
    Brown Corpus
    Reuters Corpus
    Inaugural Address Corpus
    Annotated Text Corpora
    Corpora in Other Languages
    Text Corpus Structure
    Loading Your Own Corpus
Conditional Frequency Distributions
    Conditions and Events
    Counting Words by Genre
    Plotting and Tabulating Distributions
    Generating Random Text with Bigrams
More Python: Reusing Code
    Creating Programs with a Text Editor
    Functions
    Modules
Lexical Resources
    Wordlist Corpora
    A Pronouncing Dictionary
    Comparative Wordlists
    Shoebox and Toolbox Lexicons
WordNet
    Senses and Synonyms
    The WordNet Hierarchy
    More Lexical Relations
    Semantic Similarity
Summary
Further Reading
Exercises
3. Processing Raw Text
Accessing Text from the Web and from Disk
    Electronic Books
    Dealing with HTML
    Processing Search Engine Results
    Processing RSS Feeds
    Reading Local Files
    Extracting Text from PDF, MSWord, and Other Binary Formats
    Capturing User Input
    The NLP Pipeline
Strings: Text Processing at the Lowest Level
    Basic Operations with Strings
    Printing Strings
    Accessing Individual Characters
    Accessing Substrings
    More Operations on Strings
    The Difference Between Lists and Strings
Text Processing with Unicode
    What Is Unicode?
    Extracting Encoded Text from Files
    Using Your Local Encoding in Python
Regular Expressions for Detecting Word Patterns
    Using Basic Metacharacters
    Ranges and Closures
Useful Applications of Regular Expressions
    Extracting Word Pieces
    Doing More with Word Pieces
    Finding Word Stems
    Searching Tokenized Text
Normalizing Text
    Stemmers
    Lemmatization
Regular Expressions for Tokenizing Text
    Simple Approaches to Tokenization
    NLTK’s Regular Expression Tokenizer
    Further Issues with Tokenization
Segmentation
    Sentence Segmentation
    Word Segmentation
Formatting: From Lists to Strings
    From Lists to Strings
    Strings and Formats
    Lining Things Up
    Writing Results to a File
    Text Wrapping
Summary
Further Reading
Exercises
4. Writing Structured Programs
Back to the Basics
    Assignment
    Equality
    Conditionals
Sequences
    Operating on Sequence Types
    Combining Different Sequence Types
    Generator Expressions
Questions of Style
    Python Coding Style
    Procedural Versus Declarative Style
    Some Legitimate Uses for Counters
Functions: The Foundation of Structured Programming
    Function Inputs and Outputs
    Parameter Passing
    Variable Scope
    Checking Parameter Types
    Functional Decomposition
    Documenting Functions
Doing More with Functions
    Functions As Arguments
    Accumulative Functions
    Higher-Order Functions
    Named Arguments
Program Development
    Structure of a Python Module
    Multimodule Programs
    Sources of Error
    Debugging Techniques
    Defensive Programming
Algorithm Design
    Recursion
    Space-Time Trade-offs
    Dynamic Programming
A Sample of Python Libraries
    Matplotlib
    NetworkX
    csv
    NumPy
    Other Python Libraries
Summary
Further Reading
Exercises
5. Categorizing and Tagging Words
Using a Tagger
Tagged Corpora
    Representing Tagged Tokens
    Reading Tagged Corpora
    A Simplified Part-of-Speech Tagset
    Nouns
    Verbs
    Adjectives and Adverbs
    Unsimplified Tags
    Exploring Tagged Corpora
Mapping Words to Properties Using Python Dictionaries
    Indexing Lists Versus Dictionaries
    Dictionaries in Python
    Defining Dictionaries
    Default Dictionaries
    Incrementally Updating a Dictionary
    Complex Keys and Values
    Inverting a Dictionary
Automatic Tagging
    The Default Tagger
    The Regular Expression Tagger
    The Lookup Tagger
    Evaluation
N-Gram Tagging
    Unigram Tagging
    Separating the Training and Testing Data
    General N-Gram Tagging
    Combining Taggers
    Tagging Unknown Words
    Storing Taggers
    Performance Limitations
    Tagging Across Sentence Boundaries
Transformation-Based Tagging
How to Determine the Category of a Word
    Morphological Clues
    Syntactic Clues
    Semantic Clues
    New Words
    Morphology in Part-of-Speech Tagsets
Summary
Further Reading
Exercises
6. Learning to Classify Text
Supervised Classification
    Gender Identification
    Choosing the Right Features
    Document Classification
    Part-of-Speech Tagging
    Exploiting Context
    Sequence Classification
    Other Methods for Sequence Classification
Further Examples of Supervised Classification
    Sentence Segmentation
    Identifying Dialogue Act Types
    Recognizing Textual Entailment
    Scaling Up to Large Datasets
Evaluation
    The Test Set
    Accuracy
    Precision and Recall
    Confusion Matrices
    Cross-Validation
Decision Trees
    Entropy and Information Gain
Naive Bayes Classifiers
    Underlying Probabilistic Model
    Zero Counts and Smoothing
    Non-Binary Features
    The Naivete of Independence
    The Cause of Double-Counting
Maximum Entropy Classifiers
    The Maximum Entropy Model
    Maximizing Entropy
    Generative Versus Conditional Classifiers
Modeling Linguistic Patterns
    What Do Models Tell Us?
Summary
Further Reading
Exercises
7. Extracting Information from Text
Information Extraction
    Information Extraction Architecture
Chunking
    Noun Phrase Chunking
    Tag Patterns
    Chunking with Regular Expressions
    Exploring Text Corpora
    Chinking
    Representing Chunks: Tags Versus Trees
Developing and Evaluating Chunkers
    Reading IOB Format and the CoNLL-2000 Chunking Corpus
    Simple Evaluation and Baselines
    Training Classifier-Based Chunkers
Recursion in Linguistic Structure
    Building Nested Structure with Cascaded Chunkers
    Trees
    Tree Traversal
Named Entity Recognition
Relation Extraction
Summary
Further Reading
Exercises
8. Analyzing Sentence Structure
Some Grammatical Dilemmas
    Linguistic Data and Unlimited Possibilities
    Ubiquitous Ambiguity
What’s the Use of Syntax?
    Beyond n-grams
Context-Free Grammar
    A Simple Grammar
    Writing Your Own Grammars
    Recursion in Syntactic Structure
Parsing with Context-Free Grammar
    Recursive Descent Parsing
    Shift-Reduce Parsing
    The Left-Corner Parser
    Well-Formed Substring Tables
Dependencies and Dependency Grammar
    Valency and the Lexicon
    Scaling Up
Grammar Development
    Treebanks and Grammars
    Pernicious Ambiguity
    Weighted Grammar
Summary
Further Reading
Exercises
9. Building Feature-Based Grammars
Grammatical Features
    Syntactic Agreement
    Using Attributes and Constraints
    Terminology
Processing Feature Structures
    Subsumption and Unification
Extending a Feature-Based Grammar
    Subcategorization
    Heads Revisited
    Auxiliary Verbs and Inversion
    Unbounded Dependency Constructions
    Case and Gender in German
Summary
Further Reading
Exercises
10. Analyzing the Meaning of Sentences
Natural Language Understanding
    Querying a Database
    Natural Language, Semantics, and Logic
Propositional Logic
First-Order Logic
    Syntax
    First-Order Theorem Proving
    Summarizing the Language of First-Order Logic
    Truth in Model
    Individual Variables and Assignments
    Quantification
    Quantifier Scope Ambiguity
    Model Building
The Semantics of English Sentences
    Compositional Semantics in Feature-Based Grammar
    The λ-Calculus
    Quantified NPs
    Transitive Verbs
    Quantifier Ambiguity Revisited
Discourse Semantics
    Discourse Representation Theory
    Discourse Processing
Summary
Further Reading
Exercises
11. Managing Linguistic Data
Corpus Structure: A Case Study
    The Structure of TIMIT
    Notable Design Features
    Fundamental Data Types
The Life Cycle of a Corpus
    Three Corpus Creation Scenarios
    Quality Control
    Curation Versus Evolution
Acquiring Data
    Obtaining Data from the Web
    Obtaining Data from Word Processor Files
    Obtaining Data from Spreadsheets and Databases
    Converting Data Formats
    Deciding Which Layers of Annotation to Include
    Standards and Tools
    Special Considerations When Working with Endangered Languages
Working with XML
    Using XML for Linguistic Structures
    The Role of XML
    The ElementTree Interface
    Using ElementTree for Accessing Toolbox Data
    Formatting Entries
Working with Toolbox Data
    Adding a Field to Each Entry
    Validating a Toolbox Lexicon
Describing Language Resources Using OLAC Metadata
    What Is Metadata?
    OLAC: Open Language Archives Community
Summary
Further Reading
Exercises
A. Afterword: The Language Challenge
Language Processing Versus Symbol Processing
Contemporary Philosophical Divides
NLTK Roadmap
Envoi...
B. Bibliography
NLTK Index
General Index
About the Authors
Colophon

Content preview from Natural Language Processing with Python

Chapter 11. Managing Linguistic Data

Structured collections of annotated linguistic data are essential in most areas of NLP; however, we still face many obstacles in using them. The goal of this chapter is to answer the following questions:

How do we design a new language resource and ensure that its coverage, balance, and documentation support a wide range of uses?
When existing data is in the wrong format for some analysis tool, how can we convert it to a suitable format?
What is a good way to document the existence of a resource we have created so that others can easily find it?

Along the way, we will study the design of existing corpora, the typical workflow for creating a corpus, and the life cycle of a corpus. As in other chapters, there will be many examples drawn from practical experience managing linguistic data, including data that has been collected in the course of linguistic fieldwork, laboratory work, and web crawling.
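
The second question above, converting data from one format to another, can be illustrated with a minimal sketch. It is not code from the book: it assumes a simplified Toolbox-style text layout (each field on its own line, introduced by a backslash marker, with a blank line between records) and rebuilds it as XML using Python's standard `xml.etree.ElementTree` module. The sample entries, the function names `parse_records` and `records_to_xml`, and the `lexicon` and `record` element names are all hypothetical choices made for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only: two entries in a simplified Toolbox-style
# layout. The markers stand for lexeme (lx), part of speech (ps), and
# English gloss (ge); a blank line separates records.
SAMPLE = r"""
\lx kaa
\ps N
\ge gag

\lx kaakaaro
\ps N
\ge mixture
"""

def parse_records(text):
    """Split marker-prefixed text into a list of records.

    Each record is a list of (marker, value) pairs in file order.
    Simplifying assumption: lines that do not start with a backslash
    (such as continuation lines) are ignored.
    """
    records = []
    current = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the record in progress, if any.
            if current:
                records.append(current)
                current = []
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            current.append((marker, value.strip()))
    if current:
        records.append(current)
    return records

def records_to_xml(records):
    """Build an element tree with one <record> element per entry."""
    root = ET.Element("lexicon")
    for fields in records:
        record = ET.SubElement(root, "record")
        for marker, value in fields:
            # Each marker becomes a child element holding the field text.
            ET.SubElement(record, marker).text = value
    return root

records = parse_records(SAMPLE)
lexicon = records_to_xml(records)
xml_text = ET.tostring(lexicon, encoding="unicode")
print(xml_text)
```

Keeping the parsing step separate from the XML-building step means the intermediate list of `(marker, value)` pairs could feed a different output format, such as CSV, without touching the parser.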

Corpus Structure: A Case Study

The TIMIT Corpus was the first annotated speech database to be widely distributed, and it has an especially clear organization. TIMIT was developed by a consortium including Texas Instruments and MIT, from which it derives its name. It was designed to provide data for the acquisition of acoustic-phonetic knowledge and to support the development and evaluation of automatic speech recognition systems.

The Structure of TIMIT

Like the Brown Corpus, which displays a balanced selection of text genres and sources, TIMIT includes a balanced ...

Publisher Resources

ISBN: 9780596803346
