Natural Language Processing with Python

by Steven Bird, Ewan Klein, Edward Loper

June 2009

Beginner to intermediate

504 pages

16h 27m

English

O'Reilly Media, Inc.

Natural Language Processing with Python
SPECIAL OFFER: Upgrade this ebook with O’Reilly
Preface
Audience
Emphasis
What You Will Learn
Organization
Why Python?
Software Requirements
Natural Language Toolkit (NLTK)
For Instructors

Conventions Used in This Book
Using Code Examples
Safari® Books Online
How to Contact Us
Acknowledgments
Royalties
1. Language Processing and Python
Computing with Language: Texts and Words
    Getting Started with Python
    Getting Started with NLTK
    Searching Text
    Counting Vocabulary
A Closer Look at Python: Texts as Lists of Words
    Lists
    Indexing Lists
    Variables
    Strings
Computing with Language: Simple Statistics
    Frequency Distributions
    Fine-Grained Selection of Words
    Collocations and Bigrams
    Counting Other Things
Back to Python: Making Decisions and Taking Control
    Conditionals
    Operating on Every Element
    Nested Code Blocks
    Looping with Conditions
Automatic Natural Language Understanding
    Word Sense Disambiguation
    Pronoun Resolution
    Generating Language Output
    Machine Translation
    Spoken Dialogue Systems
    Textual Entailment
    Limitations of NLP
Summary
Further Reading
Exercises
2. Accessing Text Corpora and Lexical Resources
Accessing Text Corpora
    Gutenberg Corpus
    Web and Chat Text
    Brown Corpus
    Reuters Corpus
    Inaugural Address Corpus
    Annotated Text Corpora
    Corpora in Other Languages
    Text Corpus Structure
    Loading Your Own Corpus
Conditional Frequency Distributions
    Conditions and Events
    Counting Words by Genre
    Plotting and Tabulating Distributions
    Generating Random Text with Bigrams
More Python: Reusing Code
    Creating Programs with a Text Editor
    Functions
    Modules
Lexical Resources
    Wordlist Corpora
    A Pronouncing Dictionary
    Comparative Wordlists
    Shoebox and Toolbox Lexicons
WordNet
    Senses and Synonyms
    The WordNet Hierarchy
    More Lexical Relations
    Semantic Similarity
Summary
Further Reading
Exercises
3. Processing Raw Text
Accessing Text from the Web and from Disk
    Electronic Books
    Dealing with HTML
    Processing Search Engine Results
    Processing RSS Feeds
    Reading Local Files
    Extracting Text from PDF, MSWord, and Other Binary Formats
    Capturing User Input
    The NLP Pipeline
Strings: Text Processing at the Lowest Level
    Basic Operations with Strings
    Printing Strings
    Accessing Individual Characters
    Accessing Substrings
    More Operations on Strings
    The Difference Between Lists and Strings
Text Processing with Unicode
    What Is Unicode?
    Extracting Encoded Text from Files
    Using Your Local Encoding in Python
Regular Expressions for Detecting Word Patterns
    Using Basic Metacharacters
    Ranges and Closures
Useful Applications of Regular Expressions
    Extracting Word Pieces
    Doing More with Word Pieces
    Finding Word Stems
    Searching Tokenized Text
Normalizing Text
    Stemmers
    Lemmatization
Regular Expressions for Tokenizing Text
    Simple Approaches to Tokenization
    NLTK’s Regular Expression Tokenizer
    Further Issues with Tokenization
Segmentation
    Sentence Segmentation
    Word Segmentation
Formatting: From Lists to Strings
    From Lists to Strings
    Strings and Formats
    Lining Things Up
    Writing Results to a File
    Text Wrapping
Summary
Further Reading
Exercises
4. Writing Structured Programs
Back to the Basics
    Assignment
    Equality
    Conditionals
Sequences
    Operating on Sequence Types
    Combining Different Sequence Types
    Generator Expressions
Questions of Style
    Python Coding Style
    Procedural Versus Declarative Style
    Some Legitimate Uses for Counters
Functions: The Foundation of Structured Programming
    Function Inputs and Outputs
    Parameter Passing
    Variable Scope
    Checking Parameter Types
    Functional Decomposition
    Documenting Functions
Doing More with Functions
    Functions As Arguments
    Accumulative Functions
    Higher-Order Functions
    Named Arguments
Program Development
    Structure of a Python Module
    Multimodule Programs
    Sources of Error
    Debugging Techniques
    Defensive Programming
Algorithm Design
    Recursion
    Space-Time Trade-offs
    Dynamic Programming
A Sample of Python Libraries
    Matplotlib
    NetworkX
    csv
    NumPy
    Other Python Libraries
Summary
Further Reading
Exercises
5. Categorizing and Tagging Words
Using a Tagger
Tagged Corpora
    Representing Tagged Tokens
    Reading Tagged Corpora
    A Simplified Part-of-Speech Tagset
    Nouns
    Verbs
    Adjectives and Adverbs
    Unsimplified Tags
    Exploring Tagged Corpora
Mapping Words to Properties Using Python Dictionaries
    Indexing Lists Versus Dictionaries
    Dictionaries in Python
    Defining Dictionaries
    Default Dictionaries
    Incrementally Updating a Dictionary
    Complex Keys and Values
    Inverting a Dictionary
Automatic Tagging
    The Default Tagger
    The Regular Expression Tagger
    The Lookup Tagger
    Evaluation
N-Gram Tagging
    Unigram Tagging
    Separating the Training and Testing Data
    General N-Gram Tagging
    Combining Taggers
    Tagging Unknown Words
    Storing Taggers
    Performance Limitations
    Tagging Across Sentence Boundaries
Transformation-Based Tagging
How to Determine the Category of a Word
    Morphological Clues
    Syntactic Clues
    Semantic Clues
    New Words
    Morphology in Part-of-Speech Tagsets
Summary
Further Reading
Exercises
6. Learning to Classify Text
Supervised Classification
    Gender Identification
    Choosing the Right Features
    Document Classification
    Part-of-Speech Tagging
    Exploiting Context
    Sequence Classification
    Other Methods for Sequence Classification
Further Examples of Supervised Classification
    Sentence Segmentation
    Identifying Dialogue Act Types
    Recognizing Textual Entailment
    Scaling Up to Large Datasets
Evaluation
    The Test Set
    Accuracy
    Precision and Recall
    Confusion Matrices
    Cross-Validation
Decision Trees
    Entropy and Information Gain
Naive Bayes Classifiers
    Underlying Probabilistic Model
    Zero Counts and Smoothing
    Non-Binary Features
    The Naivete of Independence
    The Cause of Double-Counting
Maximum Entropy Classifiers
    The Maximum Entropy Model
    Maximizing Entropy
    Generative Versus Conditional Classifiers
Modeling Linguistic Patterns
    What Do Models Tell Us?
Summary
Further Reading
Exercises
7. Extracting Information from Text
Information Extraction
    Information Extraction Architecture
Chunking
    Noun Phrase Chunking
    Tag Patterns
    Chunking with Regular Expressions
    Exploring Text Corpora
    Chinking
    Representing Chunks: Tags Versus Trees
Developing and Evaluating Chunkers
    Reading IOB Format and the CoNLL-2000 Chunking Corpus
    Simple Evaluation and Baselines
    Training Classifier-Based Chunkers
Recursion in Linguistic Structure
    Building Nested Structure with Cascaded Chunkers
    Trees
    Tree Traversal
Named Entity Recognition
Relation Extraction
Summary
Further Reading
Exercises
8. Analyzing Sentence Structure
Some Grammatical Dilemmas
    Linguistic Data and Unlimited Possibilities
    Ubiquitous Ambiguity
What’s the Use of Syntax?
    Beyond n-grams
Context-Free Grammar
    A Simple Grammar
    Writing Your Own Grammars
    Recursion in Syntactic Structure
Parsing with Context-Free Grammar
    Recursive Descent Parsing
    Shift-Reduce Parsing
    The Left-Corner Parser
    Well-Formed Substring Tables
Dependencies and Dependency Grammar
    Valency and the Lexicon
    Scaling Up
Grammar Development
    Treebanks and Grammars
    Pernicious Ambiguity
    Weighted Grammar
Summary
Further Reading
Exercises
9. Building Feature-Based Grammars
Grammatical Features
    Syntactic Agreement
    Using Attributes and Constraints
    Terminology
Processing Feature Structures
    Subsumption and Unification
Extending a Feature-Based Grammar
    Subcategorization
    Heads Revisited
    Auxiliary Verbs and Inversion
    Unbounded Dependency Constructions
    Case and Gender in German
Summary
Further Reading
Exercises
10. Analyzing the Meaning of Sentences
Natural Language Understanding
    Querying a Database
    Natural Language, Semantics, and Logic
Propositional Logic
First-Order Logic
    Syntax
    First-Order Theorem Proving
    Summarizing the Language of First-Order Logic
    Truth in Model
    Individual Variables and Assignments
    Quantification
    Quantifier Scope Ambiguity
    Model Building
The Semantics of English Sentences
    Compositional Semantics in Feature-Based Grammar
    The λ-Calculus
    Quantified NPs
    Transitive Verbs
    Quantifier Ambiguity Revisited
Discourse Semantics
    Discourse Representation Theory
    Discourse Processing
Summary
Further Reading
Exercises
11. Managing Linguistic Data
Corpus Structure: A Case Study
    The Structure of TIMIT
    Notable Design Features
    Fundamental Data Types
The Life Cycle of a Corpus
    Three Corpus Creation Scenarios
    Quality Control
    Curation Versus Evolution
Acquiring Data
    Obtaining Data from the Web
    Obtaining Data from Word Processor Files
    Obtaining Data from Spreadsheets and Databases
    Converting Data Formats
    Deciding Which Layers of Annotation to Include
    Standards and Tools
    Special Considerations When Working with Endangered Languages
Working with XML
    Using XML for Linguistic Structures
    The Role of XML
    The ElementTree Interface
    Using ElementTree for Accessing Toolbox Data
    Formatting Entries
Working with Toolbox Data
    Adding a Field to Each Entry
    Validating a Toolbox Lexicon
Describing Language Resources Using OLAC Metadata
    What Is Metadata?
    OLAC: Open Language Archives Community
Summary
Further Reading
Exercises
A. Afterword: The Language Challenge
Language Processing Versus Symbol Processing
Contemporary Philosophical Divides
NLTK Roadmap
Envoi...
B. Bibliography
NLTK Index
General Index
About the Authors
Colophon
SPECIAL OFFER: Upgrade this ebook with O’Reilly

Content preview from Natural Language Processing with Python

Chapter 3. Processing Raw Text

The most important source of texts is undoubtedly the Web. It’s convenient to have existing text collections to explore, such as the corpora we saw in the previous chapters. However, you probably have your own text sources in mind, and need to learn how to access them.

The goal of this chapter is to answer the following questions:

How can we write programs to access text from local files and from the Web, in order to get hold of an unlimited range of language material?
How can we split documents up into individual words and punctuation symbols, so we can carry out the same kinds of analysis we did with text corpora in earlier chapters?
How can we write programs to produce formatted output and save it in a file?
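
The three questions above can be previewed with a minimal sketch that uses only the Python standard library. It is not the chapter's own code: the file names, the `tokenize` and `write_counts` helpers, and the regular expression are assumptions chosen for illustration, and the "document" is a one-line sample written to a temporary directory.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern for this sketch: a run of word characters,
# or a single character that is neither a word character nor whitespace.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Split raw text into word and punctuation tokens (hypothetical helper)."""
    return TOKEN_PATTERN.findall(text)

def write_counts(tokens, path):
    """Write one aligned 'token count' line per distinct token, most frequent first."""
    counts = {}
    for token in tokens:
        counts[token] = counts.get(token, 0) + 1
    # Sort by descending count, then alphabetically, so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    with open(path, "w", encoding="utf-8") as out:
        for token, count in ranked:
            out.write(f"{token:<12}{count:>4}\n")

# 1. Access text from a local file (created here so the sketch is self-contained).
workdir = Path(tempfile.mkdtemp())
source = workdir / "sample.txt"
source.write_text("The cat sat. The dog sat, too.", encoding="utf-8")
raw = source.read_text(encoding="utf-8")

# 2. Split the document into words and punctuation symbols.
tokens = tokenize(raw)

# 3. Produce formatted output and save it in a file.
report = workdir / "counts.txt"
write_counts(tokens, report)
print(report.read_text(encoding="utf-8"))
```

A single regular expression is a deliberately crude tokenizer; it splits contractions and hyphenated words, which is the kind of issue the chapter's tokenization sections deal with.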

In order to address these questions, we will be covering key concepts in NLP, including tokenization and stemming. Along the way you will consolidate your Python knowledge and learn about strings, files, and regular expressions. Since so much text on the Web is in HTML format, we will also see how to dispense with markup.
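
As a rough illustration of two of these ideas, dispensing with markup and stemming, the sketch below uses the standard library's `html.parser` and a toy suffix stripper. Both are stand-ins chosen so the example runs without extra packages; `TextExtractor`, `strip_markup`, and `naive_stem` are hypothetical names, and the suffix list is an assumption that is far cruder than a real stemmer.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes and ignore the tags around them (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; this sketch does not filter
        # <script> or <style> content, which a real page would need.
        self.chunks.append(data)

def strip_markup(html):
    """Return the text of an HTML string with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())

def naive_stem(word):
    """Strip one common English suffix, keeping at least three characters."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

html = "<html><body><h1>Processing</h1><p>She walked and <b>talks</b>.</p></body></html>"
text = strip_markup(html)
tokens = re.findall(r"\w+|[^\w\s]", text)
stems = [naive_stem(token.lower()) for token in tokens]

print(tokens)
print(stems)
```

Joining the text nodes with spaces keeps the heading and the paragraph from fusing into one word, at the cost of a stray space before the final period; that does not affect the tokens.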

Note

Important: From this chapter onwards, our program samples will assume you begin your interactive session or your program with the following import statements:

>>> from __future__ import division
>>> import nltk, re, pprint

Accessing Text from the Web and from Disk

Electronic Books

A small sample of texts from Project Gutenberg appears in the NLTK corpus collection. However, you may be interested in analyzing other texts from Project ...

Publisher Resources

ISBN: 9780596803346
