Python 3 Text Processing with NLTK 3 Cookbook - Second Edition

Author: Jacob Perkins
ISBN: 9781782167853

August 2014

Beginner to intermediate

304 pages

7h 10m

English

Packt Publishing

Table of Contents
Python 3 Text Processing with NLTK 3 Cookbook
Credits
About the Author
About the Reviewers
www.PacktPub.com
Support files, eBooks, discount offers, and more | Why Subscribe? | Free Access for Packt account holders
Preface
What this book covers
What you need for this book
Who this book is for

Conventions
Reader feedback
Customer support
Downloading the example code | Errata | Piracy | Questions
1. Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Getting ready | How to do it... | How it works... | There's more... | Tokenizing sentences in other languages | See also
Tokenizing sentences into words
How to do it... | How it works... | There's more... | Separating contractions | PunktWordTokenizer | WordPunctTokenizer | See also
Tokenizing sentences using regular expressions
Getting ready | How to do it... | How it works... | There's more... | Simple whitespace tokenizer | See also
Training a sentence tokenizer
Getting ready | How to do it... | How it works... | There's more... | See also
Filtering stopwords in a tokenized sentence
Getting ready | How to do it... | How it works... | There's more... | See also
Looking up Synsets for a word in WordNet
Getting ready | How to do it... | How it works... | There's more... | Working with hypernyms | Part of speech (POS) | See also
Looking up lemmas and synonyms in WordNet
How to do it... | How it works... | There's more... | All possible synonyms | Antonyms | See also
Calculating WordNet Synset similarity
How to do it... | How it works... | There's more... | Comparing verbs | Path and Leacock Chodorow (LCH) similarity | See also
Discovering word collocations
Getting ready | How to do it... | How it works... | There's more... | Scoring functions | Scoring ngrams | See also
2. Replacing and Correcting Words
Introduction
Stemming words
How to do it... | How it works... | There's more... | The LancasterStemmer class | The RegexpStemmer class | The SnowballStemmer class | See also
Lemmatizing words with WordNet
Getting ready | How to do it... | How it works... | There's more... | Combining stemming with lemmatization | See also
Replacing words matching regular expressions
Getting ready | How to do it... | How it works... | There's more... | Replacement before tokenization | See also
Removing repeating characters
Getting ready | How to do it... | How it works... | There's more... | See also
Spelling correction with Enchant
Getting ready | How to do it... | How it works... | There's more... | The en_GB dictionary | Personal word lists | See also
Replacing synonyms
Getting ready | How to do it... | How it works... | There's more... | CSV synonym replacement | YAML synonym replacement | See also
Replacing negations with antonyms
How to do it... | How it works... | There's more... | See also
3. Creating Custom Corpora
Introduction
Setting up a custom corpus
Getting ready | How to do it... | How it works... | There's more... | Loading a YAML file | See also
Creating a wordlist corpus
Getting ready | How to do it... | How it works... | There's more... | Names wordlist corpus | English words corpus | See also
Creating a part-of-speech tagged word corpus
Getting ready | How to do it... | How it works... | There's more... | Customizing the word tokenizer | Customizing the sentence tokenizer | Customizing the paragraph block reader | Customizing the tag separator | Converting tags to a universal tagset | See also
Creating a chunked phrase corpus
Getting ready | How to do it... | How it works... | There's more... | Tree leaves | Treebank chunk corpus | CoNLL2000 corpus | See also
Creating a categorized text corpus
Getting ready | How to do it... | How it works... | There's more... | Category file | Categorized tagged corpus reader | Categorized corpora | See also
Creating a categorized chunk corpus reader
Getting ready | How to do it... | How it works... | There's more... | Categorized CoNLL chunk corpus reader | See also
Lazy corpus loading
How to do it... | How it works... | There's more...
Creating a custom corpus view
How to do it... | How it works... | There's more... | Block reader functions | Pickle corpus view | Concatenated corpus view | See also
Creating a MongoDB-backed corpus reader
Getting ready | How to do it... | How it works... | There's more... | See also
Corpus editing with file locking
Getting ready | How to do it... | How it works...
4. Part-of-speech Tagging
Introduction
Default tagging
Getting ready | How to do it... | How it works... | There's more... | Evaluating accuracy | Tagging sentences | Untagging a tagged sentence | See also
Training a unigram part-of-speech tagger
How to do it... | How it works... | There's more... | Overriding the context model | Minimum frequency cutoff | See also
Combining taggers with backoff tagging
How to do it... | How it works... | There's more... | Saving and loading a trained tagger with pickle | See also
Training and combining ngram taggers
Getting ready | How to do it... | How it works... | There's more... | Quadgram tagger | See also
Creating a model of likely word tags
How to do it... | How it works... | There's more... | See also
Tagging with regular expressions
Getting ready | How to do it... | How it works... | There's more... | See also
Affix tagging
How to do it... | How it works... | There's more... | Working with min_stem_length | See also
Training a Brill tagger
How to do it... | How it works... | There's more... | Tracing | See also
Training the TnT tagger
How to do it... | How it works... | There's more... | Controlling the beam search | Significance of capitalization | See also
Using WordNet for tagging
Getting ready | How to do it... | How it works... | See also
Tagging proper names
How to do it... | How it works... | See also
Classifier-based tagging
How to do it... | How it works... | There's more... | Detecting features with a custom feature detector | Setting a cutoff probability | Using a pre-trained classifier | See also
Training a tagger with NLTK-Trainer
How to do it... | How it works... | There's more... | Saving a pickled tagger | Training on a custom corpus | Training with universal tags | Analyzing a tagger against a tagged corpus | Analyzing a tagged corpus | See also
5. Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Getting ready | How to do it... | How it works... | There's more... | Parsing different chunk types | Parsing alternative patterns | Chunk rule with context | See also
Merging and splitting chunks with regular expressions
How to do it... | How it works... | There's more... | Specifying rule descriptions | See also
Expanding and removing chunks with regular expressions
How to do it... | How it works... | There's more... | See also
Partial parsing with regular expressions
How to do it... | How it works... | There's more... | The ChunkScore metrics | Looping and tracing chunk rules | See also
Training a tagger-based chunker
How to do it... | How it works... | There's more... | Using different taggers | See also
Classification-based chunking
How to do it... | How it works... | There's more... | Using a different classifier builder | See also
Extracting named entities
How to do it... | How it works... | There's more... | Binary named entity extraction | See also
Extracting proper noun chunks
How to do it... | How it works... | There's more... | See also
Extracting location chunks
How to do it... | How it works... | There's more... | See also
Training a named entity chunker
How to do it... | How it works... | There's more... | See also
Training a chunker with NLTK-Trainer
How to do it... | How it works... | There's more... | Saving a pickled chunker | Training a named entity chunker | Training on a custom corpus | Training on parse trees | Analyzing a chunker against a chunked corpus | Analyzing a chunked corpus | See also
6. Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Getting ready | How to do it... | How it works... | There's more... | See also
Correcting verb forms
Getting ready | How to do it... | How it works... | See also
Swapping verb phrases
How to do it... | How it works... | There's more... | See also
Swapping noun cardinals
How to do it... | How it works... | See also
Swapping infinitive phrases
How to do it... | How it works... | There's more... | See also
Singularizing plural nouns
How to do it... | How it works... | See also
Chaining chunk transformations
How to do it... | How it works... | There's more... | See also
Converting a chunk tree to text
How to do it... | How it works... | There's more... | See also
Flattening a deep tree
Getting ready | How to do it... | How it works... | There's more... | The cess_esp and cess_cat treebank | See also
Creating a shallow tree
How to do it... | How it works... | See also
Converting tree labels
Getting ready | How to do it... | How it works... | See also
7. Text Classification
Introduction
Bag of words feature extraction
How to do it... | How it works... | There's more... | Filtering stopwords | Including significant bigrams | See also
Training a Naive Bayes classifier
Getting ready | How to do it... | How it works... | There's more... | Classification probability | Most informative features | Training estimator | Manual training | See also
Training a decision tree classifier
How to do it... | How it works... | There's more... | Controlling uncertainty with entropy_cutoff | Controlling tree depth with depth_cutoff | Controlling decisions with support_cutoff | See also
Training a maximum entropy classifier
Getting ready | How to do it... | How it works... | There's more... | Megam algorithm | See also
Training scikit-learn classifiers
Getting ready | How to do it... | How it works... | There's more... | Comparing Naive Bayes algorithms | Training with logistic regression | Training with LinearSVC | See also
Measuring precision and recall of a classifier
How to do it... | How it works... | There's more... | F-measure | See also
Calculating high information words
How to do it... | How it works... | There's more... | The MaxentClassifier class with high information words | The DecisionTreeClassifier class with high information words | The SklearnClassifier class with high information words | See also
Combining classifiers with voting
Getting ready | How to do it... | How it works... | See also
Classifying with multiple binary classifiers
Getting ready | How to do it... | How it works... | There's more... | See also
Training a classifier with NLTK-Trainer
How to do it... | How it works... | There's more... | Saving a pickled classifier | Using different training instances | The most informative features | The Maxent and LogisticRegression classifiers | SVMs | Combining classifiers | High information words and bigrams | Cross-fold validation | Analyzing a classifier | See also
8. Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Getting ready | How to do it... | How it works... | There's more... | Creating multiple channels | Local versus remote gateways | See also
Distributed chunking with execnet
Getting ready | How to do it... | How it works... | There's more... | Python subprocesses | See also
Parallel list processing with execnet
How to do it... | How it works... | There's more... | See also
Storing a frequency distribution in Redis
Getting ready | How to do it... | How it works... | There's more... | See also
Storing a conditional frequency distribution in Redis
Getting ready | How to do it... | How it works... | There's more... | See also
Storing an ordered dictionary in Redis
Getting ready | How to do it... | How it works... | There's more... | See also
Distributed word scoring with Redis and execnet
Getting ready | How to do it... | How it works... | There's more... | See also
9. Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Getting ready | How to do it... | How it works... | There's more... | See also
Timezone lookup and conversion
Getting ready | How to do it... | How it works... | There's more... | Local timezone | Custom offsets | See also
Extracting URLs from HTML with lxml
Getting ready | How to do it... | How it works... | There's more... | Extracting links directly | Parsing HTML from URLs or files | Extracting links with XPaths | See also
Cleaning and stripping HTML
Getting ready | How to do it... | How it works... | There's more... | See also
Converting HTML entities with BeautifulSoup
Getting ready | How to do it... | How it works... | There's more... | Extracting URLs with BeautifulSoup | See also
Detecting and converting character encodings
Getting ready | How to do it... | How it works... | There's more... | Converting to ASCII | UnicodeDammit conversion | See also
A. Penn Treebank Part-of-speech Tags
Index

Content preview from Python 3 Text Processing with NLTK 3 Cookbook - Second Edition

Cleaning and stripping HTML

Cleaning up text is one of the unfortunate but entirely necessary aspects of text processing. When parsing HTML, you probably don't want to deal with any embedded JavaScript or CSS; you are interested only in the tags and text.

Getting ready

You'll need to install lxml. See the previous recipe or http://lxml.de/installation.html for installation instructions.

How to do it...

We can use the clean_html() function in the lxml.html.clean module to remove unnecessary HTML tags and embedded JavaScript from an HTML string:

>>> import lxml.html.clean
>>> lxml.html.clean.clean_html('<html><head></head><body onload=loadfunc()>my text</body></html>')
'<div><body>my text</body></div>'

The result is much cleaner and easier ...
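
To make concrete what this kind of cleaning involves, here is a minimal sketch that uses only the standard library's html.parser module. It is not how lxml implements clean_html(), and it does not reproduce lxml's exact output (it leaves the html and head tags in place, for example). The names ScriptStripper and strip_active_content are hypothetical, and the choice to drop only script and style elements plus attributes whose names start with "on" is an assumption made for illustration. Also, as far as we know, recent lxml releases (5.2 and later) ship the cleaner as the separate lxml_html_clean package, so the import shown above may need adjusting on a current installation.

```python
from html import escape
from html.parser import HTMLParser

# Elements whose whole content is discarded (an assumption for this sketch).
SKIPPED = {"script", "style"}
# Void elements have no closing tag, so none is emitted for them.
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ScriptStripper(HTMLParser):
    """Hypothetical helper: rebuilds HTML without scripts, styles, or on* attributes."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # greater than 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Drop event-handler attributes such as onload or onclick.
        kept = [(k, v) for k, v in attrs if not k.startswith("on")]
        rendered = "".join(
            f" {k}" if v is None else f' {k}="{escape(v, quote=True)}"'
            for k, v in kept
        )
        self.parts.append(f"<{tag}{rendered}>")

    def handle_endtag(self, tag):
        if tag in SKIPPED:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif not self.skip_depth and tag not in VOID:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skip_depth:
            # The parser hands over unescaped text, so escape it again.
            self.parts.append(escape(data, quote=False))


def strip_active_content(html_text):
    """Return html_text without script/style elements and on* attributes."""
    parser = ScriptStripper()
    parser.feed(html_text)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)


print(strip_active_content(
    "<html><head></head><body onload=loadfunc()>my text</body></html>"
))
# prints: <html><head></head><body>my text</body></html>
```

For real input, prefer lxml's cleaner: a hand-rolled filter like this one misses many cases (javascript: URLs, embedded objects, malformed markup) and is only meant to show the idea.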
