Natural Language Processing: Python and NLTK

Name: Natural Language Processing: Python and NLTK
ISBN: 9781787285101

by Nitin Hardeniya, Jacob Perkins, Deepti Chopra, Nisheeth Joshi, Iti Mathur

November 2016

Beginner to intermediate

687 pages

15h 31m

English

Packt Publishing

Table of Contents
Credits
Preface
What this learning path covers
What you need for this learning path
Who this learning path is for
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions

1. Module 1
1. Introduction to Natural Language Processing
Why learn NLP?
Let's start playing with Python!
Lists
Helping yourself
Regular expressions
Dictionaries
Writing functions
Diving into NLTK
Your turn
Summary
2. Text Wrangling and Cleansing
What is text wrangling?
Text cleansing
Sentence splitter
Tokenization
Stemming
Lemmatization
Stop word removal
Rare word removal
Spell correction
Your turn
Summary
3. Part of Speech Tagging
What is Part of speech tagging
Stanford tagger
Diving deep into a tagger
Sequential tagger
N-gram tagger
Regex tagger
Brill tagger
Machine learning based tagger
Named Entity Recognition (NER)
NER tagger
Your Turn
Summary
4. Parsing Structure in Text
Shallow versus deep parsing
The two approaches in parsing
Why we need parsing
Different types of parsers
A recursive descent parser
A shift-reduce parser
A chart parser
A regex parser
Dependency parsing
Chunking
Information extraction
Named-entity recognition (NER)
Relation extraction
Summary
5. NLP Applications
Building your first NLP application
Other NLP applications
Machine translation
Statistical machine translation
Information retrieval
Boolean retrieval
Vector space model
The probabilistic model
Speech recognition
Text classification
Information extraction
Question answering systems
Dialog systems
Word sense disambiguation
Topic modeling
Language detection
Optical character recognition
Summary
6. Text Classification
Machine learning
Text classification
Sampling
Naive Bayes
Decision trees
Stochastic gradient descent
Logistic regression
Support vector machines
The Random forest algorithm
Text clustering
K-means
Topic modeling in text
Installing gensim
References
Summary
7. Web Crawling
Web crawlers
Writing your first crawler
Data flow in Scrapy
The Scrapy shell
Items
The Sitemap spider
The item pipeline
External references
Summary
8. Using NLTK with Other Python Libraries
NumPy
ndarray
Indexing
Basic operations
Extracting data from an array
Complex matrix operations
Reshaping and stacking
Random numbers
SciPy
Linear algebra
eigenvalues and eigenvectors
The sparse matrix
Optimization
pandas
Reading data
Series data
Column transformation
Noisy data
matplotlib
Subplot
Adding an axis
A scatter plot
A bar plot
3D plots
External references
Summary
9. Social Media Mining in Python
Data collection
Twitter
Data extraction
Trending topics
Geovisualization
Influencers detection
Facebook
Influencer friends
Summary
10. Text Mining at Scale
Different ways of using Python on Hadoop
Python streaming
Hive/Pig UDF
Streaming wrappers
NLTK on Hadoop
A UDF
Python streaming
Scikit-learn on Hadoop
PySpark
Summary
2. Module 2
1. Tokenizing Text and WordNet Basics
Introduction
Tokenizing text into sentences
Getting ready
How to do it...
How it works...
There's more...
Tokenizing sentences in other languages
See also
Tokenizing sentences into words
How to do it...
How it works...
There's more...
Separating contractions
PunktWordTokenizer
WordPunctTokenizer
See also
Tokenizing sentences using regular expressions
Getting ready
How to do it...
How it works...
There's more...
Simple whitespace tokenizer
See also
Training a sentence tokenizer
Getting ready
How to do it...
How it works...
There's more...
See also
Filtering stopwords in a tokenized sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Looking up Synsets for a word in WordNet
Getting ready
How to do it...
How it works...
There's more...
Working with hypernyms
Part of speech (POS)
See also
Looking up lemmas and synonyms in WordNet
How to do it...
How it works...
There's more...
All possible synonyms
Antonyms
See also
Calculating WordNet Synset similarity
How to do it...
How it works...
There's more...
Comparing verbs
Path and Leacock Chodorow (LCH) similarity
See also
Discovering word collocations
Getting ready
How to do it...
How it works...
There's more...
Scoring functions
Scoring ngrams
See also
2. Replacing and Correcting Words
Introduction
Stemming words
How to do it...
How it works...
There's more...
The LancasterStemmer class
The RegexpStemmer class
The SnowballStemmer class
See also
Lemmatizing words with WordNet
Getting ready
How to do it...
How it works...
There's more...
Combining stemming with lemmatization
See also
Replacing words matching regular expressions
Getting ready
How to do it...
How it works...
There's more...
Replacement before tokenization
See also
Removing repeating characters
Getting ready
How to do it...
How it works...
There's more...
See also
Spelling correction with Enchant
Getting ready
How to do it...
How it works...
There's more...
The en_GB dictionary
Personal word lists
See also
Replacing synonyms
Getting ready
How to do it...
How it works...
There's more...
CSV synonym replacement
YAML synonym replacement
See also
Replacing negations with antonyms
How to do it...
How it works...
There's more...
See also
3. Creating Custom Corpora
Introduction
Setting up a custom corpus
Getting ready
How to do it...
How it works...
There's more...
Loading a YAML file
See also
Creating a wordlist corpus
Getting ready
How to do it...
How it works...
There's more...
Names wordlist corpus
English words corpus
See also
Creating a part-of-speech tagged word corpus
Getting ready
How to do it...
How it works...
There's more...
Customizing the word tokenizer
Customizing the sentence tokenizer
Customizing the paragraph block reader
Customizing the tag separator
Converting tags to a universal tagset
See also
Creating a chunked phrase corpus
Getting ready
How to do it...
How it works...
There's more...
Tree leaves
Treebank chunk corpus
CoNLL2000 corpus
See also
Creating a categorized text corpus
Getting ready
How to do it...
How it works...
There's more...
Category file
Categorized tagged corpus reader
Categorized corpora
See also
Creating a categorized chunk corpus reader
Getting ready
How to do it...
How it works...
There's more...
Categorized CoNLL chunk corpus reader
See also
Lazy corpus loading
How to do it...
How it works...
There's more...
Creating a custom corpus view
How to do it...
How it works...
There's more...
Block reader functions
Pickle corpus view
Concatenated corpus view
See also
Creating a MongoDB-backed corpus reader
Getting ready
How to do it...
How it works...
There's more...
See also
Corpus editing with file locking
Getting ready
How to do it...
How it works...
4. Part-of-speech Tagging
Introduction
Default tagging
Getting ready
How to do it...
How it works...
There's more...
Evaluating accuracy
Tagging sentences
Untagging a tagged sentence
See also
Training a unigram part-of-speech tagger
How to do it...
How it works...
There's more...
Overriding the context model
Minimum frequency cutoff
See also
Combining taggers with backoff tagging
How to do it...
How it works...
There's more...
Saving and loading a trained tagger with pickle
See also
Training and combining ngram taggers
Getting ready
How to do it...
How it works...
There's more...
Quadgram tagger
See also
Creating a model of likely word tags
How to do it...
How it works...
There's more...
See also
Tagging with regular expressions
Getting ready
How to do it...
How it works...
There's more...
See also
Affix tagging
How to do it...
How it works...
There's more...
Working with min_stem_length
See also
Training a Brill tagger
How to do it...
How it works...
There's more...
Tracing
See also
Training the TnT tagger
How to do it...
How it works...
There's more...
Controlling the beam search
Significance of capitalization
See also
Using WordNet for tagging
Getting ready
How to do it...
How it works...
See also
Tagging proper names
How to do it...
How it works...
See also
Classifier-based tagging
How to do it...
How it works...
There's more...
Detecting features with a custom feature detector
Setting a cutoff probability
Using a pre-trained classifier
See also
Training a tagger with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled tagger
Training on a custom corpus
Training with universal tags
Analyzing a tagger against a tagged corpus
Analyzing a tagged corpus
See also
5. Extracting Chunks
Introduction
Chunking and chinking with regular expressions
Getting ready
How to do it...
How it works...
There's more...
Parsing different chunk types
Parsing alternative patterns
Chunk rule with context
See also
Merging and splitting chunks with regular expressions
How to do it...
How it works...
There's more...
Specifying rule descriptions
See also
Expanding and removing chunks with regular expressions
How to do it...
How it works...
There's more...
See also
Partial parsing with regular expressions
How to do it...
How it works...
There's more...
The ChunkScore metrics
Looping and tracing chunk rules
See also
Training a tagger-based chunker
How to do it...
How it works...
There's more...
Using different taggers
See also
Classification-based chunking
How to do it...
How it works...
There's more...
Using a different classifier builder
See also
Extracting named entities
How to do it...
How it works...
There's more...
Binary named entity extraction
See also
Extracting proper noun chunks
How to do it...
How it works...
There's more...
See also
Extracting location chunks
How to do it...
How it works...
There's more...
See also
Training a named entity chunker
How to do it...
How it works...
There's more...
See also
Training a chunker with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled chunker
Training a named entity chunker
Training on a custom corpus
Training on parse trees
Analyzing a chunker against a chunked corpus
Analyzing a chunked corpus
See also
6. Transforming Chunks and Trees
Introduction
Filtering insignificant words from a sentence
Getting ready
How to do it...
How it works...
There's more...
See also
Correcting verb forms
Getting ready
How to do it...
How it works...
See also
Swapping verb phrases
How to do it...
How it works...
There's more...
See also
Swapping noun cardinals
How to do it...
How it works...
See also
Swapping infinitive phrases
How to do it...
How it works...
There's more...
See also
Singularizing plural nouns
How to do it...
How it works...
See also
Chaining chunk transformations
How to do it...
How it works...
There's more...
See also
Converting a chunk tree to text
How to do it...
How it works...
There's more...
See also
Flattening a deep tree
Getting ready
How to do it...
How it works...
There's more...
The cess_esp and cess_cat treebank
See also
Creating a shallow tree
How to do it...
How it works...
See also
Converting tree labels
Getting ready
How to do it...
How it works...
See also
7. Text Classification
Introduction
Bag of words feature extraction
How to do it...
How it works...
There's more...
Filtering stopwords
Including significant bigrams
See also
Training a Naive Bayes classifier
Getting ready
How to do it...
How it works...
There's more...
Classification probability
Most informative features
Training estimator
Manual training
See also
Training a decision tree classifier
How to do it...
How it works...
There's more...
Controlling uncertainty with entropy_cutoff
Controlling tree depth with depth_cutoff
Controlling decisions with support_cutoff
See also
Training a maximum entropy classifier
Getting ready
How to do it...
How it works...
There's more...
Megam algorithm
See also
Training scikit-learn classifiers
Getting ready
How to do it...
How it works...
There's more...
Comparing Naive Bayes algorithms
Training with logistic regression
Training with LinearSVC
See also
Measuring precision and recall of a classifier
How to do it...
How it works...
There's more...
F-measure
See also
Calculating high information words
How to do it...
How it works...
There's more...
The MaxentClassifier class with high information words
The DecisionTreeClassifier class with high information words
The SklearnClassifier class with high information words
See also
Combining classifiers with voting
Getting ready
How to do it...
How it works...
See also
Classifying with multiple binary classifiers
Getting ready
How to do it...
How it works...
There's more...
See also
Training a classifier with NLTK-Trainer
How to do it...
How it works...
There's more...
Saving a pickled classifier
Using different training instances
The most informative features
The Maxent and LogisticRegression classifiers
SVMs
Combining classifiers
High information words and bigrams
Cross-fold validation
Analyzing a classifier
See also
8. Distributed Processing and Handling Large Datasets
Introduction
Distributed tagging with execnet
Getting ready
How to do it...
How it works...
There's more...
Creating multiple channels
Local versus remote gateways
See also
Distributed chunking with execnet
Getting ready
How to do it...
How it works...
There's more...
Python subprocesses
See also
Parallel list processing with execnet
How to do it...
How it works...
There's more...
See also
Storing a frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing a conditional frequency distribution in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Storing an ordered dictionary in Redis
Getting ready
How to do it...
How it works...
There's more...
See also
Distributed word scoring with Redis and execnet
Getting ready
How to do it...
How it works...
There's more...
See also
9. Parsing Specific Data Types
Introduction
Parsing dates and times with dateutil
Getting ready
How to do it...
How it works...
There's more...
See also
Timezone lookup and conversion
Getting ready
How to do it...
How it works...
There's more...
Local timezone
Custom offsets
See also
Extracting URLs from HTML with lxml
Getting ready
How to do it...
How it works...
There's more...
Extracting links directly
Parsing HTML from URLs or files
Extracting links with XPaths
See also
Cleaning and stripping HTML
Getting ready
How to do it...
How it works...
There's more...
See also
Converting HTML entities with BeautifulSoup
Getting ready
How to do it...
How it works...
There's more...
Extracting URLs with BeautifulSoup
See also
Detecting and converting character encodings
Getting ready
How to do it...
How it works...
There's more...
Converting to ASCII
UnicodeDammit conversion
See also
A. Penn Treebank Part-of-speech Tags
3. Module 3
1. Working with Strings
Tokenization
Tokenization of text into sentences
Tokenization of text in other languages
Tokenization of sentences into words
Tokenization using TreebankWordTokenizer
Tokenization using regular expressions
Normalization
Eliminating punctuation
Conversion into lowercase and uppercase
Dealing with stop words
Calculate stopwords in English
Substituting and correcting tokens
Replacing words using regular expressions
Example of the replacement of a text with another text
Performing substitution before tokenization
Dealing with repeating characters
Example of deleting repeating characters
Replacing a word with its synonym
Example of substituting a word with its synonym
Applying Zipf's law to text
Similarity measures
Applying similarity measures using the edit distance algorithm
Applying similarity measures using Jaccard's Coefficient
Applying similarity measures using the Smith Waterman distance
Other string similarity metrics
Summary
2. Statistical Language Modeling
Understanding word frequency
Develop MLE for a given text
Hidden Markov Model estimation
Applying smoothing on the MLE model
Add-one smoothing
Good Turing
Kneser Ney estimation
Witten Bell estimation
Develop a back-off mechanism for MLE
Applying interpolation on data to get mix and match
Evaluate a language model through perplexity
Applying metropolis hastings in modeling languages
Applying Gibbs sampling in language processing
Summary
3. Morphology – Getting Our Feet Wet
Introducing morphology
Understanding stemmer
Understanding lemmatization
Developing a stemmer for non-English language
Morphological analyzer
Morphological generator
Search engine
Summary
4. Parts-of-Speech Tagging – Identifying Words
Introducing parts-of-speech tagging
Default tagging
Creating POS-tagged corpora
Selecting a machine learning algorithm
Statistical modeling involving the n-gram approach
Developing a chunker using pos-tagged corpora
Summary
5. Parsing – Analyzing Training Data
Introducing parsing
Treebank construction
Extracting Context Free Grammar (CFG) rules from Treebank
Creating a probabilistic Context Free Grammar from CFG
CYK chart parsing algorithm
Earley chart parsing algorithm
Summary
6. Semantic Analysis – Meaning Matters
Introducing semantic analysis
Introducing NER
A NER system using Hidden Markov Model
Training NER using Machine Learning Toolkits
NER using POS tagging
Generation of the synset id from Wordnet
Disambiguating senses using Wordnet
Summary
7. Sentiment Analysis – I Am Happy
Introducing sentiment analysis
Sentiment analysis using NER
Sentiment analysis using machine learning
Evaluation of the NER system
Summary
8. Information Retrieval – Accessing Information
Introducing information retrieval
Stop word removal
Information retrieval using a vector space model
Vector space scoring and query operator interaction
Developing an IR system using latent semantic indexing
Text summarization
Question-answering system
Summary
9. Discourse Analysis – Knowing Is Believing
Introducing discourse analysis
Discourse analysis using Centering Theory
Anaphora resolution
Summary
10. Evaluation of NLP Systems – Analyzing Performance
The need for evaluation of NLP systems
Evaluation of NLP tools (POS taggers, stemmers, and morphological analyzers)
Parser evaluation using gold data
Evaluation of IR system
Metrics for error identification
Metrics based on lexical matching
Metrics based on syntactic matching
Metrics using shallow semantic matching
Summary
B. Bibliography
Index

Content preview from Natural Language Processing: Python and NLTK

Rare word removal

Rare word removal is intuitive: highly distinctive words such as personal names, brand names, and product names, along with noise tokens such as HTML leftovers, often need to be removed for NLP tasks. For example, names make poor predictors in a text classification problem, even if they come out as statistically significant. We will talk about this further in subsequent chapters. We definitely don't want these noisy tokens in our data. Word length is another criterion: we also remove words that are very short or very long:

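>>> import nltk
>>> # tokens is a list of all tokens in the corpus
>>> freq_dist = nltk.FreqDist(tokens)
>>> # most_common() is sorted by descending count, so its tail holds the rarest words
>>> rarewords = [word for word, _ in freq_dist.most_common()[-50:]]
>>> after_rare_words ...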
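
The preview breaks off before the filtering step, so here is a minimal sketch of how rare-word and length-based removal might fit together. It is an illustration under stated assumptions, not the book's own code: the function name `remove_rare_words` and the parameters `n_rare`, `min_len`, and `max_len` are hypothetical, and it uses `collections.Counter` from the standard library (which `nltk.FreqDist` subclasses in NLTK 3) so that it runs without NLTK installed.

```python
from collections import Counter


def remove_rare_words(tokens, n_rare=50, min_len=2, max_len=20):
    """Drop the n_rare least frequent word types and tokens of extreme length.

    All names and default thresholds here are illustrative assumptions.
    """
    freq = Counter(tokens)
    # most_common() sorts by descending count, so the tail holds the rarest types.
    ranked = freq.most_common()
    # Guard n_rare == 0: ranked[-0:] would select the whole list.
    rare = {word for word, _ in ranked[-n_rare:]} if n_rare > 0 else set()
    return [
        tok for tok in tokens
        if tok not in rare and min_len <= len(tok) <= max_len
    ]


tokens = ["the", "cat", "the", "cat", "the", "zzyzx", "a"]
cleaned = remove_rare_words(tokens, n_rare=2)
print(cleaned)  # ['the', 'cat', 'the', 'cat', 'the']
```

Removing a fixed number of rare types is a blunt cutoff; on a real corpus a minimum-count threshold may be easier to reason about, since many words tie at a count of one.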
