
The Natural Language Processing Workshop

Name: The Natural Language Processing Workshop
ISBN: 9781800208421

by Rohan Chopra, Aniruddha M. Godbole, Nipun Sadvilkar, Muzaffar Bashir Shah, Sohom Ghosh, Dwight Gunning, Ankit Bhatia, Nagendra Nagaraj, John Bura, Sumit Kumar Raj, Tom Taulli, Ankit Verma

August 2020

Beginner to intermediate

452 pages

7h 42m

English

Packt Publishing


The Natural Language Processing Workshop
Preface
About the Book; Audience; About the Chapters; Conventions; Code Presentation; Setting up Your Environment; Installation and Setup; Installing the Required Libraries; Installing Libraries; Accessing the Code Files
1. Introduction to Natural Language Processing
Introduction; History of NLP; Text Analytics and NLP; Exercise 1.01: Basic Text Analytics; Various Steps in NLP; Tokenization; Exercise 1.02: Tokenization of a Simple Sentence; PoS Tagging; Exercise 1.03: PoS Tagging; Stop Word Removal; Exercise 1.04: Stop Word Removal; Text Normalization; Exercise 1.05: Text Normalization; Spelling Correction; Exercise 1.06: Spelling Correction of a Word and a Sentence; Stemming; Exercise 1.07: Using Stemming; Lemmatization; Exercise 1.08: Extracting the Base Word Using Lemmatization; Named Entity Recognition (NER); Exercise 1.09: Treating Named Entities; Word Sense Disambiguation; Exercise 1.10: Word Sense Disambiguation; Sentence Boundary Detection; Exercise 1.11: Sentence Boundary Detection; Activity 1.01: Preprocessing of Raw Text; Kick Starting an NLP Project; Data Collection; Data Preprocessing; Feature Extraction; Model Development; Model Assessment; Model Deployment; Summary
2. Feature Extraction Methods
Introduction; Types of Data; Categorizing Data Based on Structure; Categorizing Data Based on Content; Cleaning Text Data; Tokenization; Exercise 2.01: Text Cleaning and Tokenization; Exercise 2.02: Extracting n-grams; Exercise 2.03: Tokenizing Text with Keras and TextBlob; Types of Tokenizers; Exercise 2.04: Tokenizing Text Using Various Tokenizers; Stemming; RegexpStemmer; Exercise 2.05: Converting Words in the Present Continuous Tense into Base Words with RegexpStemmer; The Porter Stemmer; Exercise 2.06: Using the Porter Stemmer; Lemmatization; Exercise 2.07: Performing Lemmatization; Exercise 2.08: Singularizing and Pluralizing Words; Language Translation; Exercise 2.09: Language Translation; Stop-Word Removal; Exercise 2.10: Removing Stop Words from Text; Activity 2.01: Extracting Top Keywords from the News Article; Feature Extraction from Texts; Extracting General Features from Raw Text; Exercise 2.11: Extracting General Features from Raw Text; Exercise 2.12: Extracting General Features from Text; Bag of Words (BoW); Exercise 2.13: Creating a Bag of Words; Zipf's Law; Exercise 2.14: Zipf's Law; Term Frequency–Inverse Document Frequency (TFIDF); Exercise 2.15: TFIDF Representation; Finding Text Similarity – Application of Feature Extraction; Exercise 2.16: Calculating Text Similarity Using Jaccard and Cosine Similarity; Word Sense Disambiguation Using the Lesk Algorithm; Exercise 2.17: Implementing the Lesk Algorithm Using String Similarity and Text Vectorization; Word Clouds; Exercise 2.18: Generating Word Clouds; Other Visualizations; Exercise 2.19: Other Visualizations Dependency Parse Trees and Named Entities; Activity 2.02: Text Visualization; Summary
3. Developing a Text Classifier
Introduction; Machine Learning; Unsupervised Learning; Hierarchical Clustering; Exercise 3.01: Performing Hierarchical Clustering; k-means Clustering; Exercise 3.02: Implementing k-means Clustering; Supervised Learning; Classification; Logistic Regression; Exercise 3.03: Text Classification – Logistic Regression; Naive Bayes Classifiers; Exercise 3.04: Text Classification – Naive Bayes; k-nearest Neighbors; Exercise 3.05: Text Classification Using the k-nearest Neighbors Method; Regression; Linear Regression; Exercise 3.06: Regression Analysis Using Textual Data; Tree Methods; Exercise 3.07: Tree-Based Methods – Decision Tree; Random Forest; Gradient Boosting Machine and Extreme Gradient Boost; Exercise 3.08: Tree-Based Methods – Random Forest; Exercise 3.09: Tree-Based Methods – XGBoost; Sampling; Exercise 3.10: Sampling (Simple Random, Stratified, and Multi-Stage); Developing a Text Classifier; Feature Extraction; Feature Engineering; Removing Correlated Features; Exercise 3.11: Removing Highly Correlated Features (Tokens); Dimensionality Reduction; Exercise 3.12: Performing Dimensionality Reduction Using Principal Component Analysis; Deciding on a Model Type; Evaluating the Performance of a Model; Exercise 3.13: Calculating the RMSE and MAPE of a Dataset; Activity 3.01: Developing End-to-End Text Classifiers; Building Pipelines for NLP Projects; Exercise 3.14: Building the Pipeline for an NLP Project; Saving and Loading Models; Exercise 3.15: Saving and Loading Models; Summary
4. Collecting Text Data with Web Scraping and APIs
Introduction; Collecting Data by Scraping Web Pages; Exercise 4.01: Extraction of Tag-Based Information from HTML Files; Requesting Content from Web Pages; Exercise 4.02: Collecting Online Text Data; Exercise 4.03: Analyzing the Content of Jupyter Notebooks (in HTML Format); Activity 4.01: Extracting Information from an Online HTML Page; Activity 4.02: Extracting and Analyzing Data Using Regular Expressions; Dealing with Semi-Structured Data; JSON; Exercise 4.04: Working with JSON Files; XML; Exercise 4.05: Working with an XML File; Using APIs to Retrieve Real-Time Data; Exercise 4.06: Collecting Data Using APIs; Extracting data from Twitter Using the OAuth API; Activity 4.03: Extracting Data from Twitter; Summary
5. Topic Modeling
Introduction; Topic Discovery; Exploratory Data Analysis; Transforming Unstructured Data to Structured Data; Bag of Words; Topic-Modeling Algorithms; Latent Semantic Analysis (LSA); LSA – How It Works; Key Input Parameters for LSA Topic Modeling; Exercise 5.01: Analyzing Wikipedia World Cup Articles with Latent Semantic Analysis; Dirichlet Process and Dirichlet Distribution; Latent Dirichlet Allocation (LDA); LDA – How It Works; Measuring the Predictive Power of a Generative Topic Model; Exercise 5.02: Finding Topics in Canadian Open Data Inventory Using the LDA Model; Activity 5.01: Topic-Modeling Jeopardy Questions; Hierarchical Dirichlet Process (HDP); Exercise 5.03: Topics in Around the World in Eighty Days; Exercise 5.04: Topics in The Life and Adventures of Robinson Crusoe by Daniel Defoe; Practical Challenges; State-of-the-Art Topic Modeling; Activity 5.02: Comparing Different Topic Models; Summary
6. Vector Representation
Introduction; What Is a Vector?; Frequency-Based Embeddings; Exercise 6.01: Word-Level One-Hot Encoding; Character-Level One-Hot Encoding; Exercise 6.02: Character One-Hot Encoding – Manual; Exercise 6.03: Character-Level One-Hot Encoding with Keras; Learned Word Embeddings; Word2Vec; Exercise 6.04: Training Word Vectors; Using Pre-Trained Word Vectors; Exercise 6.05: Using Pre-Trained Word Vectors; Document Vectors; Uses of Document Vectors; Exercise 6.06: Converting News Headlines to Document Vectors; Activity 6.01: Finding Similar News Article Using Document Vectors; Summary
7. Text Generation and Summarization
Introduction; Generating Text with Markov Chains; Markov Chains; Exercise 7.01: Text Generation Using a Random Walk over a Markov Chain; Text Summarization; TextRank; Key Input Parameters for TextRank; Exercise 7.02: Performing Summarization Using TextRank; Exercise 7.03: Summarizing a Children's Fairy Tale Using TextRank; Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset; Recent Developments in Text Generation and Summarization; Practical Challenges in Extractive Summarization; Summary
8. Sentiment Analysis
Introduction; Why Is Sentiment Analysis Required?; The Growth of Sentiment Analysis; The Monetization of Emotion; Types of Sentiments; Emotion; Key Ideas and Terms; Applications of Sentiment Analysis; Tools Used for Sentiment Analysis; NLP Services from Major Cloud Providers; Online Marketplaces; Python NLP Libraries; Deep Learning Frameworks; The textblob library; Exercise 8.01: Basic Sentiment Analysis Using the textblob Library; Activity 8.01: Tweet Sentiment Analysis Using the textblob library; Understanding Data for Sentiment Analysis; Exercise 8.02: Loading Data for Sentiment Analysis; Training Sentiment Models; Activity 8.02: Training a Sentiment Model Using TFIDF and Logistic Regression; Summary

Appendix
1. Introduction to Natural Language Processing
Activity 1.01: Preprocessing of Raw Text
2. Feature Extraction Methods
Activity 2.01: Extracting Top Keywords from the News Article; Activity 2.02: Text Visualization
3. Developing a Text Classifier
Activity 3.01: Developing End-to-End Text Classifiers
4. Collecting Text Data with Web Scraping and APIs
Activity 4.01: Extracting Information from an Online HTML Page; Activity 4.02: Extracting and Analyzing Data Using Regular Expressions; Activity 4.03: Extracting Data from Twitter
5. Topic Modeling
Activity 5.01: Topic-Modeling Jeopardy Questions; Activity 5.02: Comparing Different Topic Models
6. Vector Representation
Activity 6.01: Finding Similar News Article Using Document Vectors
7. Text Generation and Summarization
Activity 7.01: Summarizing Complaints in the Consumer Financial Protection Bureau Dataset
8. Sentiment Analysis
Activity 8.01: Tweet Sentiment Analysis Using the textblob library; Activity 8.02: Training a Sentiment Model Using TFIDF and Logistic Regression

Content preview from The Natural Language Processing Workshop

4. Collecting Text Data with Web Scraping and APIs

Overview

This chapter introduces you to the concept of web scraping. You will first learn how to extract data (such as text, images, lists, and tables) from pages written in HTML. You will then learn about the semi-structured data formats commonly used on the web (such as JSON and XML) and how to extract data from them. Finally, you will use APIs to extract data from Twitter with the tweepy package.
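As a rough illustration of the three kinds of extraction named above, the following minimal sketch pulls list items out of an HTML string and reads a field from JSON and XML strings. It assumes only the Python standard library, which is not necessarily the toolset the chapter itself adopts. The sample strings and the names `ListItemParser` and `extract_list_items` are hypothetical and exist only for this example.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class ListItemParser(HTMLParser):
    """Collect the text found inside every <li> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_item = False
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._in_item = True
            self.items.append("")  # start a new, empty item

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item:
            self.items[-1] += data  # append text to the current item


def extract_list_items(html):
    """Return the stripped text of each <li> element in an HTML string."""
    parser = ListItemParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


# Hypothetical sample data, made up for this sketch.
SAMPLE_HTML = "<html><body><ul><li>Tokenization</li><li>Stemming</li></ul></body></html>"
SAMPLE_JSON = '{"book": {"title": "NLP Workshop", "chapters": 8}}'
SAMPLE_XML = "<book><title>NLP Workshop</title><chapters>8</chapters></book>"

html_items = extract_list_items(SAMPLE_HTML)               # HTML: tag-based extraction
json_title = json.loads(SAMPLE_JSON)["book"]["title"]      # JSON: nested key lookup
xml_title = ET.fromstring(SAMPLE_XML).findtext("title")    # XML: child element text

print(html_items)
print(json_title, xml_title)
```

For real pages, the HTML string would come from an HTTP request rather than a literal, and a dedicated parsing library may be more convenient than a hand-written `HTMLParser` subclass.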

Introduction

In the last chapter, we developed a simple classifier using feature extraction methods. We also covered different algorithms that fall under supervised and unsupervised learning. In this chapter, you will learn how to collect text data by scraping web pages, ...


