Blueprints for Text Analytics Using Python

by Jens Albrecht, Sidharth Ramachandran, Christian Winkler

December 2020

Intermediate to advanced

422 pages

12h 7m

English

O'Reilly Media, Inc.

Preface
  Approach of the Book
  Prerequisites
  Some Important Libraries to Know
  Books to Read
  Conventions Used in This Book
  Using Code Examples
  O’Reilly Online Learning
  How to Contact Us
  Acknowledgments
1. Gaining Early Insights from Textual Data
  What You’ll Learn and What We’ll Build
  Exploratory Data Analysis
  Introducing the Dataset
  Blueprint: Getting an Overview of the Data with Pandas
  Calculating Summary Statistics for Columns
  Checking for Missing Data
  Plotting Value Distributions
  Comparing Value Distributions Across Categories
  Visualizing Developments Over Time
  Blueprint: Building a Simple Text Preprocessing Pipeline
  Performing Tokenization with Regular Expressions
  Treating Stop Words
  Processing a Pipeline with One Line of Code
  Blueprints for Word Frequency Analysis
  Blueprint: Counting Words with a Counter
  Blueprint: Creating a Frequency Diagram
  Blueprint: Creating Word Clouds
  Blueprint: Ranking with TF-IDF
  Blueprint: Finding a Keyword-in-Context
  Blueprint: Analyzing N-Grams
  Blueprint: Comparing Frequencies Across Time Intervals and Categories
  Creating Frequency Timelines
  Creating Frequency Heatmaps
  Closing Remarks
2. Extracting Textual Insights with APIs
  What You’ll Learn and What We’ll Build
  Application Programming Interfaces
  Blueprint: Extracting Data from an API Using the Requests Module
  Pagination
  Rate Limiting
  Blueprint: Extracting Twitter Data with Tweepy
  Obtaining Credentials
  Installing and Configuring Tweepy
  Extracting Data from the Search API
  Extracting Data from a User’s Timeline
  Extracting Data from the Streaming API
  Closing Remarks
3. Scraping Websites and Extracting Data
  What You’ll Learn and What We’ll Build
  Scraping and Data Extraction
  Introducing the Reuters News Archive
  URL Generation
  Blueprint: Downloading and Interpreting robots.txt
  Blueprint: Finding URLs from sitemap.xml
  Blueprint: Finding URLs from RSS
  Downloading Data
  Blueprint: Downloading HTML Pages with Python
  Blueprint: Downloading HTML Pages with wget
  Extracting Semistructured Data
  Blueprint: Extracting Data with Regular Expressions
  Blueprint: Using an HTML Parser for Extraction
  Blueprint: Spidering
  Introducing the Use Case
  Error Handling and Production-Quality Software
  Density-Based Text Extraction
  Extracting Reuters Content with Readability
  Summary Density-Based Text Extraction
  All-in-One Approach
  Blueprint: Scraping the Reuters Archive with Scrapy
  Possible Problems with Scraping
  Closing Remarks and Recommendation
4. Preparing Textual Data for Statistics and Machine Learning
  What You’ll Learn and What We’ll Build
  A Data Preprocessing Pipeline
  Introducing the Dataset: Reddit Self-Posts
  Loading Data Into Pandas
  Blueprint: Standardizing Attribute Names
  Saving and Loading a DataFrame
  Cleaning Text Data
  Blueprint: Identify Noise with Regular Expressions
  Blueprint: Removing Noise with Regular Expressions
  Blueprint: Character Normalization with textacy
  Blueprint: Pattern-Based Data Masking with textacy
  Tokenization
  Blueprint: Tokenization with Regular Expressions
  Tokenization with NLTK
  Recommendations for Tokenization
  Linguistic Processing with spaCy
  Instantiating a Pipeline
  Processing Text
  Blueprint: Customizing Tokenization
  Blueprint: Working with Stop Words
  Blueprint: Extracting Lemmas Based on Part of Speech
  Blueprint: Extracting Noun Phrases
  Blueprint: Extracting Named Entities
  Feature Extraction on a Large Dataset
  Blueprint: Creating One Function to Get It All
  Blueprint: Using spaCy on a Large Dataset
  Persisting the Result
  A Note on Execution Time
  There Is More
  Language Detection
  Spell-Checking
  Token Normalization
  Closing Remarks and Recommendations
5. Feature Engineering and Syntactic Similarity
  What You’ll Learn and What We’ll Build
  A Toy Dataset for Experimentation
  Blueprint: Building Your Own Vectorizer
  Enumerating the Vocabulary
  Vectorizing Documents
  The Document-Term Matrix
  The Similarity Matrix
  Bag-of-Words Models
  Blueprint: Using scikit-learn’s CountVectorizer
  Blueprint: Calculating Similarities
  TF-IDF Models
  Optimized Document Vectors with TfidfTransformer
  Introducing the ABC Dataset
  Blueprint: Reducing Feature Dimensions
  Blueprint: Improving Features by Making Them More Specific
  Blueprint: Using Lemmas Instead of Words for Vectorizing Documents
  Blueprint: Limit Word Types
  Blueprint: Remove Most Common Words
  Blueprint: Adding Context via N-Grams
  Syntactic Similarity in the ABC Dataset
  Blueprint: Finding Most Similar Headlines to a Made-up Headline
  Blueprint: Finding the Two Most Similar Documents in a Large Corpus (Much More Difficult)
  Blueprint: Finding Related Words
  Tips for Long-Running Programs like Syntactic Similarity
  Summary and Conclusion
6. Text Classification Algorithms
  What You’ll Learn and What We’ll Build
  Introducing the Java Development Tools Bug Dataset
  Blueprint: Building a Text Classification System
  Step 1: Data Preparation
  Step 2: Train-Test Split
  Step 3: Training the Machine Learning Model
  Step 4: Model Evaluation
  Final Blueprint for Text Classification
  Blueprint: Using Cross-Validation to Estimate Realistic Accuracy Metrics
  Blueprint: Performing Hyperparameter Tuning with Grid Search
  Blueprint Recap and Conclusion
  Closing Remarks
  Further Reading
7. How to Explain a Text Classifier
  What You’ll Learn and What We’ll Build
  Blueprint: Determining Classification Confidence Using Prediction Probability
  Blueprint: Measuring Feature Importance of Predictive Models
  Blueprint: Using LIME to Explain the Classification Results
  Blueprint: Using ELI5 to Explain the Classification Results
  Blueprint: Using Anchor to Explain the Classification Results
  Using the Distribution with Masked Words
  Working with Real Words
  Closing Remarks
8. Unsupervised Methods: Topic Modeling and Clustering
  What You’ll Learn and What We’ll Build
  Our Dataset: UN General Debates
  Checking Statistics of the Corpus
  Preparations
  Nonnegative Matrix Factorization (NMF)
  Blueprint: Creating a Topic Model Using NMF for Documents
  Blueprint: Creating a Topic Model for Paragraphs Using NMF
  Latent Semantic Analysis/Indexing
  Blueprint: Creating a Topic Model for Paragraphs with SVD
  Latent Dirichlet Allocation
  Blueprint: Creating a Topic Model for Paragraphs with LDA
  Blueprint: Visualizing LDA Results
  Blueprint: Using Word Clouds to Display and Compare Topic Models
  Blueprint: Calculating Topic Distribution of Documents and Time Evolution
  Using Gensim for Topic Modeling
  Blueprint: Preparing Data for Gensim
  Blueprint: Performing Nonnegative Matrix Factorization with Gensim
  Blueprint: Using LDA with Gensim
  Blueprint: Calculating Coherence Scores
  Blueprint: Finding the Optimal Number of Topics
  Blueprint: Creating a Hierarchical Dirichlet Process with Gensim
  Blueprint: Using Clustering to Uncover the Structure of Text Data
  Further Ideas
  Summary and Recommendation
  Conclusion
9. Text Summarization
  What You’ll Learn and What We’ll Build
  Text Summarization
  Extractive Methods
  Data Preprocessing
  Blueprint: Summarizing Text Using Topic Representation
  Identifying Important Words with TF-IDF Values
  LSA Algorithm
  Blueprint: Summarizing Text Using an Indicator Representation
  Measuring the Performance of Text Summarization Methods
  Blueprint: Summarizing Text Using Machine Learning
  Step 1: Creating Target Labels
  Step 2: Adding Features to Assist Model Prediction
  Step 3: Build a Machine Learning Model
  Closing Remarks
  Further Reading

10. Exploring Semantic Relationships with Word Embeddings
  What You’ll Learn and What We’ll Build
  The Case for Semantic Embeddings
  Word Embeddings
  Analogy Reasoning with Word Embeddings
  Types of Embeddings
  Blueprint: Using Similarity Queries on Pretrained Models
  Loading a Pretrained Model
  Similarity Queries
  Blueprints for Training and Evaluating Your Own Embeddings
  Data Preparation
  Blueprint: Training Models with Gensim
  Blueprint: Evaluating Different Models
  Blueprints for Visualizing Embeddings
  Blueprint: Applying Dimensionality Reduction
  Blueprint: Using the TensorFlow Embedding Projector
  Blueprint: Constructing a Similarity Tree
  Closing Remarks
  Further Reading
11. Performing Sentiment Analysis on Text Data
  What You’ll Learn and What We’ll Build
  Sentiment Analysis
  Introducing the Amazon Customer Reviews Dataset
  Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches
  Bing Liu Lexicon
  Disadvantages of a Lexicon-Based Approach
  Supervised Learning Approaches
  Preparing Data for a Supervised Learning Approach
  Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm
  Step 1: Data Preparation
  Step 2: Train-Test Split
  Step 3: Text Vectorization
  Step 4: Training the Machine Learning Model
  Pretrained Language Models Using Deep Learning
  Deep Learning and Transfer Learning
  Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model
  Step 1: Loading Models and Tokenization
  Step 2: Model Training
  Step 3: Model Evaluation
  Closing Remarks
  Further Reading
12. Building a Knowledge Graph
  What You’ll Learn and What We’ll Build
  Knowledge Graphs
  Information Extraction
  Introducing the Dataset
  Named-Entity Recognition
  Blueprint: Using Rule-Based Named-Entity Recognition
  Blueprint: Normalizing Named Entities
  Merging Entity Tokens
  Coreference Resolution
  Blueprint: Using spaCy’s Token Extensions
  Blueprint: Performing Alias Resolution
  Blueprint: Resolving Name Variations
  Blueprint: Performing Anaphora Resolution with NeuralCoref
  Name Normalization
  Entity Linking
  Blueprint: Creating a Co-Occurrence Graph
  Extracting Co-Occurrences from a Document
  Visualizing the Graph with Gephi
  Relation Extraction
  Blueprint: Extracting Relations Using Phrase Matching
  Blueprint: Extracting Relations Using Dependency Trees
  Creating the Knowledge Graph
  Don’t Blindly Trust the Results
  Closing Remarks
  Further Reading
13. Using Text Analytics in Production
  What You’ll Learn and What We’ll Build
  Blueprint: Using Conda to Create Reproducible Python Environments
  Blueprint: Using Containers to Create Reproducible Environments
  Blueprint: Creating a REST API for Your Text Analytics Model
  Blueprint: Deploying and Scaling Your API Using a Cloud Provider
  Blueprint: Automatically Versioning and Deploying Builds
  Closing Remarks
  Further Reading
Index

Content preview from Blueprints for Text Analytics Using Python

Chapter 3. Scraping Websites and Extracting Data

You will often visit a website and find its content interesting. If there are only a few pages, you can read everything yourself. But once there is a considerable amount of content, reading it all yourself is no longer feasible.

To use the powerful text analytics blueprints described in this book, you have to acquire the content first. Most websites won’t have a “download all content” button, so we have to find a clever way to download (“scrape”) the pages.

Usually we are mainly interested in the content part of each individual web page, less so in navigation, etc. As soon as we have the data locally available, we can use powerful extraction techniques to dissect the pages into elements such as title, content, and also some meta-information (publication date, author, and so on).
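As a rough illustration of dissecting a page into title, content, and meta-information, here is a minimal sketch that uses only Python's standard-library `html.parser`. The class `PageExtractor`, the function `extract_page`, and the sample HTML are hypothetical names and data chosen for this example. They are not taken from the book, which may use different tools.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects <title>, <meta> pairs, and <p> text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self.paragraphs = []
        self._in_title = False
        self._in_paragraph = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and "content" in attrs:
            # Meta tags are keyed by either a "name" or a "property" attribute.
            key = attrs.get("name") or attrs.get("property")
            if key:
                self.meta[key] = attrs["content"]
        elif tag == "p":
            self._in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p":
            self._in_paragraph = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_paragraph:
            # Text inside nested inline tags (e.g. <b>) is appended as well.
            self.paragraphs[-1] += data


def extract_page(html):
    """Return title, meta-information, and paragraph text of an HTML string."""
    parser = PageExtractor()
    parser.feed(html)
    return {
        "title": parser.title.strip(),
        "meta": parser.meta,
        "content": "\n".join(p.strip() for p in parser.paragraphs if p.strip()),
    }


# Made-up sample page; navigation text is ignored because it is not in a <p>.
sample_html = """
<html><head>
<title>Example headline</title>
<meta name="author" content="Jane Doe">
<meta name="date" content="2020-01-01">
</head><body>
<nav><a href="/">Home</a></nav>
<p>First paragraph.</p>
<p>Second <b>paragraph</b>.</p>
</body></html>
"""

page = extract_page(sample_html)
print(page["title"])
print(page["meta"])
print(page["content"])
```

Treating only `<p>` text as content is a simplifying assumption. Real pages usually need site-specific selectors or a more robust extraction approach.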

What You’ll Learn and What We’ll Build

In this chapter, we will show you how to acquire HTML data from websites and use powerful tools to extract the content from these HTML files. We will show this with content from one specific data source, the Reuters news archive.

In the first step, we will download single HTML files and extract data from each one with different methods.
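Downloading a single HTML file could look roughly like the following sketch, which relies only on the standard-library `urllib.request`. The function name `fetch_html`, the `User-Agent` string, and the timeout value are assumptions made for illustration and are not the book's own code.

```python
from urllib.request import Request, urlopen


def fetch_html(url, timeout=10):
    """Download one page and return its body as text (hypothetical helper)."""
    # The User-Agent value is an arbitrary placeholder for this sketch.
    request = Request(url, headers={"User-Agent": "example-scraper/0.1"})
    with urlopen(request, timeout=timeout) as response:
        body = response.read()
        # Fall back to UTF-8 when the response declares no charset.
        charset = response.headers.get_content_charset() or "utf-8"
    return body.decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com/")` would return that page's HTML as a string. Before downloading many pages from a site, check whether its robots.txt and terms of use permit this.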

Normally, you will not be interested in single pages. Therefore, we will build a blueprint solution. We will download and analyze a news archive page (which contains links to all articles). After completing this, we know the URLs of the referred ...
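The idea of collecting article URLs from an archive page can be sketched as follows, again with the standard library only. `LinkCollector`, `article_urls`, the `/article/` path filter, and the `example.com` sample are hypothetical. The actual structure of the Reuters archive is not shown in this preview.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers unique absolute URLs from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links against the archive page's own URL.
            url = urljoin(self.base_url, href)
            if url not in self.urls:
                self.urls.append(url)


def article_urls(archive_html, base_url, must_contain="/article/"):
    """Return linked URLs whose path looks like an article (assumed pattern)."""
    collector = LinkCollector(base_url)
    collector.feed(archive_html)
    return [url for url in collector.urls if must_contain in url]


# Made-up archive page with a duplicate link and a non-article link.
archive_html = """
<ul>
  <li><a href="/article/one">One</a></li>
  <li><a href="/article/two">Two</a></li>
  <li><a href="/about">About</a></li>
  <li><a href="/article/one">One again</a></li>
</ul>
"""

urls = article_urls(archive_html, "https://example.com/archive/")
print(urls)
```

The resulting list preserves page order and drops duplicates, so each URL can then be downloaded once.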

Publisher Resources

ISBN: 9781492074076
