Chapter 11. Word Embeddings

Word embeddings belong to distributional semantics, like the topic models we discussed in the previous chapter. Unlike topic models, however, word embeddings are not built from term-document relationships. Instead, they are learned from smaller contexts, such as sentences or subsequences of tokens within a sentence.

Word embeddings are a rapidly evolving family of techniques. The most popular technique, Word2vec, was developed in 2013 by Tomas Mikolov et al. at Google. Since then, there has been much research (and hype). The idea is to use a neural network to build a language model. Once this model has been learned, you can take some of the intermediate values in the network as representations of the input term.
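Before digging into the details, a quick sketch of what this looks like in practice may help. The example below is our own illustration, not code from this chapter: it uses Spark MLlib's Word2Vec estimator on a toy DataFrame of token sequences, and it assumes an existing SparkSession named spark; the column names and the tiny corpus are likewise hypothetical.

```python
# A minimal sketch of learning word vectors with Spark MLlib's Word2Vec,
# assuming a SparkSession named `spark` already exists.
from pyspark.ml.feature import Word2Vec

# Each row is one "sentence": a small context of tokens.
sentences = spark.createDataFrame([
    ("the quick brown fox jumps over the lazy dog".split(" "),),
    ("the cat sat on the mat".split(" "),),
], ["tokens"])

word2vec = Word2Vec(
    vectorSize=50,   # dimensionality of the learned embeddings
    minCount=1,      # keep every token in this tiny example
    inputCol="tokens",
    outputCol="embedding",
)
model = word2vec.fit(sentences)

# The learned vectors are the intermediate values we keep as word representations.
model.getVectors().show(truncate=False)

# We can also ask which terms have nearby vectors.
model.findSynonyms("cat", 2).show()
```

The vectors returned by getVectors() are exactly the kind of intermediate representation described above: they come from the model's internal weights rather than from any document-level counts.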

In this chapter, we will look at the implementation of Word2vec in code. This will give us a clear understanding of the fundamentals of this family of techniques. We will discuss the more recent approaches at a higher level because they can be quite resource-intensive.

Word2vec

One of the ideas behind deep learning is that the hidden layers are “higher level” representations of the data. This idea comes from analysis of the visual cortex. As information travels from the eye through the brain, successive neurons appear to respond to increasingly complex shapes. The earliest neurons recognize only points of light and dark, later neurons recognize lines and curves, and so on. Under this assumption, if we train a language model using a neural network, its hidden layers should learn higher-level representations of the words, and we can take those intermediate values as our word embeddings.
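To make this concrete, here is a toy sketch, written in plain NumPy and entirely our own illustration rather than the chapter's implementation, of a skip-gram-style network: it learns to predict a context word from a center word, and the rows of the input weight matrix end up serving as the word vectors.

```python
# Toy skip-gram-style training loop (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
corpus = "the quick brown fox jumps over the lazy dog".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, H = len(vocab), 10          # vocabulary size, hidden (embedding) size

W_in = rng.normal(scale=0.1, size=(V, H))    # input->hidden weights (the embeddings)
W_out = rng.normal(scale=0.1, size=(H, V))   # hidden->output weights

# (center, context) pairs within a window of 1
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

lr = 0.05
for _ in range(500):
    for center, context in pairs:
        h = W_in[center]                      # hidden layer: just a row lookup
        scores = h @ W_out
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                  # softmax over the vocabulary

        grad_scores = probs.copy()
        grad_scores[context] -= 1.0           # gradient of cross-entropy loss
        grad_h = W_out @ grad_scores
        W_out -= lr * np.outer(h, grad_scores)
        W_in[center] -= lr * grad_h

# The row of W_in for a word is its learned embedding.
print(W_in[idx["fox"]])
```

After training, the network's output layer is discarded; only the hidden-layer weights, one row per vocabulary word, are kept as the embeddings.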
