Learning Path: Get Started with Natural Language Processing Using Python, Spark, and Scala

Video description

Whether you’re a programmer with little to no knowledge of Python, or an experienced data scientist or engineer, this Learning Path will walk you through natural language processing, using both Python and Scala, and show you how to implement a range of popular tools including Spark, scikit-learn, spaCy, NLTK, and gensim for text mining.

You’ll learn the most common techniques for processing text, how to use machine learning to generate annotators and apply them within a data pipeline, and the differences between NLP pipelines and other approaches to semantic text mining. You’ll learn about standard UIMA annotators, custom annotators, and machine-learned annotators, and understand how architectures for text processing pipelines can incorporate some of the most popular big data tools such as Kafka, Spark, SparkSQL, Cassandra, and ElasticSearch.

By the end of the Learning Path, you will be able to build a natural language processing and entity extraction pipeline, and will have a complete understanding of the capabilities and limitations of natural language text processing.

Materials or downloads needed in advance: Example files

Table of contents

  1. Introduction
    1. Course Introduction
    2. About The Author
  2. Getting Started: Basic String Processing In Python
    1. String Operations
    2. Working With Unicode
  3. Converting Text To Symbols: Tokenization In NLTK and spaCy
    1. Splitting Documents
    2. Splitting Sentences
    3. Filtering Stop Words
  4. Going Subsymbolic: Vector Representations
    1. tf-idf Gensim
    2. Word Vectors
    3. Google Word Vectors
    4. Learn Word Vectors
  5. Finding The Structure Of Text: Parsing In spaCy
    1. Dependency Parsing
    2. Sentence Head
    3. Named Entities
  6. Determining How The Writer Feels: Sentiment Analysis In VADER
    1. Sentiment Analysis Intro
    2. Sentiment In VADER
  7. Making Decisions: Text Classification
    1. Text Classification Intro
    2. Classification With TextBlob
    3. Classification With scikit-learn
  8. Identifying Discussed Topics: LDA In Gensim
    1. LDA Introduction
    2. LDA Gensim
    3. LDA pyLDAvis
  9. Toward Machine Reading: Entity Extraction And Linking
    1. Entity Linking
    2. pyspotlight
    3. FRED
  10. Conclusion
    1. Conclusion
  11. Part 1: Introduction
    1. Welcome to the Course
    2. Natural Language Understanding in Examples
  12. Part 2: NLP Pipelines
    1. Building an NLP Pipeline
  13. Part 3: Annotators
    1. Commonly Used Annotators
    2. Detecting Positive, Negative, Speculative Polarity
    3. Machine Learned Annotators
  14. Part 4: Custom Annotators
    1. NLP Pipelines are Domain Specific
    2. Unified Medical Language System (UMLS)
    3. Coding Custom Annotators
  15. Part 5: Machine Learned Annotators
    1. Training Using Machine Learned Annotators
  16. Part 6: Ontology Enrichment
    1. The Need for Learned and Updated Ontologies
    2. Learning New Medical Concepts and Relationships
  17. Part 7: Architecture
    1. An End-to-End Reference Architecture
    2. Spark, SparkSQL, Cassandra Workflow
    3. ElasticSearch SparkSQL
  18. Part 8: Parting Advice
    1. Language is Source and Domain-Specific
    2. Welcome to the Course
  19. Part 1: Building a natural language processing and entity extraction pipeline on Scala Spark
    1. Notebook 1: Introduction
    2. Annotation Library
    3. Basic Annotators
    4. Vocabulary Analysis
    5. Exercise: Building a stopword annotator
  20. Part 2: Machine Learning Applications for Statistical Natural Language Understanding at Scale
    1. Notebook 2: Introduction
    2. Model-based Annotators
    3. Creating a Binary Classifier
    4. Exercise: Predicting score or popularity
  21. Part 3: Topic Modeling on Natural Language with Scala, Spark and MLLib
    1. Notebook 3: Introduction
    2. K-Means clustering
    3. LDA topic modeling
    4. Exercise: Using topics for score or popularity prediction
  22. Part 4: Deep Learning Applications for Natural Language Understanding with Scala, Spark and MLLib
    1. Notebook 4: Introduction
    2. Word2Vec
    3. Expanding genre entity lists
    4. Exercise: Using Word2Vec based features for score or popularity prediction

Product information

  • Title: Learning Path: Get Started with Natural Language Processing Using Python, Spark, and Scala
  • Author(s): O'Reilly Media, Inc.
  • Release date: March 2017
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491985847