Hands-on NLP with NLTK and Scikit-learn

Video description

There is an overflow of text data online nowadays. As a Python developer, you need to create a new solution using Natural Language Processing for your next project. Your colleagues depend on you to monetize gigabytes of unstructured text data. What do you do?

Hands-on NLP with NLTK and scikit-learn is the answer. This course puts you right on the spot, starting off with building a spam classifier in our first video. At the end of the course, you are going to walk away with three NLP applications: a spam filter, a topic classifier, and a sentiment analyzer. There is no need for fancy mathematical theory, just plain English explanations of core NLP concepts and how to apply those using Python libraries.

Taking this course will help you create new applications with Python and NLP. You will be able to build working solutions backed by machine learning and NLP models with ease.

What You Will Learn

  • Build end-to-end Natural Language Processing solutions, ranging from getting data for your model to presenting its results.
  • Understand core NLP concepts such as tokenization, stemming, and stop word removal.
  • Use open source libraries such as NLTK, scikit-learn, and spaCy to perform routine NLP tasks.
  • Classify emails as spam or not-spam using basic NLP techniques and simple machine learning models.
  • Assign documents to their relevant topics using techniques such as TF-IDF, SVMs, and LDA.
  • Apply common text data processing steps to increase the performance of your machine learning models.
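To give a feel for the preprocessing steps named in the list above (tokenization, lowercasing, and stop word removal), here is a minimal sketch that uses only the Python standard library. It is not taken from the course and does not use the NLTK or scikit-learn APIs; the function names and the tiny stop word list are hypothetical and chosen purely for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop word list for illustration only;
# libraries such as NLTK ship much fuller lists.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "is", "in", "you", "your"}


def tokenize(text):
    """Lowercase the text and split it into word tokens with a regular expression."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop tokens that appear in the stop word set, keeping the original order."""
    return [token for token in tokens if token not in stop_words]


def bag_of_words(text):
    """Count how often each remaining token occurs (a simple bag-of-words)."""
    return Counter(remove_stop_words(tokenize(text)))


if __name__ == "__main__":
    sample = "Claim your FREE prize now and win the free trip!"
    print(bag_of_words(sample))
```

Token counts like these are the kind of feature a simple spam classifier can be trained on; a real project would more likely rely on the tokenizers and vectorizers provided by the libraries the course covers.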


This course is for developers, data scientists, and programmers who want to learn about practical Natural Language Processing with Python in a hands-on way. Developers who have an upcoming project that needs NLP, or a pile of unstructured text data on their hands that they don't know what to do with, will find this course useful. Prior programming experience with Python is assumed, as is familiarity with machine learning terms such as supervised learning, regression, and classification. No prior Natural Language Processing or text mining experience is needed.

About The Author

Colibri Ltd: Colibri is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas like big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to make better sense of its data, and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

James Cross is a Big Data Engineer and certified AWS Solutions Architect with a passion for data-driven applications. He's spent the last 3-5 years helping his clients to design and implement huge-scale, streaming big data platforms, cloud-based analytics stacks, and serverless architectures.

He started his professional career in Investment Banking, working with well-established technologies such as Java and SQL Server, before moving into the Big Data space. Since then he has worked with a huge range of big data tools, including most of the Hadoop ecosystem, Spark, and many NoSQL technologies such as Cassandra, MongoDB, Redis, and DynamoDB. His recent focus has been on cloud technologies and how they can be applied to data analytics, culminating in his work as CTO at Scout Solutions and, more recently, with McKinsey.

James is an AWS certified solutions architect with several years' experience designing and implementing solutions on this cloud platform. As CTO of Scout Solutions Ltd, he built a fully serverless set of APIs and an analytics stack based around Lambda and Redshift.

Table of contents

  1. Chapter 1 : Working with Natural Language Data
    1. The Course Overview
    2. Use Python, NLTK, spaCy, and Scikit-learn to Build Your NLP Toolset
    3. Reading a Simple Natural Language File into Memory
    4. Split the Text into Individual Words with Regular Expression
    5. Converting Words into Lists of Lower Case Tokens
    6. Removing Uncommon Words and Stop Words
  2. Chapter 2 : Spam Classification with an Email Dataset
    1. Use an Open Source Dataset, and What Is the Enron Dataset
    2. Loading the Enron Dataset into Memory
    3. Tokenization, Lemmatization, and Stop Word Removal
    4. Bag-of-Words Feature Extraction Process with Scikit-learn
    5. Basic Spam Classification with NLTK's Naive Bayes
  3. Chapter 3 : Sentiment Analysis with a Movie Review Dataset
    1. Understanding the Origin and Features of the Movie Review Dataset
    2. Loading and Cleaning the Review Data
    3. Preprocessing the Dataset to Remove Unwanted Words and Characters
    4. Creating TF-IDF Weighted Natural Language Features
    5. Basic Sentiment Analysis with Logistic Regression Model
  4. Chapter 4 : Boosting the Performance of Your Models with N-grams
    1. Deep Dive into Raw Tokens from the Movie Reviews
    2. Advanced Cleaning of Tokens Using Python String Functions and Regex
    3. Creating N-gram Features Using Scikit-learn
    4. Experimenting with Advanced Scikit-learn Models Using the NLTK Wrapper
    5. Building a Voting Model with Scikit-learn
  5. Chapter 5 : Document Classification with a Newsgroup Dataset
    1. Understanding the Origin and Features of the 20 Newsgroups Dataset
    2. Loading the Newsgroup Data and Extracting Features
    3. Building a Document Classification Pipeline
    4. Creating a Performance Report of the Model on the Test Set
    5. Finding Optimal Hyper-parameters Using Grid Search
  6. Chapter 6 : Advanced Topic Modelling with TF-IDF, LSA, and SVMs
    1. Building a Text Preprocessing Pipeline with NLTK
    2. Creating Hashing Based Features from Natural Language
    3. Classify Documents into 20 Topics with LSA
    4. Document Classification with TF-IDF and SVMs

Product information

  • Title: Hands-on NLP with NLTK and Scikit-learn
  • Author(s): Colibri Ltd
  • Release date: July 2018
  • Publisher(s): Packt Publishing
  • ISBN: 9781789345612