Live online training

Natural Language Processing First Steps

Topic: Data
Thomas Kopinski

Much of the data we process today is unstructured text. A data scientist’s basic toolset should therefore include the ability to analyze and segment such heterogeneous datasets effectively, so that standard data science steps (data cleaning, preparation, visualization, and so on) can then be applied to mine them for information.

Join expert Thomas Kopinski to learn how to apply state-of-the-art data science methodology to a unique text corpus: news articles collected from the BBC, the British public service broadcaster. You’ll use Python and Jupyter notebooks along with established libraries to preprocess, clean, and analyze the dataset with the goal of revealing insights hidden in the data.

What you'll learn and how you can apply it

By the end of this live online course, you’ll understand:

  • How to use natural language processing (NLP) to process large text data
  • How to apply statistics and machine learning (ML) algorithms to uncover interesting information
  • How to uncover bias from news corpora

And you’ll be able to:

  • Utilize Jupyter notebooks to apply NLP techniques
  • Set up a system to do ML/NLP on your own
  • Make use of data science techniques and libraries to learn on your own

This training course is for you because...

  • You’re a data science practitioner or enthusiast.
  • You work with data and want to learn more about machine learning.
  • You want to become a data science expert.


Prerequisites

  • A working knowledge of Python
  • Familiarity with Jupyter notebooks
  • A basic understanding of data science (useful but not required)

About your instructor

  • Prof. Dr. Thomas Kopinski leads the data science lab at the University of Applied Sciences South Westphalia, Germany. He is currently responsible for two large data science projects: one on predictive maintenance for automobile parts (three years of funding, ~€800,000) and one on the deconstruction of nuclear power plants in the context of Germany’s energy transition (three years of funding, ~€1.5M). A team of six data scientists works on these projects under his supervision, applying machine learning techniques to solve complex research tasks.

    Before that, he worked on a wide range of research and industry projects in machine learning. He has more than 12 years of teaching experience in computer science, machine learning, and mathematics, and is enthusiastic about passing on the knowledge he has gathered in both industry and academia.

    Thomas Kopinski has also gained industry experience, for example as a freelancer in the mobile tech industry, where he was responsible for acquiring new customers as well as for designing and building smart applications for mobile devices. He earned his PhD from the Université Paris-Saclay, where he developed a hand gesture recognition system using deep learning techniques.


Schedule

The time frames are only estimates and may vary according to how the class is progressing.

Environment setup (20 minutes)

  • Presentation: The working environment (Jupyter notebooks)
  • Jupyter Notebook exercise: Set up and explore the environment
  • Q&A

Working with NLP tools for data ingestion and data wrangling (20 minutes)

  • Presentation: Tools for data manipulation
  • Jupyter Notebook exercise: Work with the data; start manipulating it and putting it together
  • Q&A

Calculate basic metrics on the dataset (10 minutes)

  • Presentation: Applying metrics to gain insights from the data
  • Jupyter Notebook exercise: Perform statistical calculations on subsets of the data
  • Q&A
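
The kind of basic metrics meant here can be sketched in a few lines of plain Python. This is a minimal illustration, not the course notebook: the three sentences in `docs` are made-up stand-ins for the BBC articles, and the `tokenize` helper is a hypothetical, deliberately simple tokenizer.

```python
import re
import statistics
from collections import Counter

# Hypothetical stand-in documents; the course works with BBC news articles.
docs = [
    "The economy grew faster than expected this quarter.",
    "The match ended in a draw after extra time.",
    "New phone models were announced at the technology fair.",
]


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in docs]

# Number of tokens per document.
doc_lengths = [len(tokens) for tokens in tokenized]

# Frequency of every distinct word across the whole collection.
vocabulary = Counter(tok for tokens in tokenized for tok in tokens)

# Average document length and share of distinct words among all tokens.
mean_length = statistics.mean(doc_lengths)
type_token_ratio = len(vocabulary) / sum(doc_lengths)

print(doc_lengths, mean_length, round(type_token_ratio, 2))
print(vocabulary.most_common(3))
```

Computing the same numbers per news category, rather than for the whole collection, is one way to compare subsets of the data.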

Break (10 minutes)

Natural language processing techniques (20 minutes)

  • Presentation: Using NLP techniques to remove data elements that carry no value and to correct mistakes in the data
  • Jupyter Notebook exercise: Perform selected NLP techniques such as lemmatization and stemming
  • Q&A
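
To make the idea of stemming concrete, here is a deliberately crude sketch that removes stop words and strips a few common English suffixes. It is an assumption-laden toy, not the algorithm used in the course: `STOP_WORDS`, `SUFFIXES`, `naive_stem`, and `preprocess` are hypothetical names, and established stemmers such as the Porter stemmer apply far more careful rules.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is", "are", "was", "were"}

# Suffixes are tried in this order, so longer ones are checked before "s".
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(tok) for tok in tokens if tok not in STOP_WORDS]


result = preprocess("The ministers were discussing the proposed taxes")
print(result)  # ['minister', 'discuss', 'propos', 'tax']
```

The stem `propos` is not a dictionary word, which is typical of stemming. Lemmatization would instead map `proposed` to its dictionary form `propose`, at the cost of needing a vocabulary and part-of-speech information.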

Data visualization (20 minutes)

  • Presentation: Using data visualization techniques—bar charts, word clouds, etc.—to gain insights
  • Jupyter Notebook exercise: Test data visualization techniques on selected parts of the dataset
  • Q&A

Exploratory data analysis (EDA) (10 minutes)

  • Presentation: EDA concepts and techniques
  • Jupyter Notebook exercise: Perform EDA on the news media dataset
  • Q&A

Break (10 minutes)

Apply basic metrics for document tagging (15 minutes)

  • Presentation: How specific metrics such as TF-IDF lay the groundwork for important tasks such as document tagging
  • Jupyter Notebook exercise: Apply metrics to selected parts of the data and interpret the results
  • Q&A
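
As a rough illustration of how TF-IDF can suggest tags, the sketch below scores each term by its relative frequency in a document multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. This is one common variant; libraries often use smoothed versions. The `docs` dictionary and the `tf_idf` helper are hypothetical and stand in for the real dataset and tooling.

```python
import math
import re
from collections import Counter

# Hypothetical one-line "articles", keyed by an illustrative category name.
docs = {
    "business": "shares rose as the bank reported record profit",
    "sport": "the team won the match and the cup",
    "tech": "the new phone has a faster chip",
}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(corpus):
    """Return {document name: {term: tf-idf score}} for a dict of texts."""
    tokenized = {name: tokenize(text) for name, text in corpus.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for name, tokens in tokenized.items():
        counts = Counter(tokens)
        scores[name] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return scores


scores = tf_idf(docs)
for name, term_scores in scores.items():
    ranked = sorted(term_scores, key=term_scores.get, reverse=True)
    print(name, ranked[:3])
```

A word such as `the` occurs in every document, so its score is zero and it never surfaces as a tag candidate, while words specific to one document rank highest.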

Selected machine learning algorithms for NLP (15 minutes)

  • Presentation: Machine learning and its application in NLP
  • Jupyter Notebook exercise: Work with selected ML models and apply hyperparameter tuning to get satisfactory results
  • Q&A
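
One possible shape of this exercise, sketched without any third-party library: a small multinomial Naive Bayes text classifier whose smoothing parameter `alpha` is chosen by trying several values on a held-out validation set. The choice of model, the headlines, and all function names are assumptions made for illustration; the course may use different models and tooling.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical labelled headlines standing in for the news dataset.
train = [
    ("shares fall as bank profit drops", "business"),
    ("bank raises interest rates again", "business"),
    ("market shares rise on profit news", "business"),
    ("team wins cup final match", "sport"),
    ("striker scores twice in match", "sport"),
    ("coach praises team after win", "sport"),
]
validation = [
    ("bank profit beats market forecast", "business"),
    ("team loses match despite late goal", "sport"),
]


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def fit(examples):
    """Count words per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in examples:
        class_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(model, text, alpha):
    """Return the class with the highest smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, examples, alpha):
    hits = sum(predict(model, text, alpha) == label for text, label in examples)
    return hits / len(examples)


model = fit(train)

# Hyperparameter tuning: evaluate each candidate alpha on the validation set.
results = {alpha: accuracy(model, validation, alpha) for alpha in (0.01, 0.1, 1.0)}
best_alpha = max(results, key=results.get)
print(results, best_alpha)
```

On a toy set this small the candidate values may well tie. With a real corpus, the validation scores usually differ, and the selected value should be confirmed on a separate test set.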

Word embeddings for NLP (15 minutes)

  • Presentation: The importance of word embeddings in NLP
  • Jupyter Notebook exercise: Train word embeddings on the dataset and apply them
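
The intuition behind word embeddings, that words used in similar contexts get similar vectors, can be shown with a count-based sketch. Note the assumption: this builds sparse co-occurrence vectors rather than training a neural model such as word2vec, and the sentences and the helpers `cooccurrence_vectors` and `cosine` are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical sentences standing in for the news corpus.
sentences = [
    "the bank raised interest rates",
    "the bank cut interest rates",
    "the team won the match",
    "the team lost the match",
]


def cooccurrence_vectors(texts, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


vectors = cooccurrence_vectors(sentences)
sim_similar = cosine(vectors["raised"], vectors["cut"])
sim_different = cosine(vectors["raised"], vectors["won"])
print(round(sim_similar, 3), round(sim_different, 3))
```

Here `raised` and `cut` end up with identical context vectors because they occur in the same surroundings, even though their meanings are opposite. Embeddings capture similarity of usage, which is worth keeping in mind when interpreting nearest neighbors.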

Wrap-up and Q&A (15 minutes)