Live Online Training

Natural Language Processing (NLP) from Scratch

Bruno Gonçalves

The rise of online social platforms has resulted in an explosion of written text in the form of blogs, posts, tweets, wiki pages, and more. This new wealth of data provides a unique opportunity to explore natural language in its many forms, both as a way of automatically extracting information from written text and as a way of artificially producing text that looks natural.

In this class we introduce attendees to natural language processing from scratch. Each concept is explained through coding examples that use nothing more than plain Python and NumPy. In this way, attendees learn the underlying concepts and techniques in depth instead of just learning how to use a specific NLP library.

What you'll learn and how you can apply it

  • Text representation
  • Topic modeling
  • Sentiment analysis
  • Language detection
  • Text classification
  • Document clustering

This training course is for you because...

  • You're a data scientist who is interested in mastering the concepts and ideas behind natural language processing.
  • You have no previous experience in NLP and want to take the first grounded steps.
  • You have previous experience using NLP libraries such as NLTK or spaCy and wish to get a greater understanding of what's going on "under the hood."


Prerequisites:

  • Attendees should understand basic Python.

Course Set-up:

Recommended Preparation:

Recommended Follow-up:

About your instructor

  • Bruno Gonçalves is currently a Senior Data Scientist working at the intersection of Data Science and Finance. Previously, he was a Data Science fellow at NYU's Center for Data Science while on leave from a tenured faculty position at Aix-Marseille Université. Since completing his PhD in the Physics of Complex Systems in 2008, he has been pursuing the use of Data Science and Machine Learning to study Human Behavior. Using large datasets from Twitter, Wikipedia, web access logs, and Yahoo! Meme, he studied how we can observe both large-scale and individual human behavior in an unobtrusive and widespread manner. The main applications have been to the study of Computational Linguistics, Information Diffusion, Behavioral Change and Epidemic Spreading. In 2015 he was awarded the Complex Systems Society's Junior Scientific Award for "outstanding contributions in Complex Systems Science," and in 2018 he was named a Science Fellow of the Institute for Scientific Interchange in Turin, Italy.


The timeframes are only estimates and may vary according to how the class is progressing.

Segment 1 Text Representation (50m)

  • Represent words and numbers
  • Use One-Hot Encoding
  • Implement Bag of Words
  • Apply stopwords
  • Understand TF-IDF
  • Understand stemming
  • Break 10m
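
As an informal illustration of the Segment 1 topics, the following sketch builds bag-of-words counts, filters stopwords, and computes one common TF-IDF weighting in plain Python. The toy corpus, the stopword list, and the function names are assumptions made for this example, and the weighting shown (relative term frequency times the log of inverse document frequency) is only one of several variants; it is not necessarily the formulation used in class.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stopword list, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]
stopwords = {"the", "on", "and", "are"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in stopwords]


def bag_of_words(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def tf_idf(corpus):
    """Return one {word: weight} dict per document.

    Weight = (count / document length) * log(N / document frequency).
    """
    bags = [bag_of_words(tokenize(d)) for d in corpus]
    n_docs = len(bags)

    # Document frequency: in how many documents does each word appear?
    df = Counter()
    for bag in bags:
        df.update(bag.keys())

    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append(
            {w: (c / total) * math.log(n_docs / df[w]) for w, c in bag.items()}
        )
    return weights


weights = tf_idf(docs)
for doc, w in zip(docs, weights):
    print(doc, "->", w)
```

In this toy corpus "cat" appears in one document and "sat" in two, so "cat" receives the larger weight in the first document.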

Segment 2 Topic Modeling (60m)

  • Find topics in documents
  • Perform Explicit Semantic Analysis
  • Understand document clustering
  • Implement Latent Semantic Analysis
  • Implement Non-negative Matrix Factorization
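
The methods listed in Segment 2 all start from comparing documents as vectors of word counts. The sketch below shows only that building block: cosine similarity between bag-of-words vectors, used to find each document's nearest neighbour. It is a deliberately simpler stand-in, not an implementation of Latent Semantic Analysis or Non-negative Matrix Factorization, and the four example documents and the `nearest` helper are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two word-count vectors stored as Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical documents: two about finance, two about sports.
docs = [
    "stocks fell as markets closed",
    "markets rallied and stocks rose",
    "the team won the final match",
    "the match ended and the team celebrated",
]
vectors = [Counter(d.split()) for d in docs]


def nearest(i):
    """Index of the document most similar to document i (excluding itself)."""
    others = [j for j in range(len(vectors)) if j != i]
    return max(others, key=lambda j: cosine(vectors[i], vectors[j]))


pairs = [nearest(i) for i in range(len(docs))]
print(pairs)
```

Each finance document pairs with the other finance document, and likewise for the sports documents, which is the intuition that clustering methods build on.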

Segment 3 Sentiment Analysis (40m)

  • Quantify words and feelings
  • Use negations and modifiers
  • Understand corpus-based approaches
  • Break 10m
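
To make the Segment 3 ideas concrete, here is a minimal sketch of lexicon-based sentiment scoring with negations and modifiers. The word scores, the negation and modifier lists, and the rule that a negation or modifier applies to the next sentiment word are all assumptions chosen for illustration; a real lexicon and scoping rules would be considerably richer.

```python
# Hypothetical lexicon: the words and values are illustrative only.
lexicon = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
negations = {"not", "never", "no"}
modifiers = {"very": 1.5, "slightly": 0.5}


def sentiment(text):
    """Sum word scores, letting negations flip and modifiers scale
    the next sentiment-bearing word (a simplifying assumption)."""
    score = 0.0
    sign = 1.0   # flipped by each negation seen since the last scored word
    scale = 1.0  # multiplied by each modifier seen since the last scored word
    for word in text.lower().split():
        if word in negations:
            sign = -sign
        elif word in modifiers:
            scale *= modifiers[word]
        elif word in lexicon:
            score += sign * scale * lexicon[word]
            sign, scale = 1.0, 1.0  # reset after the word is scored
    return score


print(sentiment("the food was very good"))      # 1.5
print(sentiment("the food was not very good"))  # -1.5
print(sentiment("not bad"))                     # 1.0
```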

Segment 4 Applications (70m)

  • Understand Word2vec word embeddings
  • Define GloVe
  • Apply Language detection
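
One simple way to approach the language detection topic above is to count how many of a text's words appear in each language's stopword list. The sketch below assumes this stopword-overlap approach, which may differ from the method taught in class, and the tiny stopword sets and the `detect_language` name are hypothetical.

```python
# Hypothetical, very small stopword samples; real lists are much longer.
stopword_sets = {
    "english": {"the", "and", "is", "of", "to", "in"},
    "portuguese": {"o", "a", "e", "de", "que", "em"},
    "french": {"le", "la", "et", "de", "est", "les"},
}


def detect_language(text):
    """Return (best_language, scores), scoring each language by the number
    of tokens found in its stopword set. Ties go to the first language."""
    tokens = text.lower().split()
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in stopword_sets.items()
    }
    return max(scores, key=scores.get), scores


print(detect_language("the cat is in the garden"))
print(detect_language("o gato que dorme em casa e feliz"))
print(detect_language("le chat est dans la maison"))
```

Very short texts or texts without any listed stopwords give unreliable results with this approach, so it is best read as a starting point.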