Live Online Training

Data Ingestion of Hierarchical and Other Data Formats: Cleaning Data for Effective Data Science

Topic: Data
David Mertz, Ph.D.

This course brings you up to speed on ingesting data from hierarchical and special formats.

This live online training has a twofold focus. The first part covers formats that are designed for data representation but hold data that does not fit easily into tabular form; JSON, XML, and NoSQL databases are prominent examples. The second part covers formats that are not per se about data at all but that often contain important data: HTML, PDF, images, and custom binary or text formats.

Cleaning Data for Effective Data Science is a series of live online trainings, and soon a book, that addresses the central skills required of all data scientists, data analysts, and other developers in data-oriented domains.

Everything a data scientist needs to do before doing what is usually called data science is covered across a broad range of topics:

  • Data Ingestion of Tabular Formats
  • Value Imputation
  • Anomaly Detection
  • Feature Engineering
  • Data Ingestion of Hierarchical and Other Data Formats

What you'll learn and how you can apply it

  • How to work with hierarchically structured data sources such as JSON and XML
  • How to access NoSQL databases within hierarchical or “object database” structures
  • How to scrape data from web pages and PDF documents
  • How to treat images as data sources
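
As a small illustration of the first item, the following is a minimal sketch of turning one hierarchical JSON record into flat, table-like rows. The record, its field names, and the `flatten` helper are hypothetical and are not taken from the course materials; real data would usually need more defensive handling of missing or irregular fields.

```python
import json

# Hypothetical record: one customer with a nested address and a
# variable-length list of orders, which does not map directly onto one row.
raw = """
{"customer": "Ada",
 "address": {"city": "London", "country": "UK"},
 "orders": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}]}
"""


def flatten(record):
    """Return one flat dict (row) per order, repeating the customer fields."""
    base = {
        "customer": record["customer"],
        "city": record["address"]["city"],
        "country": record["address"]["country"],
    }
    # Each order becomes its own row; customers with no orders yield no rows.
    return [{**base, **order} for order in record.get("orders", [])]


rows = flatten(json.loads(raw))
for row in rows:
    print(row)
```

Repeating the parent fields on every child row is only one possible design; keeping customers and orders in two separate tables linked by a key is an equally reasonable choice.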

This training course is for you because...

  • You need to do the actual work of a data scientist: the many hours that come before the theory of machine learning, statistics, and data visualization.


Prerequisites:

  • Students in this course should have a comfortable, intermediate-level knowledge of Python and/or R and a very basic familiarity with statistics.

Course Set-up:

Students should have a system with Jupyter Notebooks installed; a recent version of scikit-learn, along with Pandas, NumPy, and matplotlib; and the general scientific Python tool stack. The training materials will be made available as notebooks at a GitHub repository. Optionally, installing R will allow students to follow some additional examples.

About your instructor

  • David Mertz is a data scientist, trainer, and erstwhile startup CTO who is currently writing the Addison-Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped create the world's fastest (highly specialized) supercomputer for performing molecular dynamics.


The timeframes are only estimates and may vary according to how the class is progressing.

Part 1: Data Ingestion - Hierarchical Formats

  • Segment 1: Introduction (15 minutes)
  • Segment 2: JSON (30 minutes)
  • Segment 3: XML (30 minutes)
  • Break (15 minutes)

Part 2: Data Ingestion - Other Data Sources

  • Segment 4: NoSQL Databases (30 minutes)
  • Segment 5: Web Scraping (30 minutes)
  • Segment 6: Image Formats (30 minutes)
  • Break (15 minutes)
  • Segment 7: Binary Serialized Data Structures (15 minutes)
  • Segment 8: Custom Text Formats (15 minutes)
  • Segment 9: Exercises (15 minutes)