O'Reilly Live Online Training

Data Ingestion of Tabular Formats: Cleaning Data for Effective Data Science

David Mertz, Ph.D.

This course teaches you to read tabular data in formats like CSV, SQL databases, and fixed width files, and to use data frame libraries in various programming languages to process, transform, and filter that data.

You will learn how to read CSV, other delimited, and fixed width files, and become aware of the pitfalls you are likely to encounter, which largely concern data typing decisions. You will learn why spreadsheets are such a brittle way of communicating data, and how to remediate the problems they cause. You will also learn to work with SQL data sources. Finally, you will gain a basic familiarity with a number of data frame libraries, in several programming languages, and learn the general differences and similarities among them.
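To make the data typing pitfall concrete, here is a minimal sketch using only the Python standard library. The sample data, the column names, and the helper functions (`naive_cast`, `read_rows`, `score_or_none`) are hypothetical illustrations, not course material; they show how guessing types per value can silently damage identifiers such as ZIP codes and mishandle missing-value sentinels.

```python
import csv
import io

# Hypothetical sample data: ZIP codes look numeric but are really identifiers.
RAW = "name,zip,score\nAda,02134,9.5\nGrace,10001,N/A\n"


def naive_cast(value):
    """Guess a type the way a careless loader might: int, then float, else str."""
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    return value


def score_or_none(value):
    """Treat the sentinel 'N/A' as missing rather than as a string."""
    return None if value == "N/A" else float(value)


def read_rows(text, casters=None):
    """Read CSV text into dicts, using an explicit per-column caster if given."""
    casters = casters or {}
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(
            {col: casters.get(col, naive_cast)(val) for col, val in record.items()}
        )
    return rows


# Guessing types: the leading zero of the ZIP code is lost, and the score
# column ends up holding a mix of floats and the string "N/A".
naive = read_rows(RAW)

# Explicit decisions: keep ZIP codes as text, map the sentinel to None.
explicit = read_rows(RAW, casters={"zip": str, "score": score_or_none})

print(naive[0]["zip"], explicit[0]["zip"])
print(naive[1]["score"], explicit[1]["score"])
```

Data frame libraries typically offer their own way to declare column types at read time; the point of the sketch is only that the decision should be made deliberately rather than left to inference.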

Cleaning Data for Effective Data Science is a series of webinars, and soon a book, that addresses the central skills required of all data scientists, data analysts, and other developers in data-oriented domains.

The series covers everything a data scientist needs to do before what is usually called data science, across a broad range of topics:

  • Data Ingestion of Tabular Formats
  • Value Imputation
  • Anomaly Detection
  • Feature Engineering
  • Data Ingestion of Hierarchical and Other Data Formats

What you'll learn and how you can apply it

  • How to read CSV and other delimited files
  • How to read fixed width data files
  • How to read SQL databases into data frames
  • How to work with scientific data in HDF5 and Parquet
  • How the many available data frame libraries compare, including their commonalities and differences
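As a rough illustration of reading fixed width data and pulling SQL results into a column-oriented structure, the sketch below uses only the Python standard library. The column layout, the sample lines, and the table name `obs` are assumptions made up for this example; a data frame library would normally replace the final dictionary.

```python
import sqlite3

# Hypothetical fixed width layout: (column name, start, end) character offsets.
LAYOUT = [("station", 0, 6), ("year", 6, 10), ("temp_c", 10, 16)]
LINES = [
    "KBOS  2019  11.2",
    "KDEN  2019   9.8",
]


def parse_fixed_width(line, layout=LAYOUT):
    """Slice one line by character offsets and strip the padding."""
    return {name: line[start:end].strip() for name, start, end in layout}


# Every parsed field is still a string at this point.
records = [parse_fixed_width(line) for line in LINES]

# Load the records into an in-memory SQLite table with declared column types.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, year INTEGER, temp_c REAL)")
conn.executemany("INSERT INTO obs VALUES (:station, :year, :temp_c)", records)

# Query the table back and pivot the rows into columns.
cursor = conn.execute("SELECT station, year, temp_c FROM obs ORDER BY temp_c")
columns = [desc[0] for desc in cursor.description]
rows = cursor.fetchall()
frame = {col: [row[i] for row in rows] for i, col in enumerate(columns)}
conn.close()

print(frame)
```

Note that the numeric strings come back as numbers only because SQLite applies the declared column affinity on insert; other databases and drivers may reject the strings or keep them as text, which is one more typing decision to check.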

This training course is for you because...

  • You need to do the actual work of a data scientist: the many hours that come before the theory of machine learning, statistics, and data visualization.

Prerequisites

  • Students in this course should have a comfortable, intermediate-level knowledge of Python and/or R, and a very basic familiarity with statistics.

Course Set-up:

Students should have a system with Jupyter Notebooks installed; a recent version of scikit-learn, along with Pandas, NumPy, and matplotlib; and the general scientific Python tool stack. The training materials will be made available as notebooks at a GitHub repository. Optionally, installing R will allow students to follow some additional examples.

About your instructor

  • David Mertz is a data scientist, trainer, and erstwhile startup CTO, who is currently writing the Addison-Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped create the world's fastest (and highly specialized) supercomputer for performing molecular dynamics.

Schedule

The timeframes are only estimates and may vary according to how the class is progressing.

  • Segment 1: Tidying up / tidy data (30 minutes)
  • Segment 2: Comma separated values (30 minutes)

Break (15 minutes)

  • Segment 3: Spreadsheets considered harmful (30 minutes)
  • Segment 4: SQL RDBMS (30 minutes)
  • Segment 5: HDF5, SQLite, and Apache Parquet (30 minutes)

Break (15 minutes)

  • Segment 6: Data frames in Python, R, and Scala (45 minutes)
  • Segment 7: Exercises (15 minutes)
  • Course wrap-up and next steps