Chapter 8. Data Cleaning

A universal problem when working with data is understanding how complete your data is. Data engineering depends on the ability to clean, process, and visualize data. Now that you’re familiar with the basics of notebook-based code editors and how to bring data into them, either locally in a Jupyter Notebook or in the cloud with Google Colab, it’s time to learn how to clean your data. Data is frequently incomplete (missing), inconsistently formatted, or otherwise inaccurate; these problems are often called messy data. Data cleaning is the process of addressing these problems and preparing the data for analysis.

In this chapter, we’ll explore some publicly available datasets, finding and cleaning up messes with a few packages that you can load into a Colab notebook. You’re going to work with NYPD_Complaint_Data_Historic, a dataset from NYC Open Data, the open data portal for New York City, updated on July 7, 2021. I filtered the data for 2020 to make it a little more manageable for viewing and manipulating. You can filter the data based on your own data question and export it as a CSV file. This chapter will show you how to manage, remove, update, and consolidate data and process it with a few useful Python packages. An analysis is only as accurate as the dataset or database behind it, and this chapter will provide tools to assess and address common inconsistencies.
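As a first look at what "assessing" a dataset can mean, here is a minimal sketch that counts empty cells per column in a CSV export. It uses only Python's standard library so that it runs anywhere; the packages used in this chapter may do the same thing more concisely. The sample rows, the column names, and the helper `count_missing` are hypothetical stand-ins for illustration, not the contents of the real export.

```python
import csv
import io

# Hypothetical sample standing in for a filtered CSV export.
# Column names and values are illustrative assumptions only.
SAMPLE_CSV = """CMPLNT_NUM,CMPLNT_FR_DT,BORO_NM,OFNS_DESC
1001,01/15/2020,BRONX,PETIT LARCENY
1002,,BROOKLYN,
1003,02/03/2020,,ROBBERY
1004,03/09/2020,QUEENS,PETIT LARCENY
"""


def count_missing(rows, columns):
    """Return a dict mapping each column name to its number of empty cells."""
    missing = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            # Treat absent fields and whitespace-only strings as missing.
            if value is None or value.strip() == "":
                missing[col] += 1
    return missing


# In a notebook you would open the exported file instead of io.StringIO.
reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
rows = list(reader)
missing_counts = count_missing(rows, reader.fieldnames)

for column, n_missing in missing_counts.items():
    print(f"{column}: {n_missing} missing of {len(rows)} rows")
```

A per-column count like this is a quick way to decide which fields need attention before any analysis. It only detects empty cells; placeholder values such as "UNKNOWN" or inconsistent formats would need separate checks.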

Checking for Missing Data

If you’ve ever participated in a data competition, ...
