Practical Data Analysis - Second Edition by Dr. Sampath Kumar, Hector Cuesta

Chapter 2. Preprocessing Data

Building real-world data analytics solutions requires accurate data. In this chapter, we discuss how to collect, clean, normalize, and transform raw data into a standard format such as Comma-Separated Values (CSV) or JavaScript Object Notation (JSON), using OpenRefine, a tool for processing messy data.

In this chapter, we will cover the following:

  • Data sources
  • Data scrubbing
  • Data reduction methods
  • Data formats
  • Getting started with OpenRefine

Data sources

The term data source covers all the technology related to the extraction and storage of data. A data source can be anything from a simple text file to a large database. Raw data can come from observation logs, sensors, transactions, or user behavior.
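To make the range of data sources concrete, here is a minimal sketch that reads the same records from a text source and from a database, using only the Python standard library. The sensor readings, the `readings` table, and the helper names `rows_from_csv` and `rows_from_sqlite` are assumptions made up for illustration; they do not come from the book.

```python
import csv
import io
import sqlite3


def rows_from_csv(text_stream):
    """Read records from a CSV text source as a list of dicts."""
    return list(csv.DictReader(text_stream))


def rows_from_sqlite(connection, query):
    """Read records from a database source as a list of dicts."""
    cursor = connection.execute(query)
    columns = [col[0] for col in cursor.description]
    return [dict(zip(columns, row)) for row in cursor.fetchall()]


# Source 1: a simple text file, simulated here with an in-memory stream.
csv_source = io.StringIO("sensor,reading\nA,21.5\nB,19.0\n")
csv_rows = rows_from_csv(csv_source)

# Source 2: a database, here an in-memory SQLite instance (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (sensor TEXT, reading REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)", [("A", 21.5), ("B", 19.0)]
)
db_rows = rows_from_sqlite(
    conn, "SELECT sensor, reading FROM readings ORDER BY sensor"
)
conn.close()

print(csv_rows)
print(db_rows)
```

Note that the CSV reader returns every value as a string, while the database returns typed values. Differences like this between sources are one reason raw data usually needs preprocessing before analysis.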

A dataset
