Normalizing data

Some datasets are easy to read but complicated to process further. Take a look at the matches file we saw in Chapter 3:

Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
...

Imagine you want to answer these questions:

  1. How many teams played?
  2. Which team scored the most goals?
  3. Which team won all matches it played?

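In its current layout, the dataset cannot answer those questions easily. Each row describes a match, so a given team appears sometimes in the Home Team column and sometimes in the Away Team column, and the goals of both teams are packed into a single Result field. To answer the questions in a simple way, you first have to normalize the data, that is, convert it to a more suitable format before proceeding. Let's work on it.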
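As a rough illustration of what normalizing means here, the following Python sketch reshapes the file so that there is one record per team per match. It is not a Kettle transformation, only a way to see the target layout. The names normalize, records, scored, and conceded are hypothetical, chosen for this example, and the input is limited to the six rows shown above, so the answers it produces apply to those rows only.

```python
import csv
import io
from collections import defaultdict

# Only the six rows shown above; the real file continues past the "..." line.
SAMPLE = """Match Date;Home Team;Away Team;Result
02/06;Italy;France;2-1
02/06;Argentina;Hungary;2-1
06/06;Italy;Hungary;3-1
06/06;Argentina;France;2-1
10/06;France;Hungary;3-1
10/06;Italy;Argentina;1-0
"""


def normalize(text):
    """Return one record per team per match (two records per input row)."""
    records = []
    for row in csv.DictReader(io.StringIO(text), delimiter=";"):
        # "2-1" means the home team scored 2 and the away team scored 1.
        home_goals, away_goals = (int(g) for g in row["Result"].split("-"))
        records.append({
            "date": row["Match Date"],
            "team": row["Home Team"],
            "opponent": row["Away Team"],
            "scored": home_goals,
            "conceded": away_goals,
        })
        records.append({
            "date": row["Match Date"],
            "team": row["Away Team"],
            "opponent": row["Home Team"],
            "scored": away_goals,
            "conceded": home_goals,
        })
    return records


records = normalize(SAMPLE)

# 1. How many teams played?
teams = {r["team"] for r in records}

# 2. Which team scored the most goals?
goals = defaultdict(int)
for r in records:
    goals[r["team"]] += r["scored"]
top_scorer = max(goals, key=goals.get)

# 3. Which team won all the matches it played?
won_all = sorted(
    t for t in teams
    if all(r["scored"] > r["conceded"] for r in records if r["team"] == t)
)

print(len(teams), top_scorer, won_all)
```

Once every record carries a single team with its own goals, each of the three questions becomes a plain grouping or filtering operation on the team field, with no need to check two columns or split the result again.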
