In this phase, data integration, selection, cleaning, and preprocessing are performed. This is often the most time-consuming phase, and arguably the most important one, because every later step depends on high-quality data. The more data you have, the dirtier it tends to be.

Again, this phase resembles a database development project. System integration, querying and selection, cleaning, and other preprocessing steps (needed before the data can be used in a new database model) are all expected. This often involves aggregating the data, building primary-key/foreign-key relationships, cleansing, and so on.
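
The steps named above (cleansing, building key relationships, and aggregating) can be sketched in a few lines of Python using the standard-library sqlite3 module. This is only a minimal illustration under stated assumptions: the customer and purchase tables, the sample rows, and the helper names clean_customers, build_database, and total_per_customer are all hypothetical and do not come from any particular project.

```python
import sqlite3

# Hypothetical raw records from two source systems (illustrative data only).
raw_customers = [
    (1, " Alice "),   # stray whitespace
    (2, "BOB"),       # inconsistent capitalisation
    (2, "BOB"),       # duplicate row
    (3, None),        # missing name
]
raw_orders = [
    (1, 20.0),
    (1, 5.5),
    (2, 12.0),
    (9, 7.0),         # refers to a customer that does not exist
]


def clean_customers(rows):
    """Cleansing: drop rows with missing names, normalise text, remove duplicate ids."""
    seen, cleaned = set(), []
    for cust_id, name in rows:
        if name is None or cust_id in seen:
            continue
        seen.add(cust_id)
        cleaned.append((cust_id, name.strip().title()))
    return cleaned


def build_database(customers, orders):
    """Integration: load both sources into one schema with a foreign-key relationship."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE purchase ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL REFERENCES customer(id),"
        " amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)", customers)
    # Selection: keep only orders whose customer survived cleansing.
    valid_ids = {cust_id for cust_id, _ in customers}
    kept = [(c, a) for c, a in orders if c in valid_ids]
    conn.executemany("INSERT INTO purchase (customer_id, amount) VALUES (?, ?)", kept)
    conn.commit()
    return conn


def total_per_customer(conn):
    """Aggregation: total purchase amount per customer, joined through the key."""
    return conn.execute(
        "SELECT c.name, SUM(p.amount) FROM customer c"
        " JOIN purchase p ON p.customer_id = c.id"
        " GROUP BY c.id, c.name ORDER BY c.id"
    ).fetchall()


conn = build_database(clean_customers(raw_customers), raw_orders)
print(total_per_customer(conn))  # [('Alice', 25.5), ('Bob', 12.0)]
```

In this sketch, orphaned orders are filtered out before loading rather than left for the database to reject. With foreign keys enabled, inserting such a row would raise sqlite3.IntegrityError, which is a reasonable alternative when bad rows should stop the load instead of being silently dropped.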
