CHAPTER 2 Planning for Machine Learning

This chapter looks at planning your machine learning projects, storage types, processing options, and data input. The chapter also covers data quality and methods to validate and clean data before you do any analysis.

The Machine Learning Cycle

A machine learning project is a cycle of actions performed in sequence (see Figure 2.1).

Figure 2.1: The machine learning process, a cycle of acquiring data, preparing it, processing it, and reporting the results

You can acquire data from many sources; it might be data that's held by your organization or open data from the Internet. There might be one dataset, or there could be 10 or more.

Accept from the outset that data needs to be cleaned and checked for quality before any processing can take place. This work happens during the prepare phase.
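As a rough illustration of what the prepare phase can involve, the following Python sketch validates and cleans a handful of records. The field names, the sample data, and the plausible age range of 0 to 120 are all assumptions made for this example; real quality rules depend on the dataset in front of you.

```python
# Hypothetical raw records, as they might arrive from an acquired dataset.
raw_records = [
    {"id": "1", "age": "34", "country": "UK"},
    {"id": "2", "age": "", "country": "uk "},     # missing age
    {"id": "3", "age": "abc", "country": "FR"},   # non-numeric age
    {"id": "1", "age": "34", "country": "UK"},    # duplicate of id 1
    {"id": "4", "age": "207", "country": "DE"},   # implausible age
]


def clean_record(record):
    """Return a cleaned copy of the record, or None if it fails validation."""
    try:
        age = int(record["age"])
    except ValueError:
        return None  # missing or non-numeric age
    if not 0 <= age <= 120:  # assumed plausible range for this example
        return None
    return {
        "id": int(record["id"]),
        "age": age,
        "country": record["country"].strip().upper(),
    }


def prepare(records):
    """Clean every record, setting aside invalid rows and duplicate ids."""
    cleaned, rejected, seen = [], [], set()
    for record in records:
        result = clean_record(record)
        if result is None or result["id"] in seen:
            rejected.append(record)
            continue
        seen.add(result["id"])
        cleaned.append(result)
    return cleaned, rejected


cleaned, rejected = prepare(raw_records)
print(f"kept {len(cleaned)} of {len(raw_records)} records")
```

Keeping the rejected rows rather than silently discarding them makes it easier to review why data was excluded before any analysis starts.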

The processing phase is where the work gets done. The machine learning routines that you have created run during this phase.

Finally, the results are presented. Reporting can happen in a variety of ways, such as feeding the data back into a data store or delivering the results as a spreadsheet or report.
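The four phases can be pictured as a chain of functions, each handing its output to the next. The sketch below is only a skeleton under stated assumptions: the function names, the hard-coded sales data, and the use of a per-store average as a stand-in for a real machine learning routine are all hypothetical.

```python
import csv
import io
from statistics import mean


def acquire():
    """Acquire: return raw rows. A hard-coded stand-in for a real data source."""
    return [
        {"store": "north", "sales": "120"},
        {"store": "north", "sales": "80"},
        {"store": "south", "sales": "n/a"},  # bad value, dropped in prepare()
        {"store": "south", "sales": "200"},
    ]


def prepare(rows):
    """Prepare: keep only rows whose sales value is a whole number."""
    return [
        {"store": row["store"], "sales": int(row["sales"])}
        for row in rows
        if row["sales"].isdigit()
    ]


def process(rows):
    """Process: placeholder for a learning routine; here, mean sales per store."""
    by_store = {}
    for row in rows:
        by_store.setdefault(row["store"], []).append(row["sales"])
    return {store: mean(values) for store, values in by_store.items()}


def report(results):
    """Report: render the results as CSV text, suitable for a spreadsheet."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["store", "mean_sales"])
    for store, value in sorted(results.items()):
        writer.writerow([store, value])
    return buffer.getvalue()


results = process(prepare(acquire()))
print(report(results))
```

Because each phase is a separate function, any one of them can be swapped out (a different data source, a different reporting format) without touching the others.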

It All Starts with a Question

There seems to be a misconception that machine learning, like Big Data, is a case of throwing enough data at a problem until the answers magically appear. As much as I'd like to say this happens all the time, it doesn't. Machine learning projects start with a question or a hunch that ...
