CHAPTER 3 Data Acquisition Techniques

“Computers aren't the thing. They're the thing that gets us to the thing.”

Joe MacMillan

This quote comes from the television program Halt and Catch Fire. For our purposes, we can restate it: “Data isn't the thing. Data is the thing that gets us to the thing.” The questions to ask are where the data is coming from and whether it needs cleaning or transforming.

When it comes to machine learning projects, you'll spend a large portion of your time getting the data into the right shape so it can be processed. Welcome to the dark art that is extracting, transforming, and loading data.

Scraping Data

The sad reality is that data is rarely packaged as neatly as we want. Sure, there are exceptions like Wikidata and the Facebook Graph API, and there are application programming interfaces (APIs) that will give you nicely prepared data (more on that shortly). But you must be prepared to work with the messy world of scraping data.

Processing scraped data requires a few steps to get it from the usual messy state it's in to something usable.

  1. Figure out where the data is coming from.
  2. Figure out how you're going to get it.
  3. Make it machine readable.
  4. Make sure the values are workable.
  5. Figure out where to store it.
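
The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the HTML snippet, the `TableCellParser` class, the helper functions, and the column names are all hypothetical, and an inline string stands in for a page you would normally fetch. It uses only the standard library and writes the result to an in-memory CSV rather than a real data store.

```python
import csv
import io
from html.parser import HTMLParser

# Steps 1 and 2: know the source and how to get it. Here a hypothetical
# inline snippet stands in for a page that would normally be fetched.
RAW_HTML = """
<table>
  <tr><th>City</th><th>Temp</th></tr>
  <tr><td> London </td><td>12.5 C</td></tr>
  <tr><td>Paris</td><td>n/a</td></tr>
  <tr><td>Berlin</td><td> 9 C</td></tr>
</table>
"""


class TableCellParser(HTMLParser):
    """Step 3: make it machine readable by collecting <td> text per row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            # Header rows contain only <th> cells, so they stay empty
            # and are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row is not None:
            self._row[-1] += data


def clean_row(row):
    """Step 4: make sure the values are workable; return None if not."""
    if len(row) != 2:
        return None
    city = row[0].strip()
    raw_temp = row[1].strip().rstrip(" C")  # drop the trailing unit
    try:
        return city, float(raw_temp)
    except ValueError:
        return None  # e.g. "n/a"


def scrape(html):
    parser = TableCellParser()
    parser.feed(html)
    cleaned = (clean_row(row) for row in parser.rows)
    return [record for record in cleaned if record is not None]


def to_csv(records):
    """Step 5: store it. An in-memory CSV stands in for a file or database."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["city", "temp_c"])
    writer.writerows(records)
    return buffer.getvalue()


records = scrape(RAW_HTML)
csv_text = to_csv(records)
print(csv_text)
```

The Paris row is dropped because its temperature cannot be parsed as a number; whether to drop, flag, or impute such values is a choice that depends on the project.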

Copy and Paste

There will be a day you'll have to extract data from a web page or a series of web pages. Truth be told, they tend to be a mess, but some are better than others. A first ...
