Chapter 2. Loading Data
2.0 Introduction
The first step in any machine learning endeavor is to get the raw data into our system. The raw data might be a logfile, a dataset file, a database, or a cloud blob store such as Amazon S3. Furthermore, we will often want to retrieve data from multiple sources.
The recipes in this chapter look at methods of loading data from a variety of sources, including CSV files and SQL databases. We also cover methods of generating simulated data with desirable properties for experimentation. Finally, while there are many ways to load data in the Python ecosystem, we will focus on using the pandas library’s extensive set of methods for loading external data and on using scikit-learn, an open source machine learning library in Python, for generating simulated data.
2.1 Loading a Sample Dataset
Problem
You want to load a preexisting sample dataset from the scikit-learn library.
Solution
scikit-learn comes with a number of popular datasets for you to use:
# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset
digits = datasets.load_digits()

# Create features matrix
features = digits.data

# Create target vector
target = digits.target

# View first observation
features[0]
array([ 0., 0., 5., 13., 9., 1., 0., 0., 0., 0., 13., 15., 10., 15., 5., 0., 0., 3., 15., 2., 0., 11., 8., 0., 0., 4., 12., 0., 0., 8., 8., 0., 0., 5., 8., 0., 0., 9., 8., 0., 0., 4., 11., 0., 1., 12., 7., 0., 0., 2., 14., 5., 10., 12., 0., 0., 0., 0., 6., 13., ...
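To make the structure of the loaded data more concrete, the following minimal sketch inspects the shapes of the features matrix and target vector and reshapes the first observation back into an image grid. It assumes the 8x8 pixel layout of scikit-learn's bundled digits dataset; the variable name first_image is ours and is used only for illustration.

```python
# Load scikit-learn's datasets
from sklearn import datasets

# Load digits dataset and split it into features and target
digits = datasets.load_digits()
features = digits.data
target = digits.target

# Each row of the features matrix is one flattened 8x8 image
print(features.shape)  # (1797, 64)
print(target.shape)    # (1797,)

# Reshape the first observation back into its 8x8 pixel grid
# (first_image is an illustrative name, not part of the library)
first_image = features[0].reshape(8, 8)

# The first row of pixels matches the start of the array shown above
print(first_image[0])

# The label for the first observation is the digit it depicts
print(target[0])  # 0
```

Viewing an observation as a grid rather than a flat vector can make it easier to sanity-check that the features and the target line up before any modeling begins.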