Chapter 13. Capstone: Python for Data Analytics
At the end of Chapter 8 you extended what you learned about R to explore and test relationships in the mpg dataset. We’ll do the same in this chapter, using Python. We’ve conducted the same work in Excel and R, so I’ll focus less on the whys of our analysis in favor of the hows of doing it in Python.
To get started, let’s call in all the necessary modules. Some of these are new: from scipy, we’ll import the stats submodule. To do this, we’ll use the from keyword to tell Python what module to look for, then the usual import keyword to choose a submodule. As the name suggests, we’ll use the stats submodule of scipy to conduct our statistical analysis. We’ll also be using a new package called sklearn, or scikit-learn, to validate our model on a train/test split. This package has become a dominant resource for machine learning and also comes installed with Anaconda.
In [1]: import pandas as pd
        import seaborn as sns
        import matplotlib.pyplot as plt
        from scipy import stats
        from sklearn import linear_model
        from sklearn import model_selection
        from sklearn import metrics
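As a small aside on the from ... import pattern, the sketch below checks that the name it creates points at the same submodule you would reach through the parent package. The variable name same_module is only for illustration and is not part of the chapter's code.

```python
import scipy
from scipy import stats

# "from scipy import stats" binds the submodule to the short name `stats`,
# so the rest of the code can write stats.<function> instead of
# scipy.stats.<function>.
same_module = stats is scipy.stats  # illustrative name, not from the chapter

print(stats.__name__)
print(same_module)
```

The sklearn imports in the cell above work the same way: each one makes a submodule such as linear_model available under its own short name.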
With the usecols argument of read_csv() we can specify which columns to read into the DataFrame:
In [2]: mpg = pd.read_csv('datasets/mpg/mpg.csv',
                          usecols=['mpg', 'weight', 'horsepower',
                                   'origin', 'cylinders'])
        mpg.head()
Out[2]:
    mpg  cylinders  horsepower  weight origin
0  18.0          8         130    3504    USA
1  15.0          8         165    3693    USA
2  18.0          8         150    3436    USA
3  16.0          8         150    3433    USA
4  17.0          8         140    3449    USA
...
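To see what usecols does without needing the file on disk, here is a minimal sketch that reads a tiny in-memory CSV. The five data rows are copied from the output above. The note column, its values, and the names csv_text, wanted, and sample are made up for illustration, standing in for any column you do not want to load.

```python
import io

import pandas as pd

# A tiny in-memory CSV: the five rows printed above plus a hypothetical
# 'note' column (made up for this sketch) that we do not want to load.
csv_text = """mpg,cylinders,horsepower,weight,origin,note
18.0,8,130,3504,USA,a
15.0,8,165,3693,USA,b
18.0,8,150,3436,USA,c
16.0,8,150,3433,USA,d
17.0,8,140,3449,USA,e
"""

# Same column list, in the same order, as the read_csv() call above
wanted = ['mpg', 'weight', 'horsepower', 'origin', 'cylinders']
sample = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

# 'note' is skipped, and the columns keep the order they have in the file,
# not the order of the usecols list
print(list(sample.columns))
# ['mpg', 'cylinders', 'horsepower', 'weight', 'origin']
```

This is why the columns in the output above appear as mpg, cylinders, horsepower, weight, origin even though the list passed to usecols names them in a different order.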