Chapter 3. Classification Walkthrough: Titanic Dataset

This chapter will walk through a common classification problem using the Titanic dataset. Later chapters will dive into and expand on the common steps performed during an analysis.

Project Layout Suggestion

An excellent tool for performing exploratory data analysis is Jupyter. Jupyter is an open-source notebook environment that supports Python and other languages. It allows you to create cells of code or Markdown content.

I tend to use Jupyter in two modes. One is for exploratory data analysis and quickly trying things out. The other is more of a deliverable style, where I format a report using Markdown cells and insert code cells to illustrate important points or discoveries. If you aren’t careful, exploratory notebooks accumulate code that later needs refactoring and the application of software engineering practices (removing globals, using functions and classes, etc.).
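As a minimal sketch of the kind of refactoring mentioned above, the following moves a calculation out of notebook-level globals and into a function that takes its inputs as arguments. The function name `survival_rate`, the `min_age` parameter, and the toy rows are all hypothetical and made up for illustration; they are not taken from the Titanic data.

```python
def survival_rate(rows, min_age=0):
    """Return the fraction of survivors among rows with a known age >= min_age.

    Everything the function needs is passed in, so it does not depend on
    variables that happen to exist in earlier notebook cells.
    """
    selected = [
        row for row in rows
        if row["age"] is not None and row["age"] >= min_age
    ]
    if not selected:
        return 0.0
    return sum(row["survived"] for row in selected) / len(selected)


# Toy records, invented for this sketch (not real passenger data).
passengers = [
    {"age": 22, "survived": 0},
    {"age": 38, "survived": 1},
    {"age": None, "survived": 1},
    {"age": 4, "survived": 1},
]

overall = survival_rate(passengers)
adults = survival_rate(passengers, min_age=18)
```

Because the function has no hidden state, it can be rerun in any cell order, moved into a module, and unit tested.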

The cookiecutter data science package suggests a project layout that makes an analysis easy to reproduce and its code easy to share.
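To make the idea of a standard layout concrete, here is a small sketch that creates a project skeleton with `pathlib`. The directory names in `LAYOUT` and the helper `make_layout` are assumptions chosen for illustration only; they are not the official cookiecutter template, so consult that project's documentation for the structure it actually generates.

```python
from pathlib import Path
import tempfile

# Hypothetical directory names, chosen only for illustration.
LAYOUT = ("data/raw", "data/processed", "notebooks", "src", "reports")


def make_layout(root, subdirs=LAYOUT):
    """Create each subdirectory under root and return the created paths."""
    root = Path(root)
    created = []
    for sub in subdirs:
        path = root / sub
        # parents=True builds intermediate directories such as "data".
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Build the skeleton in a throwaway directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    paths = make_layout(tmp)
    made = sorted(p.relative_to(tmp).as_posix() for p in paths)
```

Keeping raw data, notebooks, and importable source code in separate directories is one common way to make it clear which files are inputs, which are exploratory, and which are meant to be reused.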

Imports

This example is based mostly on pandas, scikit-learn, and Yellowbrick. The pandas library gives us tooling for easy data munging, scikit-learn provides the predictive modeling tools, and Yellowbrick is a visualization library for evaluating models:

>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> from sklearn import (
...     ensemble,
...     preprocessing,
...     tree,
... )
>>> from sklearn.metrics import (
...     auc,
...     confusion_matrix ...
