Chapter 3. Classification Walkthrough: Titanic Dataset
This chapter will walk through a common classification problem using the Titanic dataset. Later chapters will dive into and expand on the common steps performed during an analysis.
Project Layout Suggestion
An excellent tool for performing exploratory data analysis is Jupyter. Jupyter is an open-source notebook environment that supports Python and other languages. It allows you to create cells of code or Markdown content.
I tend to use Jupyter in two modes. One is for exploratory data analysis and quickly trying things out. The other is more of a deliverable style where I format a report using Markdown cells and insert code cells to illustrate important points or discoveries. If you aren’t careful, your notebooks might need some refactoring and application of software engineering practices (remove globals, use functions and classes, etc.).
The cookiecutter data science package suggests a project layout that makes an analysis easy to reproduce and makes code easy to share.
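As a rough illustration, the skeleton of such a layout can be created by hand; this is only a sketch in the spirit of the cookiecutter-data-science template (the project name and exact directory names here are hypothetical, not the template's full output):

```shell
# Sketch of a cookiecutter-data-science-style layout (project name is
# hypothetical; the real template generates more directories than this)
mkdir -p titanic_project/data/raw
mkdir -p titanic_project/data/processed
mkdir -p titanic_project/notebooks
mkdir -p titanic_project/src
mkdir -p titanic_project/reports
touch titanic_project/README.md
```

Separating raw from processed data keeps the original inputs immutable, which is what makes a rerun of the analysis reproducible.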
Imports
This example is based mostly on pandas, scikit-learn, and Yellowbrick. The pandas library gives us tooling for easy data munging. The scikit-learn library has great predictive modeling, and Yellowbrick is a visualization library for evaluating models:
>>> import matplotlib.pyplot as plt
>>> import pandas as pd
>>> from sklearn import (
...     ensemble,
...     preprocessing,
...     tree,
... )
>>> from sklearn.metrics import (
...     auc,
...     confusion_matrix,
...
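To give a feel for where these imports lead, here is a minimal, self-contained sketch of the kind of classification workflow the chapter walks through. The data below is a tiny synthetic stand-in, not the actual Titanic dataset, and the column names are assumptions chosen to mirror it:

```python
import pandas as pd
from sklearn import tree
from sklearn.model_selection import train_test_split

# Tiny synthetic stand-in for the Titanic data (hypothetical values,
# not the real dataset), so the sketch runs on its own.
df = pd.DataFrame(
    {
        "pclass": [1, 3, 2, 3, 1, 2, 3, 1],
        "age": [29, 22, 40, 2, 58, 35, 19, 47],
        "sex": [0, 1, 1, 0, 0, 1, 1, 0],  # already label-encoded
        "survived": [1, 0, 0, 1, 1, 0, 0, 1],
    }
)

# Split features from the label, then hold out a test set.
X = df.drop(columns="survived")
y = df["survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Fit a decision tree and score it on the held-out data.
clf = tree.DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))
```

The real walkthrough replaces the synthetic frame with the Titanic data and adds the preprocessing and evaluation steps covered in later sections.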