Data Science Projects with Python

Name: Data Science Projects with Python
Author: Stephen Klosterman
ISBN: 9781838551025

by Stephen Klosterman

April 2019

Beginner to intermediate

374 pages

8h 11m

English

Packt Publishing

Preface
  About the Book
  About the Author
  Objectives
  Audience
  Approach
  Hardware Requirements
  Software Requirements
  Installation and Setup
  Conventions
Chapter 1: Data Exploration and Cleaning
  Introduction
  Python and the Anaconda Package Management System
  Indexing and the Slice Operator
  Exercise 1: Examining Anaconda and Getting Familiar with Python
  Different Types of Data Science Problems
  Loading the Case Study Data with Jupyter and pandas
  Exercise 2: Loading the Case Study Data in a Jupyter Notebook
  Getting Familiar with Data and Performing Data Cleaning
  The Business Problem
  Data Exploration Steps
  Exercise 3: Verifying Basic Data Integrity
  Boolean Masks
  Exercise 4: Continuing Verification of Data Integrity
  Exercise 5: Exploring and Cleaning the Data
  Data Quality Assurance and Exploration
  Exercise 6: Exploring the Credit Limit and Demographic Features
  Deep Dive: Categorical Features
  Exercise 7: Implementing OHE for a Categorical Feature
  Exploring the Financial History Features in the Dataset
  Activity 1: Exploring Remaining Financial Features in the Dataset
  Summary
Chapter 2: Introduction to Scikit-Learn and Model Evaluation
  Introduction
  Exploring the Response Variable and Concluding the Initial Exploration
  Introduction to Scikit-Learn
  Generating Synthetic Data
  Data for a Linear Regression
  Exercise 8: Linear Regression in Scikit-Learn
  Model Performance Metrics for Binary Classification
  Splitting the Data: Training and Testing sets
  Classification Accuracy
  True Positive Rate, False Positive Rate, and Confusion Matrix
  Exercise 9: Calculating the True and False Positive and Negative Rates and Confusion Matrix in Python
  Discovering Predicted Probabilities: How Does Logistic Regression Make Predictions?
  Exercise 10: Obtaining Predicted Probabilities from a Trained Logistic Regression Model
  The Receiver Operating Characteristic (ROC) Curve
  Precision
  Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
  Summary
Chapter 3: Details of Logistic Regression and Feature Exploration
  Introduction
  Examining the Relationships between Features and the Response
  Pearson Correlation
  F-test
  Exercise 11: F-test and Univariate Feature Selection
  Finer Points of the F-test: Equivalence to t-test for Two Classes and Cautions
  Hypotheses and Next Steps
  Exercise 12: Visualizing the Relationship between Features and Response
  Univariate Feature Selection: What It Does and Doesn't Do
  Understanding Logistic Regression with function Syntax in Python and the Sigmoid Function
  Exercise 13: Plotting the Sigmoid Function
  Scope of Functions
  Why is Logistic Regression Considered a Linear Model?
  Exercise 14: Examining the Appropriateness of Features for Logistic Regression
  From Logistic Regression Coefficients to Predictions Using the Sigmoid
  Exercise 15: Linear Decision Boundary of Logistic Regression
  Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients
  Summary
Chapter 4: The Bias-Variance Trade-off
  Introduction
  Estimating the Coefficients and Intercepts of Logistic Regression
  Gradient Descent to Find Optimal Parameter Values
  Exercise 16: Using Gradient Descent to Minimize a Cost Function
  Assumptions of Logistic Regression
  The Motivation for Regularization: The Bias-Variance Trade-off
  Exercise 17: Generating and Modeling Synthetic Classification Data
  Lasso (L1) and Ridge (L2) Regularization
  Cross Validation: Choosing the Regularization Parameter and Other Hyperparameters
  Exercise 18: Reducing Overfitting on the Synthetic Data Classification Problem
  Options for Logistic Regression in Scikit-Learn
  Scaling Data, Pipelines, and Interaction Features in Scikit-Learn
  Activity 4: Cross-Validation and Feature Engineering with the Case Study Data
  Summary
Chapter 5: Decision Trees and Random Forests
  Introduction
  Decision trees
  The Terminology of Decision Trees and Connections to Machine Learning
  Exercise 19: A Decision Tree in scikit-learn
  Training Decision Trees: Node Impurity
  Features Used for the First splits: Connections to Univariate Feature Selection and Interactions
  Training Decision Trees: A Greedy Algorithm
  Training Decision Trees: Different Stopping Criteria
  Using Decision Trees: Advantages and Predicted Probabilities
  A More Convenient Approach to Cross-Validation
  Exercise 20: Finding Optimal Hyperparameters for a Decision Tree
  Random Forests: Ensembles of Decision Trees
  Random Forest: Predictions and Interpretability
  Exercise 21: Fitting a Random Forest
  Checkerboard Graph
  Activity 5: Cross-Validation Grid Search with Random Forest
  Summary
Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client
  Introduction
  Review of Modeling Results
  Dealing with Missing Data: Imputation Strategies
  Preparing Samples with Missing Data
  Exercise 22: Cleaning the Dataset
  Exercise 23: Mode and Random Imputation of PAY_1
  A Predictive Model for PAY_1
  Exercise 24: Building a Multiclass Classification Model for Imputation
  Using the Imputation Model and Comparing it to Other Methods
  Confirming Model Performance on the Unseen Test Set
  Financial Analysis
  Financial Conversation with the Client
  Exercise 25: Characterizing Costs and Savings
  Activity 6: Deriving Financial Insights
  Final Thoughts on Delivering the Predictive Model to the Client
  Summary
Appendix
  Chapter 1: Data Exploration and Cleaning
  Activity 1: Exploring Remaining Financial Features in the Dataset
  Chapter 2: Introduction to Scikit-Learn and Model Evaluation
  Activity 2: Performing Logistic Regression with a New Feature and Creating a Precision-Recall Curve
  Chapter 3: Details of Logistic Regression and Feature Exploration
  Activity 3: Fitting a Logistic Regression Model and Directly Using the Coefficients
  Chapter 4: The Bias-Variance Trade-off
  Activity 4: Cross-Validation and Feature Engineering with the Case Study Data
  Chapter 5: Decision Trees and Random Forests
  Activity 5: Cross-Validation Grid Search with Random Forest
  Chapter 6: Imputation of Missing Data, Financial Analysis, and Delivery to Client
  Activity 6: Deriving Financial Insights

Overview

Data Science Projects with Python introduces you to data science and machine learning using Python through practical examples. In this book, you'll learn to analyze, visualize, and model data, applying techniques like logistic regression and random forests. With a case-study method, you'll build confidence implementing insights in real-world scenarios.

What this Book will help me do

Set up a data science environment with necessary Python libraries such as pandas and scikit-learn.
Effectively visualize data insights through Matplotlib and summary statistics.
Apply machine learning models including logistic regression and random forests to solve data problems.
Identify optimal models using evaluation techniques such as k-fold cross-validation.
Develop confidence in data preparation and modeling techniques for real-world data challenges.

Author(s)

Stephen Klosterman is a seasoned data scientist with a keen interest in practical applications of machine learning. He combines a strong academic foundation with real-world experience to craft relatable content. Stephen excels in breaking down complex topics into approachable lessons, helping learners grow their data science expertise step by step.

Who is it for?

This book is ideal for data analysts, scientists, and business professionals looking to enhance their skills in Python and data science. If you have some experience in Python and a foundational understanding of algebra and statistics, you'll find this book approachable. It offers an excellent gateway to mastering advanced data analysis techniques. Whether you're seeking to explore machine learning or apply data insights, this book supports your growth.
