Practical Data Science with Python

Learn to effectively manage data and execute data science projects from start to finish using Python

Key Features

  • Understand and utilize data science tools in Python, such as specialized machine learning algorithms and statistical modeling
  • Build a strong data science foundation with the best data science tools available in Python
  • Add value to yourself, your organization, and society by extracting actionable insights from raw data

Book Description

Practical Data Science with Python teaches you core data science concepts through real-world and realistic examples. It strengthens your grasp of both basic and advanced principles of data preparation and storage, statistics, probability theory, machine learning, and Python programming, giving you a solid foundation for building proficiency in data science.

The book starts with an overview of basic Python skills and then introduces foundational data science techniques, followed by a thorough explanation of the Python code needed to execute the techniques. You'll understand the code by working through the examples. The code has been broken down into small chunks (a few lines or a function at a time) to enable thorough discussion.

As you progress, you will learn how to perform data analysis while exploring the functionalities of key data science Python packages, including pandas, SciPy, and scikit-learn. Finally, the book covers ethics and privacy concerns in data science and suggests resources for improving data science skills, as well as ways to stay up to date on new data science developments.

By the end of the book, you should be able to comfortably use Python for basic data science projects and should have the skills to execute the data science process on any data source.

What you will learn

  • Use Python data science packages effectively
  • Clean and prepare data for data science work, including feature engineering and feature selection
  • Model data with classic statistical methods (such as t-tests) and essential machine learning algorithms (such as random forests and boosted models)
  • Evaluate model performance
  • Compare and understand different machine learning methods
  • Interact with Excel spreadsheets through Python
  • Create automated data science reports through Python
  • Get to grips with text analytics techniques

Who this book is for

The book is intended for beginners, including students starting or about to start a data science, analytics, or related program (e.g., a bachelor's or master's degree, a bootcamp, or online courses), recent college graduates who want to learn new skills that set them apart in the job market, professionals who want to learn hands-on data science techniques in Python, and those who want to shift their careers to data science.

The book requires basic familiarity with Python. A "getting started with Python" section has been included to get complete novices up to speed.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. Part I - An Introduction and the Basics
  3. Introduction to Data Science
    1. The data science origin story
    2. The top data science tools and skills
      1. Python
      2. Other programming languages
      3. GUIs and platforms
      4. Cloud tools
      5. Statistical methods and math
      6. Collecting, organizing, and preparing data
      7. Software development
      8. Business understanding and communication
    3. Specializations in and around data science
      1. Machine learning
        1. Business intelligence
        2. Deep learning
        3. Data engineering
        4. Big data
        5. Statistical methods
      2. Natural Language Processing (NLP)
      3. Artificial Intelligence (AI)
      4. Choosing how to specialize
    4. Data science project methodologies
      1. Using data science in other fields
      2. CRISP-DM
      3. TDSP
        1. Further reading on data science project management strategies
      4. Other tools
    5. Test your knowledge
    6. Summary
  4. Getting Started with Python
    1. Installing Python with Anaconda and getting started
      1. Installing Anaconda
      2. Running Python code
        1. The Python shell
        2. The IPython shell
        3. Jupyter
        4. Why the command line?
        5. Command line basics
      3. Installing and using a code text editor – VS Code
        1. Editing Python code with VS Code
        2. Running a Python file
      4. Installing Python packages and creating virtual environments
    2. Python basics
      1. Numbers
      2. Strings
      3. Variables
      4. Lists, tuples, sets, and dictionaries
        1. Lists
        2. Tuples
        3. Sets
        4. Dictionaries
      5. Loops and comprehensions
      6. Booleans and conditionals
      7. Packages and modules
      8. Functions
      9. Classes
      10. Multithreading and multiprocessing
    3. Software engineering best practices
      1. Debugging errors and utilizing documentation
        1. Debugging
        2. Documentation
      2. Version control with Git
      3. Code style
      4. Productivity tips
    4. Test your knowledge
    5. Summary
  5. Part II - Dealing with Data
  6. SQL and Built-in File Handling Modules in Python
    1. Introduction
    2. Loading, reading, and writing files with base Python
      1. Opening a file and reading its contents
      2. Using the built-in JSON module
      3. Saving credentials or data in a Python file
      4. Saving Python objects with pickle
    3. Using SQLite and SQL
      1. Creating a SQLite database and storing data
    4. Using the SQLAlchemy package in Python
    5. Test your knowledge
    6. Summary
  7. Loading and Wrangling Data with Pandas and NumPy
    1. Data wrangling and analyzing iTunes data
      1. Loading and saving data with pandas
        1. Understanding the DataFrame structure and combining/concatenating multiple DataFrames
      2. Exploratory Data Analysis (EDA) and basic data cleaning with pandas
        1. Examining the top and bottom of the data
        2. Examining the data's dimensions, datatypes, and missing values
        3. Investigating statistical properties of the data
        4. Plotting with DataFrames
      3. Cleaning data
      4. Filtering DataFrames
        1. Removing irrelevant data
        2. Dealing with missing values
        3. Dealing with outliers
        4. Dealing with duplicate values
        5. Ensuring datatypes are correct
        6. Standardizing data formats
      5. Data transformations
      6. Using replace, map, and apply to clean and transform data
      7. Using GroupBy
      8. Writing DataFrames to disk
      9. Wrangling and analyzing Bitcoin price data
    2. Understanding NumPy basics
    3. Using NumPy mathematical functions
    4. Test your knowledge
    5. Summary
  8. Exploratory Data Analysis and Visualization
    1. EDA and visualization libraries in Python
    2. Performing EDA with Seaborn and pandas
      1. Making boxplots and letter-value plots
      2. Making histograms and violin plots
      3. Making scatter plots with Matplotlib and Seaborn
      4. Examining correlations and making correlograms
      5. Making missing value plots
    3. Using EDA Python packages
    4. Using visualization best practices
      1. Saving plots for sharing and reports
    5. Making plots with Plotly
    6. Test your knowledge
    7. Summary
  9. Data Wrangling Documents and Spreadsheets
    1. Parsing and processing Word and PDF documents
      1. Reading text from Word documents
        1. Extracting insights from Word documents: common words and phrases
        2. Analyzing words and phrases from the text
      2. Reading text from PDFs
    2. Reading and writing data with Excel files
      1. Using pandas for wrangling Excel files
        1. Analyzing the data
      2. Using openpyxl for wrangling Excel files
    3. Test your knowledge
    4. Summary
  10. Web Scraping
    1. Understanding the structure of the internet
      1. GET and POST requests, and HTML
    2. Performing simple web scraping
      1. Using urllib
      2. Using the requests package
      3. Scraping several files
        1. Extracting the data from the scraped files
    3. Parsing HTML from scraped pages
      1. Using XPath, lxml, and bs4 to extract data from webpages
        1. Collecting data from several pages
    4. Using APIs to collect data
      1. Using API wrappers
    5. The ethics and legality of web scraping
    6. Test your knowledge
    7. Summary
  11. Part III - Statistics for Data Science
  12. Probability, Distributions, and Sampling
    1. Probability basics
      1. Independent and conditional probabilities
      2. Bayes' Theorem
      3. Frequentist versus Bayesian
    2. Distributions
      1. The normal distribution and using scipy to generate distributions
      2. Descriptive statistics of distributions
        1. Variants of the normal distribution
      3. Fitting distributions to data to get parameters
      4. The Student's t-distribution
      5. The Bernoulli distribution
      6. The binomial distribution
      7. The uniform distribution
      8. The exponential and Poisson distributions
      9. The Weibull distribution
      10. The Zipfian distribution
    3. Sampling from data
      1. The law of large numbers
      2. The central limit theorem
      3. Random sampling
      4. Bootstrap sampling and confidence intervals
    4. Test your knowledge
    5. Summary
  13. Statistical Testing for Data Science
    1. Statistical testing basics and sample comparison tests
      1. The t-test and z-test
        1. One-sample, two-sided t-test
        2. The z-test
        3. One-sided tests
        4. Two-sample t- and z-tests: A/B testing
        5. Paired t- and z-tests
        6. Other A/B testing methods
        7. Testing between several groups with ANOVA
        8. Post-hoc ANOVA tests
        9. Assumptions for these methods
    2. Other statistical tests
      1. Testing if data belongs to a distribution
      2. Generalized ESD outlier test
      3. The Pearson correlation test
    3. Test your knowledge
    4. Summary
  14. Part IV - Machine Learning
  15. Preparing Data for Machine Learning: Feature Selection, Feature Engineering, and Dimensionality Reduction
    1. Types of machine learning
    2. Feature selection
      1. The curse of dimensionality
      2. Overfitting and underfitting, and the bias-variance trade-off
      3. Methods for feature selection
      4. Variance thresholding – removing features with too much and too little variance
      5. Univariate statistics feature selection
        1. Correlation
      6. Mutual information score and chi-squared
        1. The chi-squared test
        2. ANOVA
      7. Using the univariate statistics for feature selection
    3. Feature engineering
      1. Data cleaning and preparation
        1. Converting strings to dates
        2. Outlier cleaning strategies
      2. Combining multiple columns
      3. Transforming numeric data
        1. Standardization
        2. Making data more Gaussian with the Yeo-Johnson transform
      4. Extracting datetime features
      5. Binning
      6. One-hot encoding and label encoding
      7. Simplifying categorical columns
        1. One-hot encoding
    4. Dimensionality reduction
      1. Principal Component Analysis (PCA)
    5. Test your knowledge
    6. Summary
  16. Machine Learning for Classification
    1. Machine learning classification algorithms
      1. Logistic regression for binary classification
        1. Getting predictions from our model
      2. How logistic regression works
        1. Odds ratio and the logit
        2. Examining feature importances with sklearn
        3. Using statsmodels for logistic regression
        4. Maximum likelihood estimation, optimizers, and the logistic regression algorithm
        5. Regularization
        6. Hyperparameters and cross-validation
        7. Logistic regression (and other models) with big data
      3. Naïve Bayes for binary classification
      4. k-nearest neighbors (KNN)
      5. Multiclass classification
        1. Logistic regression
        2. One-versus-rest and one-versus-one formulations
        3. Multi-label classification
      6. Choosing a model to use
        1. The "no free lunch" theorem
        2. Computational complexity of models
    2. Test your knowledge
    3. Summary
  17. Evaluating Machine Learning Classification Models and Sampling for Classification
    1. Evaluating classification algorithm performance with metrics
      1. Train-validation-test splits
      2. Accuracy
      3. Cohen's Kappa
      4. Confusion matrix
      5. Precision, recall, and F1 score
      6. AUC score and the ROC curve
      7. Choosing the optimal cutoff threshold
    2. Sampling and balancing classification data
      1. Downsampling
      2. Oversampling
      3. SMOTE and other synthetic sampling methods
    3. Test your knowledge
    4. Summary
  18. Machine Learning with Regression
    1. Linear regression
      1. Linear regression with sklearn
      2. Linear regression with statsmodels
      3. Regularized linear regression
      4. Regression with KNN in sklearn
      5. Evaluating regression models
        1. R2 or the coefficient of determination
        2. Adjusted R2
        3. Information criteria
        4. Mean squared error
        5. Mean absolute error
      6. Linear regression assumptions
    2. Regression models on big data
    3. Forecasting
    4. Test your knowledge
    5. Summary
  19. Optimizing Models and Using AutoML
    1. Hyperparameter optimization with search methods
      1. Using grid search
      2. Using random search
      3. Using Bayesian search
      4. Other advanced search methods
    2. Using learning curves
    3. Optimizing the number of features with ML models
    4. Using AutoML with PyCaret
      1. The no free lunch theorem
      2. AutoML solutions
      3. Using PyCaret
    5. Test your knowledge
    6. Summary
  20. Tree-Based Machine Learning Models
    1. Decision trees
      1. Random forests
      2. Random forests with sklearn
      3. Random forests with H2O
    2. Feature importance from tree-based methods
      1. Using H2O for feature importance
      2. Using sklearn random forest feature importances
    3. Boosted trees: AdaBoost, XGBoost, LightGBM, and CatBoost
      1. AdaBoost
      2. XGBoost
        1. XGBoost with PyCaret
        2. XGBoost with the xgboost package
        3. Training boosted models on a GPU
      3. LightGBM
        1. LightGBM plotting
        2. Using LightGBM directly
      4. CatBoost
        1. Using CatBoost natively
      5. Using early stopping with boosting algorithms
    4. Test your knowledge
    5. Summary
  21. Support Vector Machine (SVM) Machine Learning Models
    1. How SVMs work
      1. SVMs for classification
      2. SVMs for regression
    2. Using SVMs
      1. Using SVMs in sklearn
      2. Tuning SVMs with PyCaret
    3. Test your knowledge
    4. Summary
  22. Part V - Text Analysis and Reporting
  23. Clustering with Machine Learning
    1. Using k-means clustering
      1. Clustering metrics
      2. Optimizing k in k-means
        1. Examining the clusters
    2. Hierarchical clustering
    3. DBSCAN
    4. Other unsupervised methods
    5. Test your knowledge
    6. Summary
  24. Working with Text
    1. Text preprocessing
      1. Basic text cleaning
      2. Stemming and Lemmatizing
      3. Preparing text with spaCy
      4. Word vectors
      5. TFIDF vectors
    2. Basic text analysis
      1. Word frequency plots
      2. Wordclouds
      3. Zipf's law
      4. Word collocations
      5. Parts of speech
    3. Unsupervised learning
      1. Topic modeling
      2. Topic modeling with PyCaret
      3. Topic modeling with Top2Vec
    4. Supervised learning
      1. Classification
    5. Sentiment analysis
    6. Test your knowledge
    7. Summary
  25. Part VI - Wrapping Up
  26. Data Storytelling and Automated Reporting/Dashboarding
    1. Data storytelling
      1. Data storytelling example
    2. Automated reporting and dashboarding
      1. Automated reporting options
      2. Automated dashboarding
      3. Scheduling tasks to run automatically
    3. Test your knowledge
    4. Summary
  27. Ethics and Privacy
    1. The ethics of machine learning algorithms
      1. Bias
      2. How to decrease ML biases
      3. Carefully evaluating performance and consequences
    2. Data privacy
    3. Data privacy regulations and laws
    4. k-anonymity, l-diversity, and t-closeness
    5. Differential privacy
    6. Using data science for the common good
    7. Other ethical considerations
    8. Test your knowledge
    9. Summary
  28. Staying Up to Date and the Future of Data Science
    1. Blogs, newsletters, books, and academic sources
      1. Blogs
      2. Newsletters
      3. Books
      4. Academic sources
    2. Data science competition websites
    3. Online learning platforms
    4. Cloud services
    5. Other places to keep an eye on
    6. Strategies for staying up to date
    7. Other data science topics we didn't cover
    8. The future of data science
    9. Summary
  29. Other Books You May Enjoy
  30. Index

Product information

  • Title: Practical Data Science with Python
  • Author(s): Nathan George
  • Release date: September 2021
  • Publisher(s): Packt Publishing
  • ISBN: 9781801071970