Hands-On Data Analysis with Pandas

Get to grips with pandas, a versatile, high-performance Python library for data manipulation, analysis, and discovery

Key Features

  • Perform efficient data analysis and manipulation tasks using pandas
  • Apply pandas to different real-world domains with the help of step-by-step demonstrations
  • Get accustomed to using pandas as an effective data exploration tool

Book Description

Data analysis has become a necessary skill in a variety of domains where knowing how to work with data and extract insights can generate significant value.

Hands-On Data Analysis with Pandas will show you how to analyze your data, get started with machine learning, and work effectively with Python libraries often used for data science, such as pandas, NumPy, matplotlib, seaborn, and scikit-learn. Using real-world datasets, you will learn how to use the powerful pandas library to perform data wrangling to reshape, clean, and aggregate your data. Then, you will be able to conduct exploratory data analysis by calculating summary statistics and visualizing the data to find patterns. In the concluding chapters, you will explore some applications of anomaly detection, regression, clustering, and classification using scikit-learn to make predictions based on past data.

By the end of this book, you will be equipped with the skills you need to use pandas to ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

What you will learn

  • Understand how data analysts and scientists gather and analyze data
  • Perform data analysis and data wrangling using Python
  • Combine, group, and aggregate data from multiple sources
  • Create data visualizations with pandas, matplotlib, and seaborn
  • Apply machine learning (ML) algorithms to identify patterns and make predictions
  • Use Python data science libraries to analyze real-world datasets
  • Use pandas to solve common data representation and analysis problems
  • Build Python scripts, modules, and packages for reusable analysis code
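
To give a flavor of the "combine, group, and aggregate" item above, here is a minimal sketch. It is not taken from the book: the two small tables, the column names (station, temp_c, city), and the values are made up for illustration, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical data from two "sources": temperature readings and station metadata.
readings = pd.DataFrame({
    "station": ["A", "A", "B", "B", "B"],
    "temp_c": [20.5, 22.0, 15.0, None, 17.0],
})
stations = pd.DataFrame({
    "station": ["A", "B"],
    "city": ["Springfield", "Shelbyville"],
})

# Clean: drop rows where the temperature is missing.
clean = readings.dropna(subset=["temp_c"])

# Combine: attach the station metadata with a left join on the shared key.
combined = clean.merge(stations, on="station", how="left")

# Group and aggregate: mean, max, and count of readings per city.
summary = combined.groupby("city")["temp_c"].agg(["mean", "max", "count"])
print(summary)
```

Dropping the missing reading before aggregating is only one possible choice; depending on the analysis, filling it in (imputing) may be more appropriate.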

Who this book is for

This book is for data analysts, data science beginners, and Python developers who want to explore each stage of data analysis and scientific computing using a wide range of datasets. You will also find this book useful if you are a data scientist looking to use pandas in machine learning workflows. A working knowledge of the Python programming language will be beneficial.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On Data Analysis with Pandas
  3. Dedication
  4. About Packt
    1. Why subscribe?
  5. Foreword
  6. Contributors
    1. About the author
    2. About the reviewer
    3. Packt is searching for authors like you
  7. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the color images
    4. Conventions used
    5. Get in touch
      1. Reviews
  8. Section 1: Getting Started with Pandas
  9. Introduction to Data Analysis
    1. Chapter materials
    2. Fundamentals of data analysis
      1. Data collection
      2. Data wrangling
      3. Exploratory data analysis
      4. Drawing conclusions
    3. Statistical foundations
      1. Sampling
      2. Descriptive statistics
        1. Measures of central tendency
          1. Mean
          2. Median
          3. Mode
        2. Measures of spread
          1. Range
          2. Variance
          3. Standard deviation
          4. Coefficient of variation
          5. Interquartile range
          6. Quartile coefficient of dispersion
        3. Summarizing data
        4. Common distributions
        5. Scaling data
        6. Quantifying relationships between variables
        7. Pitfalls of summary statistics
      3. Prediction and forecasting
      4. Inferential statistics
    4. Setting up a virtual environment
      1. Virtual environments
        1. venv
          1. Windows
          2. Linux/macOS
        2. Anaconda
      2. Installing the required Python packages
      3. Why pandas?
      4. Jupyter Notebooks
        1. Launching JupyterLab
        2. Validating the virtual environment
        3. Closing JupyterLab
    5. Summary
    6. Exercises
    7. Further reading
  10. Working with Pandas DataFrames
    1. Chapter materials
    2. Pandas data structures
      1. Series
      2. Index
      3. DataFrame
    3. Bringing data into a pandas DataFrame
      1. From a Python object
      2. From a file
      3. From a database
      4. From an API
    4. Inspecting a DataFrame object
      1. Examining the data
      2. Describing and summarizing the data
    5. Grabbing subsets of the data
      1. Selection
      2. Slicing
      3. Indexing
      4. Filtering
    6. Adding and removing data
      1. Creating new data
      2. Deleting unwanted data
    7. Summary
    8. Exercises
    9. Further reading
  11. Section 2: Using Pandas for Data Analysis
  12. Data Wrangling with Pandas
    1. Chapter materials
    2. What is data wrangling?
      1. Data cleaning
      2. Data transformation
        1. The wide data format
        2. The long data format
      3. Data enrichment
    3. Collecting temperature data
    4. Cleaning up the data
      1. Renaming columns
      2. Type conversion
      3. Reordering, reindexing, and sorting data
    5. Restructuring the data
      1. Pivoting DataFrames
      2. Melting DataFrames
    6. Handling duplicate, missing, or invalid data
      1. Finding the problematic data
      2. Mitigating the issues
    7. Summary
    8. Exercises
    9. Further reading
  13. Aggregating Pandas DataFrames
    1. Chapter materials
    2. Database-style operations on DataFrames
      1. Querying DataFrames
      2. Merging DataFrames
    3. DataFrame operations
      1. Arithmetic and statistics
      2. Binning and thresholds
      3. Applying functions
      4. Window calculations
      5. Pipes
    4. Aggregations with pandas and numpy
      1. Summarizing DataFrames
      2. Using groupby
      3. Pivot tables and crosstabs
    5. Time series
      1. Time-based selection and filtering
      2. Shifting for lagged data
      3. Differenced data
      4. Resampling
      5. Merging
    6. Summary
    7. Exercises
    8. Further reading
  14. Visualizing Data with Pandas and Matplotlib
    1. Chapter materials
    2. An introduction to matplotlib
      1. The basics
      2. Plot components
      3. Additional options
    3. Plotting with pandas
      1. Evolution over time
      2. Relationships between variables
      3. Distributions
      4. Counts and frequencies
    4. The pandas.plotting subpackage
      1. Scatter matrices
      2. Lag plots
      3. Autocorrelation plots
      4. Bootstrap plots
    5. Summary
    6. Exercises
    7. Further reading
  15. Plotting with Seaborn and Customization Techniques
    1. Chapter materials
    2. Utilizing seaborn for advanced plotting
      1. Categorical data
      2. Correlations and heatmaps
      3. Regression plots
      4. Distributions
      5. Faceting
    3. Formatting
      1. Titles and labels
      2. Legends
      3. Formatting axes
    4. Customizing visualizations
      1. Adding reference lines
      2. Shading regions
      3. Annotations
      4. Colors
    5. Summary
    6. Exercises
    7. Further reading
  16. Section 3: Applications - Real-World Analyses Using Pandas
  17. Financial Analysis - Bitcoin and the Stock Market
    1. Chapter materials
    2. Building a Python package
      1. Package structure
      2. Overview of the stock_analysis package
    3. Data extraction with pandas
      1. The StockReader class
      2. Bitcoin historical data from HTML
      3. S&P 500 historical data from Yahoo! Finance
      4. FAANG historical data from IEX
    4. Exploratory data analysis
      1. The Visualizer class family
      2. Visualizing a stock
      3. Visualizing multiple assets
    5. Technical analysis of financial instruments
      1. The StockAnalyzer class
      2. The AssetGroupAnalyzer class
      3. Comparing assets
    6. Modeling performance
      1. The StockModeler class
      2. Time series decomposition
      3. ARIMA
      4. Linear regression with statsmodels
      5. Comparing models
    7. Summary
    8. Exercises
    9. Further reading
  18. Rule-Based Anomaly Detection
    1. Chapter materials
    2. Simulating login attempts
      1. Assumptions
      2. The login_attempt_simulator package
        1. Helper functions
        2. The LoginAttemptSimulator class
      3. Simulating from the command line
    3. Exploratory data analysis
    4. Rule-based anomaly detection
      1. Percent difference
      2. Tukey fence
      3. Z-score
      4. Evaluating performance
    5. Summary
    6. Exercises
    7. Further reading
  19. Section 4: Introduction to Machine Learning with Scikit-Learn
  20. Getting Started with Machine Learning in Python
    1. Chapter materials
    2. Learning the lingo
    3. Exploratory data analysis
      1. Red wine quality data
      2. White and red wine chemical properties data
      3. Planets and exoplanets data
    4. Preprocessing data
      1. Training and testing sets
      2. Scaling and centering data
      3. Encoding data
      4. Imputing
      5. Additional transformers
      6. Pipelines
    5. Clustering
      1. k-means
        1. Grouping planets by orbit characteristics
        2. Elbow point method for determining k
        3. Interpreting centroids and visualizing the cluster space
      2. Evaluating clustering results
    6. Regression
      1. Linear regression
        1. Predicting the length of a year on a planet
        2. Interpreting the linear regression equation
        3. Making predictions
      2. Evaluating regression results
        1. Analyzing residuals
        2. Metrics
    7. Classification
      1. Logistic regression
        1. Predicting red wine quality
        2. Determining wine type by chemical properties
      2. Evaluating classification results
        1. Confusion matrix
        2. Classification metrics
          1. Accuracy and error rate
          2. Precision and recall
          3. F score
          4. Sensitivity and specificity
        3. ROC curve
        4. Precision-recall curve
    8. Summary
    9. Exercises
    10. Further reading
  21. Making Better Predictions - Optimizing Models
    1. Chapter materials
    2. Hyperparameter tuning with grid search
    3. Feature engineering
      1. Interaction terms and polynomial features
      2. Dimensionality reduction
      3. Feature unions
      4. Feature importances
    4. Ensemble methods
      1. Random forest
      2. Gradient boosting
      3. Voting
    5. Inspecting classification prediction confidence
    6. Addressing class imbalance
      1. Under-sampling
      2. Over-sampling
    7. Regularization
    8. Summary
    9. Exercises
    10. Further reading
  22. Machine Learning Anomaly Detection
    1. Chapter materials
    2. Exploring the data
    3. Unsupervised methods
      1. Isolation forest
      2. Local outlier factor
      3. Comparing models
    4. Supervised methods
      1. Baselining
        1. Dummy classifier
        2. Naive Bayes
      2. Logistic regression
    5. Online learning
      1. Creating the PartialFitPipeline subclass
      2. Stochastic gradient descent classifier
        1. Building our initial model
        2. Evaluating the model
        3. Updating the model
        4. Presenting our results
        5. Further improvements
    6. Summary
    7. Exercises
    8. Further reading
  23. Section 5: Additional Resources
  24. The Road Ahead
    1. Data resources
      1. Python packages
        1. Seaborn
        2. Scikit-learn
      2. Searching for data
      3. APIs
      4. Websites
        1. Finance
        2. Government data
        3. Health and economy
        4. Social networks
        5. Sports
        6. Miscellaneous
    2. Practicing working with data
    3. Python practice
    4. Summary
    5. Exercises
    6. Further reading
  25. Solutions
  26. Appendix
    1. Data analysis workflow
    2. Choosing the appropriate visualization
    3. Machine learning workflow
  27. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Hands-On Data Analysis with Pandas
  • Author(s): Stefanie Molin
  • Release date: July 2019
  • Publisher(s): Packt Publishing
  • ISBN: 9781789615326