Hands-On Exploratory Data Analysis with Python

Book description

Discover techniques to summarize the characteristics of your data using PyPlot, NumPy, SciPy, and pandas

Key Features

  • Understand the fundamental concepts of exploratory data analysis using Python
  • Find missing values in your data and identify the correlation between different variables
  • Practice graphical exploratory analysis techniques using Matplotlib and the Seaborn Python package
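
As a rough illustration of the second feature above (finding missing values and checking the correlation between variables), here is a minimal pandas sketch. It is not taken from the book; the toy data and column names are assumptions made up for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, np.nan, 180.0, 190.0],
    "weight_kg": [50.0, 58.0, 66.0, np.nan, 85.0],
})

# Count the missing values in each column.
missing_per_column = df.isna().sum()

# Pearson correlation between the two columns; pandas ignores
# rows in which either value is missing.
correlation = df["height_cm"].corr(df["weight_kg"])

print(missing_per_column)
print(round(correlation, 3))
```

Here each column has one missing entry, so the correlation is computed from only the three complete rows; on real data, the number of rows silently dropped this way is worth checking before trusting the coefficient.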

Book Description

Exploratory Data Analysis (EDA) is an approach to data analysis that involves the application of diverse techniques to gain insights into a dataset. This book will help you gain practical knowledge of the main pillars of EDA: data cleaning, data preparation, data exploration, and data visualization.

You'll start by performing EDA on open source datasets, moving from simple to advanced analyses that turn data into meaningful insights. You'll then learn descriptive statistical techniques for describing the basic characteristics of data and progress to performing EDA on time-series data. As you advance, you'll learn how to apply EDA techniques to model development and evaluation, and how to build predictive models and visualize their results. Using Python for data analysis, you'll work with real-world datasets, understand the data, summarize its characteristics, and visualize it for business intelligence.

By the end of this EDA book, you'll have developed the skills required to carry out a preliminary investigation of any dataset, draw insights from the data, present your results with visual aids, and build a model that correctly predicts future outcomes.

What you will learn

  • Import, clean, and explore data to perform preliminary analysis using powerful Python packages
  • Identify and transform erroneous data using different data wrangling techniques
  • Explore the use of multiple regression to describe non-linear relationships
  • Discover hypothesis testing and explore techniques of time-series analysis
  • Understand and interpret results obtained from graphical analysis
  • Build, train, and optimize predictive models to estimate results
  • Perform complex EDA techniques on open source datasets

Who this book is for

This EDA book is for anyone interested in data analysis, especially students, statisticians, data analysts, and data scientists. The practical concepts presented in this book can be applied in various disciplines to enhance decision-making processes with data analysis and synthesis. Fundamental knowledge of Python programming and statistical concepts is all you need to get started with this book.

Table of contents

  1. Title Page
  2. Copyright and Credits
    1. Hands-On Exploratory Data Analysis with Python
  3. About Packt
    1. Why subscribe?
  4. Contributors
    1. About the authors
    2. About the reviewer
    3. Packt is searching for authors like you
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
      1. Download the example code files
      2. Download the color images
      3. Conventions used
    4. Get in touch
      1. Reviews
  6. Section 1: The Fundamentals of EDA
  7. Exploratory Data Analysis Fundamentals
    1. Understanding data science
    2. The significance of EDA
      1. Steps in EDA
    3. Making sense of data
      1. Numerical data
        1. Discrete data
        2. Continuous data
      2. Categorical data
      3. Measurement scales
        1. Nominal
        2. Ordinal
        3. Interval
        4. Ratio
    4. Comparing EDA with classical and Bayesian analysis
    5. Software tools available for EDA
    6. Getting started with EDA
      1. NumPy
      2. Pandas
      3. SciPy
      4. Matplotlib
    7. Summary
    8. Further reading
  8. Visual Aids for EDA
    1. Technical requirements
    2. Line chart
      1. Steps involved
    3. Bar charts
    4. Scatter plot
      1. Bubble chart
      2. Scatter plot using seaborn
    5. Area plot and stacked plot
    6. Pie chart
    7. Table chart
    8. Polar chart
    9. Histogram
    10. Lollipop chart
    11. Choosing the best chart
    12. Other libraries to explore
    13. Summary
    14. Further reading
  9. EDA with Personal Email
    1. Technical requirements
    2. Loading the dataset
    3. Data transformation
      1. Data cleansing
      2. Loading the CSV file
      3. Converting the date
      4. Removing NaN values
      5. Applying descriptive statistics
      6. Data refactoring
      7. Dropping columns
      8. Refactoring timezones
    4. Data analysis
      1. Number of emails
      2. Time of day
      3. Average emails per day and hour
      4. Number of emails per day
      5. Most frequently used words
    5. Summary
    6. Further reading
  10. Data Transformation
    1. Technical requirements
    2. Background
    3. Merging database-style dataframes
      1. Concatenating along with an axis
      2. Using df.merge with an inner join
      3. Using the pd.merge() method with a left join
      4. Using the pd.merge() method with a right join
      5. Using pd.merge() methods with outer join
      6. Merging on index
      7. Reshaping and pivoting
    4. Transformation techniques
      1. Performing data deduplication
      2. Replacing values
      3. Handling missing data
        1. NaN values in pandas objects
        2. Dropping missing values
          1. Dropping by rows
          2. Dropping by columns
        3. Mathematical operations with NaN
        4. Filling missing values
        5. Backward and forward filling
        6. Interpolating missing values
      4. Renaming axis indexes
      5. Discretization and binning
      6. Outlier detection and filtering
      7. Permutation and random sampling
        1. Random sampling without replacement
        2. Random sampling with replacement
      8. Computing indicators/dummy variables
      9. String manipulation
    5. Benefits of data transformation
      1. Challenges
    6. Summary
    7. Further reading
  11. Section 2: Descriptive Statistics
  12. Descriptive Statistics
    1. Technical requirements
    2. Understanding statistics
      1. Distribution function
        1. Uniform distribution
        2. Normal distribution
        3. Exponential distribution
        4. Binomial distribution
      2. Cumulative distribution function
      3. Descriptive statistics
    3. Measures of central tendency
      1. Mean/average
      2. Median
      3. Mode
    4. Measures of dispersion
      1. Standard deviation
      2. Variance
      3. Skewness
      4. Kurtosis
        1. Types of kurtosis
      5. Calculating percentiles
      6. Quartiles
        1. Visualizing quartiles
    5. Summary
    6. Further reading
  13. Grouping Datasets
    1. Technical requirements
    2. Understanding groupby()
    3. Groupby mechanics
      1. Selecting a subset of columns
      2. Max and min
      3. Mean
    4. Data aggregation
      1. Group-wise operations
        1. Renaming grouped aggregation columns
      2. Group-wise transformations
    5. Pivot tables and cross-tabulations
      1. Pivot tables
      2. Cross-tabulations
    6. Summary
    7. Further reading
  14. Correlation
    1. Technical requirements
    2. Introducing correlation
    3. Types of analysis
      1. Understanding univariate analysis
      2. Understanding bivariate analysis
      3. Understanding multivariate analysis
    4. Discussing multivariate analysis using the Titanic dataset
    5. Outlining Simpson's paradox
    6. Correlation does not imply causation
    7. Summary
    8. Further reading
  15. Time Series Analysis
    1. Technical requirements
    2. Understanding the time series dataset
      1. Fundamentals of TSA
        1. Univariate time series
      2. Characteristics of time series data
    3. TSA with Open Power System Data
      1. Data cleaning
      2. Time-based indexing
      3. Visualizing time series
      4. Grouping time series data
      5. Resampling time series data
    4. Summary
    5. Further reading
  16. Section 3: Model Development and Evaluation
  17. Hypothesis Testing and Regression
    1. Technical requirements
    2. Hypothesis testing
      1. Hypothesis testing principle
      2. statsmodels library
      3. Average reading time
      4. Types of hypothesis testing
      5. T-test
    3. p-hacking
    4. Understanding regression
      1. Types of regression
        1. Simple linear regression
        2. Multiple linear regression
        3. Nonlinear regression
    5. Model development and evaluation
      1. Constructing a linear regression model
        1. Model evaluation
        2. Computing accuracy
        3. Understanding accuracy
      2. Implementing a multiple linear regression model
    6. Summary
    7. Further reading
  18. Model Development and Evaluation
    1. Technical requirements
    2. Types of machine learning
    3. Understanding supervised learning
      1. Regression
      2. Classification
    4. Understanding unsupervised learning
      1. Applications of unsupervised learning
      2. Clustering using MiniBatch K-means clustering
        1. Extracting keywords
        2. Plotting clusters
        3. Word cloud
    5. Understanding reinforcement learning
      1. Difference between supervised and reinforcement learning
      2. Applications of reinforcement learning
    6. Unified machine learning workflow
      1. Data preprocessing
        1. Data collection
        2. Data analysis
        3. Data cleaning, normalization, and transformation
      2. Data preparation
      3. Training sets and corpus creation
      4. Model creation and training
      5. Model evaluation
      6. Best model selection and evaluation
      7. Model deployment
    7. Summary
    8. Further reading
  19. EDA on Wine Quality Data Analysis
    1. Technical requirements
    2. Disclosing the wine quality dataset
      1. Loading the dataset
      2. Descriptive statistics
      3. Data wrangling
    3. Analyzing red wine
      1. Finding correlated columns
      2. Alcohol versus quality
      3. Alcohol versus pH
    4. Analyzing white wine
      1. Red wine versus white wine
      2. Adding a new attribute
      3. Converting into a categorical column
      4. Concatenating dataframes
      5. Grouping columns
      6. Univariate analysis
      7. Multivariate analysis on the combined dataframe
      8. Discrete categorical attributes
      9. 3-D visualization
    5. Model development and evaluation
    6. Summary
    7. Further reading
  20. Appendix
    1. String manipulation
      1. Creating strings
      2. Accessing characters in Python
      3. String slicing
      4. Deleting/updating from a string
      5. Escape sequencing in Python
      6. Formatting strings
    2. Using pandas vectorized string functions
      1. Using string functions with a pandas DataFrame
    3. Using regular expressions
    4. Further reading
  21. Other Books You May Enjoy
    1. Leave a review - let other readers know what you think

Product information

  • Title: Hands-On Exploratory Data Analysis with Python
  • Author(s): Suresh Kumar Mukhiya, Usman Ahmed
  • Release date: March 2020
  • Publisher(s): Packt Publishing
  • ISBN: 9781789537253