Pandas for Everyone: Python Data Analysis, First Edition

Book Description

The Hands-On, Example-Rich Introduction to Pandas Data Analysis in Python

Today, analysts must manage data characterized by extraordinary variety, velocity, and volume. With the open source Pandas library and Python, you can rapidly automate and perform virtually any data analysis task, no matter how large or complex. Pandas can help you ensure the veracity of your data, visualize it for effective decision-making, and reliably reproduce analyses across multiple datasets.

Pandas for Everyone brings together practical knowledge and insight for solving real problems with Pandas, even if you’re new to Python data analysis. Daniel Y. Chen introduces key concepts through simple but practical examples, incrementally building on them to solve more difficult, real-world problems.

Chen gives you a jumpstart on using Pandas with a realistic dataset and covers combining datasets, handling missing data, and structuring datasets for easier analysis and visualization. He demonstrates powerful data cleaning techniques, from basic string manipulation to applying functions simultaneously across dataframes.

Once your data is ready, Chen guides you through fitting models for prediction, clustering, inference, and exploration. He provides tips on performance and scalability, and introduces you to the wider Python data analysis ecosystem.

  • Work with DataFrames and Series, and import or export data
  • Create plots with matplotlib, seaborn, and pandas
  • Combine datasets and handle missing data
  • Reshape, tidy, and clean datasets so they’re easier to work with
  • Convert data types and manipulate text strings
  • Apply functions to scale data manipulations
  • Aggregate, transform, and filter large datasets with groupby
  • Leverage Pandas’ advanced date and time capabilities
  • Fit linear models using statsmodels and scikit-learn libraries
  • Use generalized linear modeling to fit models with different response variables
  • Compare multiple models to select the “best”
  • Regularize to overcome overfitting and improve performance
  • Use clustering in unsupervised machine learning

Register your product at informit.com/register for convenient access to downloads, updates, and/or corrections as they become available.

Table of Contents

  1. Cover Page
  2. About This E-Book
  3. Title Page
  4. Copyright Page
  5. Dedication Page
  6. Contents
  7. Foreword
  8. Preface
  9. Acknowledgments
  10. About the Author
  11. I Introduction
    1. 1 Pandas DataFrame Basics
      1. 1.1 Introduction
      2. 1.2 Loading Your First Data Set
      3. 1.3 Looking at Columns, Rows, and Cells
      4. 1.4 Grouped and Aggregated Calculations
      5. 1.5 Basic Plot
      6. 1.6 Conclusion
    2. 2 Pandas Data Structures
      1. 2.1 Introduction
      2. 2.2 Creating Your Own Data
      3. 2.3 The Series
      4. 2.4 The DataFrame
      5. 2.5 Making Changes to Series and DataFrames
      6. 2.6 Exporting and Importing Data
      7. 2.7 Conclusion
    3. 3 Introduction to Plotting
      1. 3.1 Introduction
      2. 3.2 Matplotlib
      3. 3.3 Statistical Graphics Using matplotlib
      4. 3.4 Seaborn
      5. 3.5 Pandas Objects
      6. 3.6 Seaborn Themes and Styles
      7. 3.7 Conclusion
  12. II Data Manipulation
    1. 4 Data Assembly
      1. 4.1 Introduction
      2. 4.2 Tidy Data
      3. 4.3 Concatenation
      4. 4.4 Merging Multiple Data Sets
      5. 4.5 Conclusion
    2. 5 Missing Data
      1. 5.1 Introduction
      2. 5.2 What Is a NaN Value?
      3. 5.3 Where Do Missing Values Come From?
      4. 5.4 Working With Missing Data
      5. 5.5 Conclusion
    3. 6 Tidy Data
      1. 6.1 Introduction
      2. 6.2 Columns Contain Values, Not Variables
      3. 6.3 Columns Contain Multiple Variables
      4. 6.4 Variables in Both Rows and Columns
      5. 6.5 Multiple Observational Units in a Table (Normalization)
      6. 6.6 Observational Units Across Multiple Tables
      7. 6.7 Conclusion
  13. III Data Munging
    1. 7 Data Types
      1. 7.1 Introduction
      2. 7.2 Data Types
      3. 7.3 Converting Types
      4. 7.4 Categorical Data
      5. 7.5 Conclusion
    2. 8 Strings and Text Data
      1. 8.1 Introduction
      2. 8.2 Strings
      3. 8.3 String Methods
      4. 8.4 More String Methods
      5. 8.5 String Formatting
      6. 8.6 Regular Expressions (RegEx)
      7. 8.7 The regex Library
      8. 8.8 Conclusion
    3. 9 Apply
      1. 9.1 Introduction
      2. 9.2 Functions
      3. 9.3 Apply (Basics)
      4. 9.4 Apply (More Advanced)
      5. 9.5 Vectorized Functions
      6. 9.6 Lambda Functions
      7. 9.7 Conclusion
    4. 10 Groupby Operations: Split–Apply–Combine
      1. 10.1 Introduction
      2. 10.2 Aggregate
      3. 10.3 Transform
      4. 10.4 Filter
      5. 10.5 The pandas.core.groupby.DataFrameGroupBy Object
      6. 10.6 Working With a MultiIndex
      7. 10.7 Conclusion
    5. 11 The datetime Data Type
      1. 11.1 Introduction
      2. 11.2 Python’s datetime Object
      3. 11.3 Converting to datetime
      4. 11.4 Loading Data That Include Dates
      5. 11.5 Extracting Date Components
      6. 11.6 Date Calculations and Timedeltas
      7. 11.7 Datetime Methods
      8. 11.8 Getting Stock Data
      9. 11.9 Subsetting Data Based on Dates
      10. 11.10 Date Ranges
      11. 11.11 Shifting Values
      12. 11.12 Resampling
      13. 11.13 Time Zones
      14. 11.14 Conclusion
  14. IV Data Modeling
    1. 12 Linear Models
      1. 12.1 Introduction
      2. 12.2 Simple Linear Regression
      3. 12.3 Multiple Regression
      4. 12.4 Keeping Index Labels From sklearn
      5. 12.5 Conclusion
    2. 13 Generalized Linear Models
      1. 13.1 Introduction
      2. 13.2 Logistic Regression
      3. 13.3 Poisson Regression
      4. 13.4 More Generalized Linear Models
      5. 13.5 Survival Analysis
      6. 13.6 Conclusion
    3. 14 Model Diagnostics
      1. 14.1 Introduction
      2. 14.2 Residuals
      3. 14.3 Comparing Multiple Models
      4. 14.4 k-Fold Cross-Validation
      5. 14.5 Conclusion
    4. 15 Regularization
      1. 15.1 Introduction
      2. 15.2 Why Regularize?
      3. 15.3 LASSO Regression
      4. 15.4 Ridge Regression
      5. 15.5 Elastic Net
      6. 15.6 Cross-Validation
      7. 15.7 Conclusion
    5. 16 Clustering
      1. 16.1 Introduction
      2. 16.2 k-Means
      3. 16.3 Hierarchical Clustering
      4. 16.4 Conclusion
  15. V Conclusion
    1. 17 Life Outside of Pandas
      1. 17.1 The (Scientific) Computing Stack
      2. 17.2 Performance
      3. 17.3 Going Bigger and Faster
    2. 18 Toward a Self-Directed Learner
      1. 18.1 It’s Dangerous to Go Alone!
      2. 18.2 Local Meetups
      3. 18.3 Conferences
      4. 18.4 The Internet
      5. 18.5 Podcasts
      6. 18.6 Conclusion
  16. VI Appendixes
    1. A Installation
      1. A.1 Installing Anaconda
      2. A.2 Uninstall Anaconda
    2. B Command Line
      1. B.1 Installation
      2. B.2 Basics
    3. C Project Templates
    4. D Using Python
      1. D.1 Command Line and Text Editor
      2. D.2 Python and IPython
      3. D.3 Jupyter
      4. D.4 Integrated Development Environments (IDEs)
    5. E Working Directories
    6. F Environments
    7. G Install Packages
      1. G.1 Updating Packages
    8. H Importing Libraries
    9. I Lists
    10. J Tuples
    11. K Dictionaries
    12. L Slicing Values
    13. M Loops
    14. N Comprehensions
    15. O Functions
      1. O.1 Default Parameters
      2. O.2 Arbitrary Parameters
    16. P Ranges and Generators
    17. Q Multiple Assignment
    18. R numpy ndarray
    19. S Classes
    20. T Odo: The Shapeshifter
  17. Index
  18. Code Snippets