Python Data Science Handbook, 2nd Edition

Book description

Python is a first-class tool for many researchers, primarily because of its libraries for storing, manipulating, and gaining insight from data. Several resources exist for individual pieces of this data science stack, but only with the new edition of Python Data Science Handbook do you get them all—IPython, NumPy, pandas, Matplotlib, Scikit-Learn, and other related tools.

Working scientists and data crunchers familiar with reading and writing Python code will find the second edition of this comprehensive desk reference ideal for tackling day-to-day issues: manipulating, transforming, and cleaning data; visualizing different types of data; and using data to build statistical or machine learning models. Quite simply, this is the must-have reference for scientific computing in Python.

With this handbook, you'll learn how:

  • IPython and Jupyter provide computational environments for scientists using Python
  • NumPy includes the ndarray for efficient storage and manipulation of dense data arrays
  • Pandas contains the DataFrame for efficient storage and manipulation of labeled/columnar data
  • Matplotlib includes capabilities for a flexible range of data visualizations
  • Scikit-Learn helps you build efficient and clean Python implementations of the most important and established machine learning algorithms

Table of contents

  1. Preface
    1. What Is Data Science?
    2. Who Is This Book For?
    3. Why Python?
    4. Outline of the Book
    5. Installation Considerations
    6. Conventions Used in This Book
    7. Using Code Examples
    8. O’Reilly Online Learning
    9. How to Contact Us
  2. I. Jupyter: Beyond Normal Python
  3. 1. Getting Started in IPython and Jupyter
    1. Launching the IPython Shell
    2. Launching the Jupyter Notebook
    3. Help and Documentation in IPython
      1. Accessing Documentation with ?
      2. Accessing Source Code with ??
      3. Exploring Modules with Tab Completion
    4. Keyboard Shortcuts in the IPython Shell
      1. Navigation Shortcuts
      2. Text Entry Shortcuts
      3. Command History Shortcuts
      4. Miscellaneous Shortcuts
  4. 2. Enhanced Interactive Features
    1. IPython Magic Commands
      1. Running External Code: %run
      2. Timing Code Execution: %timeit
      3. Help on Magic Functions: ?, %magic, and %lsmagic
    2. Input and Output History
      1. IPython’s In and Out Objects
      2. Underscore Shortcuts and Previous Outputs
      3. Suppressing Output
      4. Related Magic Commands
    3. IPython and Shell Commands
      1. Quick Introduction to the Shell
      2. Shell Commands in IPython
      3. Passing Values to and from the Shell
      4. Shell-Related Magic Commands
  5. 3. Debugging and Profiling
    1. Errors and Debugging
      1. Controlling Exceptions: %xmode
      2. Debugging: When Reading Tracebacks Is Not Enough
    2. Profiling and Timing Code
      1. Timing Code Snippets: %timeit and %time
      2. Profiling Full Scripts: %prun
      3. Line-by-Line Profiling with %lprun
      4. Profiling Memory Use: %memit and %mprun
    3. More IPython Resources
      1. Web Resources
      2. Books
  6. II. Introduction to NumPy
  7. 4. Understanding Data Types in Python
    1. A Python Integer Is More Than Just an Integer
    2. A Python List Is More Than Just a List
    3. Fixed-Type Arrays in Python
    4. Creating Arrays from Python Lists
    5. Creating Arrays from Scratch
    6. NumPy Standard Data Types
  8. 5. The Basics of NumPy Arrays
    1. NumPy Array Attributes
    2. Array Indexing: Accessing Single Elements
    3. Array Slicing: Accessing Subarrays
      1. One-Dimensional Subarrays
      2. Multidimensional Subarrays
      3. Subarrays as No-Copy Views
      4. Creating Copies of Arrays
    4. Reshaping of Arrays
    5. Array Concatenation and Splitting
      1. Concatenation of Arrays
      2. Splitting of Arrays
  9. 6. Computation on NumPy Arrays: Universal Functions
    1. The Slowness of Loops
    2. Introducing Ufuncs
    3. Exploring NumPy’s Ufuncs
      1. Array Arithmetic
      2. Absolute Value
      3. Trigonometric Functions
      4. Exponents and Logarithms
      5. Specialized Ufuncs
    4. Advanced Ufunc Features
      1. Specifying Output
      2. Aggregations
      3. Outer Products
    5. Ufuncs: Learning More
  10. 7. Aggregations: min, max, and Everything in Between
    1. Summing the Values in an Array
    2. Minimum and Maximum
      1. Multidimensional Aggregates
      2. Other Aggregation Functions
    3. Example: What Is the Average Height of US Presidents?
  11. 8. Computation on Arrays: Broadcasting
    1. Introducing Broadcasting
    2. Rules of Broadcasting
      1. Broadcasting Example 1
      2. Broadcasting Example 2
      3. Broadcasting Example 3
    3. Broadcasting in Practice
      1. Centering an Array
      2. Plotting a Two-Dimensional Function
  12. 9. Comparisons, Masks, and Boolean Logic
    1. Example: Counting Rainy Days
    2. Comparison Operators as Ufuncs
    3. Working with Boolean Arrays
      1. Counting Entries
      2. Boolean Operators
    4. Boolean Arrays as Masks
    5. Using the Keywords and/or Versus the Operators &/|
  13. 10. Fancy Indexing
    1. Exploring Fancy Indexing
    2. Combined Indexing
    3. Example: Selecting Random Points
    4. Modifying Values with Fancy Indexing
    5. Example: Binning Data
  14. 11. Sorting Arrays
    1. Fast Sorting in NumPy: np.sort and np.argsort
    2. Sorting Along Rows or Columns
    3. Partial Sorts: Partitioning
    4. Example: k-Nearest Neighbors
  15. 12. Structured Data: NumPy’s Structured Arrays
    1. Exploring Structured Array Creation
    2. More Advanced Compound Types
    3. Record Arrays: Structured Arrays with a Twist
    4. On to Pandas
  16. III. Data Manipulation with Pandas
  17. 13. Introducing Pandas Objects
    1. The Pandas Series Object
      1. Series as Generalized NumPy Array
      2. Series as Specialized Dictionary
      3. Constructing Series Objects
    2. The Pandas DataFrame Object
      1. DataFrame as Generalized NumPy Array
      2. DataFrame as Specialized Dictionary
      3. Constructing DataFrame Objects
    3. The Pandas Index Object
      1. Index as Immutable Array
      2. Index as Ordered Set
  18. 14. Data Indexing and Selection
    1. Data Selection in Series
      1. Series as Dictionary
      2. Series as One-Dimensional Array
      3. Indexers: loc and iloc
    2. Data Selection in DataFrames
      1. DataFrame as Dictionary
      2. DataFrame as Two-Dimensional Array
      3. Additional Indexing Conventions
  19. 15. Operating on Data in Pandas
    1. Ufuncs: Index Preservation
    2. Ufuncs: Index Alignment
      1. Index Alignment in Series
      2. Index Alignment in DataFrames
    3. Ufuncs: Operations Between DataFrames and Series
  20. 16. Handling Missing Data
    1. Trade-offs in Missing Data Conventions
    2. Missing Data in Pandas
      1. None as a Sentinel Value
      2. NaN: Missing Numerical Data
      3. NaN and None in Pandas
    3. Pandas Nullable Dtypes
    4. Operating on Null Values
      1. Detecting Null Values
      2. Dropping Null Values
      3. Filling Null Values
  21. 17. Hierarchical Indexing
    1. A Multiply Indexed Series
      1. The Bad Way
      2. The Better Way: The Pandas MultiIndex
      3. MultiIndex as Extra Dimension
    2. Methods of MultiIndex Creation
      1. Explicit MultiIndex Constructors
      2. MultiIndex Level Names
      3. MultiIndex for Columns
    3. Indexing and Slicing a MultiIndex
      1. Multiply Indexed Series
      2. Multiply Indexed DataFrames
    4. Rearranging Multi-Indexes
      1. Sorted and Unsorted Indices
      2. Stacking and Unstacking Indices
      3. Index Setting and Resetting
  22. 18. Combining Datasets: concat and append
    1. Recall: Concatenation of NumPy Arrays
    2. Simple Concatenation with pd.concat
      1. Duplicate Indices
      2. Concatenation with Joins
      3. The append Method
  23. 19. Combining Datasets: merge and join
    1. Relational Algebra
    2. Categories of Joins
      1. One-to-One Joins
      2. Many-to-One Joins
      3. Many-to-Many Joins
    3. Specification of the Merge Key
      1. The on Keyword
      2. The left_on and right_on Keywords
      3. The left_index and right_index Keywords
    4. Specifying Set Arithmetic for Joins
    5. Overlapping Column Names: The suffixes Keyword
    6. Example: US States Data
  24. 20. Aggregation and Grouping
    1. Planets Data
    2. Simple Aggregation in Pandas
    3. groupby: Split, Apply, Combine
      1. Split, Apply, Combine
      2. The GroupBy Object
      3. Aggregate, Filter, Transform, Apply
      4. Specifying the Split Key
      5. Grouping Example
  25. 21. Pivot Tables
    1. Motivating Pivot Tables
    2. Pivot Tables by Hand
    3. Pivot Table Syntax
      1. Multilevel Pivot Tables
      2. Additional Pivot Table Options
    4. Example: Birthrate Data
  26. 22. Vectorized String Operations
    1. Introducing Pandas String Operations
    2. Tables of Pandas String Methods
      1. Methods Similar to Python String Methods
      2. Methods Using Regular Expressions
      3. Miscellaneous Methods
    3. Example: Recipe Database
      1. A Simple Recipe Recommender
      2. Going Further with Recipes
  27. 23. Working with Time Series
    1. Dates and Times in Python
      1. Native Python Dates and Times: datetime and dateutil
      2. Typed Arrays of Times: NumPy’s datetime64
      3. Dates and Times in Pandas: The Best of Both Worlds
    2. Pandas Time Series: Indexing by Time
    3. Pandas Time Series Data Structures
    4. Regular Sequences: pd.date_range
    5. Frequencies and Offsets
    6. Resampling, Shifting, and Windowing
      1. Resampling and Converting Frequencies
      2. Time Shifts
      3. Rolling Windows
    7. Example: Visualizing Seattle Bicycle Counts
      1. Visualizing the Data
      2. Digging into the Data
  28. 24. High-Performance Pandas: eval and query
    1. Motivating query and eval: Compound Expressions
    2. pandas.eval for Efficient Operations
    3. DataFrame.eval for Column-Wise Operations
      1. Assignment in DataFrame.eval
      2. Local Variables in DataFrame.eval
    4. The DataFrame.query Method
    5. Performance: When to Use These Functions
    6. Further Resources
  29. IV. Visualization with Matplotlib
  30. 25. General Matplotlib Tips
    1. Importing Matplotlib
    2. Setting Styles
    3. show or No show? How to Display Your Plots
      1. Plotting from a Script
      2. Plotting from an IPython Shell
      3. Plotting from a Jupyter Notebook
      4. Saving Figures to File
      5. Two Interfaces for the Price of One
  31. 26. Simple Line Plots
    1. Adjusting the Plot: Line Colors and Styles
    2. Adjusting the Plot: Axes Limits
    3. Labeling Plots
    4. Matplotlib Gotchas
  32. 27. Simple Scatter Plots
    1. Scatter Plots with plt.plot
    2. Scatter Plots with plt.scatter
    3. plot Versus scatter: A Note on Efficiency
    4. Visualizing Uncertainties
      1. Basic Errorbars
      2. Continuous Errors
  33. 28. Density and Contour Plots
    1. Visualizing a Three-Dimensional Function
    2. Histograms, Binnings, and Density
    3. Two-Dimensional Histograms and Binnings
      1. plt.hist2d: Two-Dimensional Histogram
      2. plt.hexbin: Hexagonal Binnings
      3. Kernel Density Estimation
  34. 29. Customizing Plot Legends
    1. Choosing Elements for the Legend
    2. Legend for Size of Points
    3. Multiple Legends
  35. 30. Customizing Colorbars
    1. Customizing Colorbars
      1. Choosing the Colormap
      2. Color Limits and Extensions
      3. Discrete Colorbars
    2. Example: Handwritten Digits
  36. 31. Multiple Subplots
    1. plt.axes: Subplots by Hand
    2. plt.subplot: Simple Grids of Subplots
    3. plt.subplots: The Whole Grid in One Go
    4. plt.GridSpec: More Complicated Arrangements
  37. 32. Text and Annotation
    1. Example: Effect of Holidays on US Births
    2. Transforms and Text Position
    3. Arrows and Annotation
  38. 33. Customizing Ticks
    1. Major and Minor Ticks
    2. Hiding Ticks or Labels
    3. Reducing or Increasing the Number of Ticks
    4. Fancy Tick Formats
    5. Summary of Formatters and Locators
  39. 34. Customizing Matplotlib: Configurations and Stylesheets
    1. Plot Customization by Hand
    2. Changing the Defaults: rcParams
    3. Stylesheets
      1. Default Style
      2. FiveThirtyEight Style
      3. ggplot Style
      4. Bayesian Methods for Hackers Style
      5. Dark Background Style
      6. Grayscale Style
      7. Seaborn Style
  40. 35. Three-Dimensional Plotting in Matplotlib
    1. Three-Dimensional Points and Lines
    2. Three-Dimensional Contour Plots
    3. Wireframes and Surface Plots
    4. Surface Triangulations
    5. Example: Visualizing a Möbius Strip
  41. 36. Visualization with Seaborn
    1. Exploring Seaborn Plots
      1. Histograms, KDE, and Densities
      2. Pair Plots
      3. Faceted Histograms
    2. Categorical Plots
      1. Joint Distributions
      2. Bar Plots
    3. Example: Exploring Marathon Finishing Times
    4. Further Resources
    5. Other Python Visualization Libraries
  42. V. Machine Learning
  43. 37. What Is Machine Learning?
    1. Categories of Machine Learning
    2. Qualitative Examples of Machine Learning Applications
      1. Classification: Predicting Discrete Labels
      2. Regression: Predicting Continuous Labels
      3. Clustering: Inferring Labels on Unlabeled Data
      4. Dimensionality Reduction: Inferring Structure of Unlabeled Data
    3. Summary
  44. 38. Introducing Scikit-Learn
    1. Data Representation in Scikit-Learn
      1. The Features Matrix
      2. The Target Array
    2. The Estimator API
      1. Basics of the API
      2. Supervised Learning Example: Simple Linear Regression
      3. Supervised Learning Example: Iris Classification
      4. Unsupervised Learning Example: Iris Dimensionality
      5. Unsupervised Learning Example: Iris Clustering
    3. Application: Exploring Handwritten Digits
      1. Loading and Visualizing the Digits Data
      2. Unsupervised Learning Example: Dimensionality Reduction
      3. Classification on Digits
    4. Summary
  45. 39. Hyperparameters and Model Validation
    1. Thinking About Model Validation
      1. Model Validation the Wrong Way
      2. Model Validation the Right Way: Holdout Sets
      3. Model Validation via Cross-Validation
    2. Selecting the Best Model
      1. The Bias-Variance Trade-off
      2. Validation Curves in Scikit-Learn
    3. Learning Curves
    4. Validation in Practice: Grid Search
    5. Summary
  46. 40. Feature Engineering
    1. Categorical Features
    2. Text Features
    3. Image Features
    4. Derived Features
    5. Imputation of Missing Data
    6. Feature Pipelines
  47. 41. In Depth: Naive Bayes Classification
    1. Bayesian Classification
    2. Gaussian Naive Bayes
    3. Multinomial Naive Bayes
      1. Example: Classifying Text
    4. When to Use Naive Bayes
  48. 42. In Depth: Linear Regression
    1. Simple Linear Regression
    2. Basis Function Regression
      1. Polynomial Basis Functions
      2. Gaussian Basis Functions
    3. Regularization
      1. Ridge Regression (L2 Regularization)
      2. Lasso Regression (L1 Regularization)
    4. Example: Predicting Bicycle Traffic
  49. 43. In Depth: Support Vector Machines
    1. Motivating Support Vector Machines
    2. Support Vector Machines: Maximizing the Margin
      1. Fitting a Support Vector Machine
      2. Beyond Linear Boundaries: Kernel SVM
      3. Tuning the SVM: Softening Margins
    3. Example: Face Recognition
    4. Summary
  50. 44. In Depth: Decision Trees and Random Forests
    1. Motivating Random Forests: Decision Trees
      1. Creating a Decision Tree
      2. Decision Trees and Overfitting
    2. Ensembles of Estimators: Random Forests
    3. Random Forest Regression
    4. Example: Random Forest for Classifying Digits
    5. Summary
  51. 45. In Depth: Principal Component Analysis
    1. Introducing Principal Component Analysis
      1. PCA as Dimensionality Reduction
      2. PCA for Visualization: Handwritten Digits
      3. What Do the Components Mean?
      4. Choosing the Number of Components
    2. PCA as Noise Filtering
    3. Example: Eigenfaces
    4. Summary
  52. 46. In Depth: Manifold Learning
    1. Manifold Learning: “HELLO”
    2. Multidimensional Scaling
      1. MDS as Manifold Learning
      2. Nonlinear Embeddings: Where MDS Fails
    3. Nonlinear Manifolds: Locally Linear Embedding
    4. Some Thoughts on Manifold Methods
    5. Example: Isomap on Faces
    6. Example: Visualizing Structure in Digits
  53. 47. In Depth: k-Means Clustering
    1. Introducing k-Means
    2. Expectation–Maximization
    3. Examples
      1. Example 1: k-Means on Digits
      2. Example 2: k-Means for Color Compression
  54. 48. In Depth: Gaussian Mixture Models
    1. Motivating Gaussian Mixtures: Weaknesses of k-Means
    2. Generalizing E–M: Gaussian Mixture Models
    3. Choosing the Covariance Type
    4. Gaussian Mixture Models as Density Estimation
    5. Example: GMMs for Generating New Data
  55. 49. In Depth: Kernel Density Estimation
    1. Motivating Kernel Density Estimation: Histograms
    2. Kernel Density Estimation in Practice
    3. Selecting the Bandwidth via Cross-Validation
    4. Example: Not-so-Naive Bayes
      1. Anatomy of a Custom Estimator
      2. Using Our Custom Estimator
  56. 50. Application: A Face Detection Pipeline
    1. HOG Features
    2. HOG in Action: A Simple Face Detector
      1. 1. Obtain a Set of Positive Training Samples
      2. 2. Obtain a Set of Negative Training Samples
      3. 3. Combine Sets and Extract HOG Features
      4. 4. Train a Support Vector Machine
      5. 5. Find Faces in a New Image
    3. Caveats and Improvements
    4. Further Machine Learning Resources
  57. Index
  58. About the Author

Product information

  • Title: Python Data Science Handbook, 2nd Edition
  • Author(s): Jake VanderPlas
  • Release date: December 2022
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098121228