Python Data Science Handbook, 2nd Edition

Name: Python Data Science Handbook, 2nd Edition
Author: Jake VanderPlas
ISBN: 9781098121228

by Jake VanderPlas

December 2022

Beginner to intermediate

588 pages

13h 43m

English

O'Reilly Media, Inc.

Preface
What Is Data Science?; Who Is This Book For?; Why Python?; Outline of the Book; Installation Considerations; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us
I. Jupyter: Beyond Normal Python
1. Getting Started in IPython and Jupyter
Launching the IPython Shell; Launching the Jupyter Notebook; Help and Documentation in IPython; Accessing Documentation with ?; Accessing Source Code with ??; Exploring Modules with Tab Completion; Keyboard Shortcuts in the IPython Shell; Navigation Shortcuts; Text Entry Shortcuts; Command History Shortcuts; Miscellaneous Shortcuts
2. Enhanced Interactive Features
IPython Magic Commands; Running External Code: %run; Timing Code Execution: %timeit; Help on Magic Functions: ?, %magic, and %lsmagic; Input and Output History; IPython’s In and Out Objects; Underscore Shortcuts and Previous Outputs; Suppressing Output; Related Magic Commands; IPython and Shell Commands; Quick Introduction to the Shell; Shell Commands in IPython; Passing Values to and from the Shell; Shell-Related Magic Commands
3. Debugging and Profiling
Errors and Debugging; Controlling Exceptions: %xmode; Debugging: When Reading Tracebacks Is Not Enough; Profiling and Timing Code; Timing Code Snippets: %timeit and %time; Profiling Full Scripts: %prun; Line-by-Line Profiling with %lprun; Profiling Memory Use: %memit and %mprun; More IPython Resources; Web Resources; Books
II. Introduction to NumPy
4. Understanding Data Types in Python
A Python Integer Is More Than Just an Integer; A Python List Is More Than Just a List; Fixed-Type Arrays in Python; Creating Arrays from Python Lists; Creating Arrays from Scratch; NumPy Standard Data Types
5. The Basics of NumPy Arrays
NumPy Array Attributes; Array Indexing: Accessing Single Elements; Array Slicing: Accessing Subarrays; One-Dimensional Subarrays; Multidimensional Subarrays; Subarrays as No-Copy Views; Creating Copies of Arrays; Reshaping of Arrays; Array Concatenation and Splitting; Concatenation of Arrays; Splitting of Arrays
6. Computation on NumPy Arrays: Universal Functions
The Slowness of Loops; Introducing Ufuncs; Exploring NumPy’s Ufuncs; Array Arithmetic; Absolute Value; Trigonometric Functions; Exponents and Logarithms; Specialized Ufuncs; Advanced Ufunc Features; Specifying Output; Aggregations; Outer Products; Ufuncs: Learning More
7. Aggregations: min, max, and Everything in Between
Summing the Values in an Array; Minimum and Maximum; Multidimensional Aggregates; Other Aggregation Functions; Example: What Is the Average Height of US Presidents?
8. Computation on Arrays: Broadcasting
Introducing Broadcasting; Rules of Broadcasting; Broadcasting Example 1; Broadcasting Example 2; Broadcasting Example 3; Broadcasting in Practice; Centering an Array; Plotting a Two-Dimensional Function
9. Comparisons, Masks, and Boolean Logic
Example: Counting Rainy Days; Comparison Operators as Ufuncs; Working with Boolean Arrays; Counting Entries; Boolean Operators; Boolean Arrays as Masks; Using the Keywords and/or Versus the Operators &/|
10. Fancy Indexing
Exploring Fancy Indexing; Combined Indexing; Example: Selecting Random Points; Modifying Values with Fancy Indexing; Example: Binning Data
11. Sorting Arrays
Fast Sorting in NumPy: np.sort and np.argsort; Sorting Along Rows or Columns; Partial Sorts: Partitioning; Example: k-Nearest Neighbors
12. Structured Data: NumPy’s Structured Arrays
Exploring Structured Array Creation; More Advanced Compound Types; Record Arrays: Structured Arrays with a Twist; On to Pandas
III. Data Manipulation with Pandas
13. Introducing Pandas Objects
The Pandas Series Object; Series as Generalized NumPy Array; Series as Specialized Dictionary; Constructing Series Objects; The Pandas DataFrame Object; DataFrame as Generalized NumPy Array; DataFrame as Specialized Dictionary; Constructing DataFrame Objects; The Pandas Index Object; Index as Immutable Array; Index as Ordered Set
14. Data Indexing and Selection
Data Selection in Series; Series as Dictionary; Series as One-Dimensional Array; Indexers: loc and iloc; Data Selection in DataFrames; DataFrame as Dictionary; DataFrame as Two-Dimensional Array; Additional Indexing Conventions
15. Operating on Data in Pandas
Ufuncs: Index Preservation; Ufuncs: Index Alignment; Index Alignment in Series; Index Alignment in DataFrames; Ufuncs: Operations Between DataFrames and Series
16. Handling Missing Data
Trade-offs in Missing Data Conventions; Missing Data in Pandas; None as a Sentinel Value; NaN: Missing Numerical Data; NaN and None in Pandas; Pandas Nullable Dtypes; Operating on Null Values; Detecting Null Values; Dropping Null Values; Filling Null Values
17. Hierarchical Indexing
A Multiply Indexed Series; The Bad Way; The Better Way: The Pandas MultiIndex; MultiIndex as Extra Dimension; Methods of MultiIndex Creation; Explicit MultiIndex Constructors; MultiIndex Level Names; MultiIndex for Columns; Indexing and Slicing a MultiIndex; Multiply Indexed Series; Multiply Indexed DataFrames; Rearranging Multi-Indexes; Sorted and Unsorted Indices; Stacking and Unstacking Indices; Index Setting and Resetting
18. Combining Datasets: concat and append
Recall: Concatenation of NumPy Arrays; Simple Concatenation with pd.concat; Duplicate Indices; Concatenation with Joins; The append Method
19. Combining Datasets: merge and join
Relational Algebra; Categories of Joins; One-to-One Joins; Many-to-One Joins; Many-to-Many Joins; Specification of the Merge Key; The on Keyword; The left_on and right_on Keywords; The left_index and right_index Keywords; Specifying Set Arithmetic for Joins; Overlapping Column Names: The suffixes Keyword; Example: US States Data
20. Aggregation and Grouping
Planets Data; Simple Aggregation in Pandas; groupby: Split, Apply, Combine; Split, Apply, Combine; The GroupBy Object; Aggregate, Filter, Transform, Apply; Specifying the Split Key; Grouping Example
21. Pivot Tables
Motivating Pivot Tables; Pivot Tables by Hand; Pivot Table Syntax; Multilevel Pivot Tables; Additional Pivot Table Options; Example: Birthrate Data
22. Vectorized String Operations
Introducing Pandas String Operations; Tables of Pandas String Methods; Methods Similar to Python String Methods; Methods Using Regular Expressions; Miscellaneous Methods; Example: Recipe Database; A Simple Recipe Recommender; Going Further with Recipes
23. Working with Time Series
Dates and Times in Python; Native Python Dates and Times: datetime and dateutil; Typed Arrays of Times: NumPy’s datetime64; Dates and Times in Pandas: The Best of Both Worlds; Pandas Time Series: Indexing by Time; Pandas Time Series Data Structures; Regular Sequences: pd.date_range; Frequencies and Offsets; Resampling, Shifting, and Windowing; Resampling and Converting Frequencies; Time Shifts; Rolling Windows; Example: Visualizing Seattle Bicycle Counts; Visualizing the Data; Digging into the Data
24. High-Performance Pandas: eval and query
Motivating query and eval: Compound Expressions; pandas.eval for Efficient Operations; DataFrame.eval for Column-Wise Operations; Assignment in DataFrame.eval; Local Variables in DataFrame.eval; The DataFrame.query Method; Performance: When to Use These Functions; Further Resources
IV. Visualization with Matplotlib
25. General Matplotlib Tips
Importing Matplotlib; Setting Styles; show or No show? How to Display Your Plots; Plotting from a Script; Plotting from an IPython Shell; Plotting from a Jupyter Notebook; Saving Figures to File; Two Interfaces for the Price of One
26. Simple Line Plots
Adjusting the Plot: Line Colors and Styles; Adjusting the Plot: Axes Limits; Labeling Plots; Matplotlib Gotchas
27. Simple Scatter Plots
Scatter Plots with plt.plot; Scatter Plots with plt.scatter; plot Versus scatter: A Note on Efficiency; Visualizing Uncertainties; Basic Errorbars; Continuous Errors
28. Density and Contour Plots
Visualizing a Three-Dimensional Function; Histograms, Binnings, and Density; Two-Dimensional Histograms and Binnings; plt.hist2d: Two-Dimensional Histogram; plt.hexbin: Hexagonal Binnings; Kernel Density Estimation
29. Customizing Plot Legends
Choosing Elements for the Legend; Legend for Size of Points; Multiple Legends
30. Customizing Colorbars
Customizing Colorbars; Choosing the Colormap; Color Limits and Extensions; Discrete Colorbars; Example: Handwritten Digits
31. Multiple Subplots
plt.axes: Subplots by Hand; plt.subplot: Simple Grids of Subplots; plt.subplots: The Whole Grid in One Go; plt.GridSpec: More Complicated Arrangements
32. Text and Annotation
Example: Effect of Holidays on US Births; Transforms and Text Position; Arrows and Annotation
33. Customizing Ticks
Major and Minor Ticks; Hiding Ticks or Labels; Reducing or Increasing the Number of Ticks; Fancy Tick Formats; Summary of Formatters and Locators
34. Customizing Matplotlib: Configurations and Stylesheets
Plot Customization by Hand; Changing the Defaults: rcParams; Stylesheets; Default Style; FiveThirtyEight Style; ggplot Style; Bayesian Methods for Hackers Style; Dark Background Style; Grayscale Style; Seaborn Style
35. Three-Dimensional Plotting in Matplotlib
Three-Dimensional Points and Lines; Three-Dimensional Contour Plots; Wireframes and Surface Plots; Surface Triangulations; Example: Visualizing a Möbius Strip
36. Visualization with Seaborn
Exploring Seaborn Plots; Histograms, KDE, and Densities; Pair Plots; Faceted Histograms; Categorical Plots; Joint Distributions; Bar Plots; Example: Exploring Marathon Finishing Times; Further Resources; Other Python Visualization Libraries
V. Machine Learning
37. What Is Machine Learning?
Categories of Machine Learning; Qualitative Examples of Machine Learning Applications; Classification: Predicting Discrete Labels; Regression: Predicting Continuous Labels; Clustering: Inferring Labels on Unlabeled Data; Dimensionality Reduction: Inferring Structure of Unlabeled Data; Summary
38. Introducing Scikit-Learn
Data Representation in Scikit-Learn; The Features Matrix; The Target Array; The Estimator API; Basics of the API; Supervised Learning Example: Simple Linear Regression; Supervised Learning Example: Iris Classification; Unsupervised Learning Example: Iris Dimensionality; Unsupervised Learning Example: Iris Clustering; Application: Exploring Handwritten Digits; Loading and Visualizing the Digits Data; Unsupervised Learning Example: Dimensionality Reduction; Classification on Digits; Summary
39. Hyperparameters and Model Validation
Thinking About Model Validation; Model Validation the Wrong Way; Model Validation the Right Way: Holdout Sets; Model Validation via Cross-Validation; Selecting the Best Model; The Bias-Variance Trade-off; Validation Curves in Scikit-Learn; Learning Curves; Validation in Practice: Grid Search; Summary
40. Feature Engineering
Categorical Features; Text Features; Image Features; Derived Features; Imputation of Missing Data; Feature Pipelines
41. In Depth: Naive Bayes Classification
Bayesian Classification; Gaussian Naive Bayes; Multinomial Naive Bayes; Example: Classifying Text; When to Use Naive Bayes
42. In Depth: Linear Regression
Simple Linear Regression; Basis Function Regression; Polynomial Basis Functions; Gaussian Basis Functions; Regularization; Ridge Regression (L2 Regularization); Lasso Regression (L1 Regularization); Example: Predicting Bicycle Traffic
43. In Depth: Support Vector Machines
Motivating Support Vector Machines; Support Vector Machines: Maximizing the Margin; Fitting a Support Vector Machine; Beyond Linear Boundaries: Kernel SVM; Tuning the SVM: Softening Margins; Example: Face Recognition; Summary
44. In Depth: Decision Trees and Random Forests
Motivating Random Forests: Decision Trees; Creating a Decision Tree; Decision Trees and Overfitting; Ensembles of Estimators: Random Forests; Random Forest Regression; Example: Random Forest for Classifying Digits; Summary
45. In Depth: Principal Component Analysis
Introducing Principal Component Analysis; PCA as Dimensionality Reduction; PCA for Visualization: Handwritten Digits; What Do the Components Mean?; Choosing the Number of Components; PCA as Noise Filtering; Example: Eigenfaces; Summary
46. In Depth: Manifold Learning
Manifold Learning: “HELLO”; Multidimensional Scaling; MDS as Manifold Learning; Nonlinear Embeddings: Where MDS Fails; Nonlinear Manifolds: Locally Linear Embedding; Some Thoughts on Manifold Methods; Example: Isomap on Faces; Example: Visualizing Structure in Digits
47. In Depth: k-Means Clustering
Introducing k-Means; Expectation–Maximization; Examples; Example 1: k-Means on Digits; Example 2: k-Means for Color Compression
48. In Depth: Gaussian Mixture Models
Motivating Gaussian Mixtures: Weaknesses of k-Means; Generalizing E–M: Gaussian Mixture Models; Choosing the Covariance Type; Gaussian Mixture Models as Density Estimation; Example: GMMs for Generating New Data
49. In Depth: Kernel Density Estimation
Motivating Kernel Density Estimation: Histograms; Kernel Density Estimation in Practice; Selecting the Bandwidth via Cross-Validation; Example: Not-so-Naive Bayes; Anatomy of a Custom Estimator; Using Our Custom Estimator
50. Application: A Face Detection Pipeline
HOG Features; HOG in Action: A Simple Face Detector; 1. Obtain a Set of Positive Training Samples; 2. Obtain a Set of Negative Training Samples; 3. Combine Sets and Extract HOG Features; 4. Train a Support Vector Machine; 5. Find Faces in a New Image; Caveats and Improvements; Further Machine Learning Resources
Index
About the Author

Content preview from Python Data Science Handbook, 2nd Edition

Chapter 22. Vectorized String Operations

One strength of Python is its relative ease in handling and manipulating string data. Pandas builds on this and provides a comprehensive set of vectorized string operations that are an important part of the type of munging required when working with (read: cleaning up) real-world data. In this chapter, we’ll walk through some of the Pandas string operations, and then take a look at using them to partially clean up a very messy dataset of recipes collected from the internet.

Introducing Pandas String Operations

We saw in previous chapters how tools like NumPy and Pandas generalize arithmetic operations so that we can easily and quickly perform the same operation on many array elements. For example:

In [1]: import numpy as np
        x = np.array([2, 3, 5, 7, 11, 13])
        x * 2
Out[1]: array([ 4,  6, 10, 14, 22, 26])

This vectorization of operations simplifies the syntax of operating on arrays of data: we no longer have to worry about the size or shape of the array, but just about what operation we want done. For arrays of strings, NumPy does not provide such simple access, and thus you’re stuck using a more verbose loop syntax:

In [2]: data = ['peter', 'Paul', 'MARY', 'gUIDO']
        [s.capitalize() for s in data]
Out[2]: ['Peter', 'Paul', 'Mary', 'Guido']
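
For comparison, here is a minimal sketch of the same capitalization written as a Pandas vectorized string operation, using the `str` accessor on a `Series`. The variable names `names` and `capitalized` are illustrative and not taken from the book's listing.

```python
import pandas as pd

# Same input as the list comprehension, wrapped in a Series.
data = ['peter', 'Paul', 'MARY', 'gUIDO']
names = pd.Series(data)

# The .str accessor applies the string method to every element at once.
capitalized = names.str.capitalize()
print(capitalized.tolist())  # ['Peter', 'Paul', 'Mary', 'Guido']
```

The call reads like the arithmetic example earlier: the operation is named once and applied across all elements, with no explicit loop.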

This is perhaps sufficient to work with some data, but it will break if there are any missing values, so this approach requires putting in extra checks:

In [3]: data = ['peter', 'Paul', None, 'MARY', 'gUIDO' ...
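
As a rough illustration of the extra checks this paragraph has in mind, the sketch below shows the plain comprehension failing on a `None` entry and one way to guard against it. The helper `capitalize_or_none` is a hypothetical name, and the block is an independent example, not a reconstruction of the truncated listing above.

```python
data = ['peter', 'Paul', None, 'MARY', 'gUIDO']

# The unguarded comprehension fails because None has no capitalize() method.
try:
    [s.capitalize() for s in data]
    failure = None
except AttributeError as err:
    failure = str(err)

def capitalize_or_none(s):
    """Hypothetical helper: pass missing values through untouched."""
    return None if s is None else s.capitalize()

# The guarded version keeps the missing entry in place.
guarded = [capitalize_or_none(s) for s in data]
print(guarded)  # ['Peter', 'Paul', None, 'Mary', 'Guido']
```

Every string method used this way needs the same guard, which is the verbosity that vectorized string operations are meant to remove.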

Publisher Resources

ISBN: 9781098121211
