Chapter 39. Hyperparameters and Model Validation
In the previous chapter, we saw the basic recipe for applying a supervised machine learning model:
- Choose a class of model.
- Choose model hyperparameters.
- Fit the model to the training data.
- Use the model to predict labels for new data.
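As a quick illustration of these four steps, here is a minimal sketch using scikit-learn; the Gaussian naive Bayes classifier and the variable names are illustrative assumptions rather than the choices this chapter focuses on:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

# Load an example dataset: feature matrix X and label vector y
iris = load_iris()
X, y = iris.data, iris.target

# Steps 1 and 2: choose a class of model and its hyperparameters
# (GaussianNB is a simple choice with essentially no hyperparameters to set)
model = GaussianNB()

# Step 3: fit the model to the training data
model.fit(X, y)

# Step 4: use the model to predict labels
# (here we simply reuse a few rows of X to show the call)
y_pred = model.predict(X[:5])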
The first two pieces of this—the choice of model and choice of hyperparameters—are perhaps the most important part of using these tools and techniques effectively. In order to make informed choices, we need a way to validate that our model and our hyperparameters are a good fit to the data. While this may sound simple, there are some pitfalls that you must avoid to do this effectively.
Thinking About Model Validation
In principle, model validation is very simple: after choosing a model and its hyperparameters, we can estimate how effective it is by applying it to some of the training data and comparing the predictions to the known values.
This section will first show a naive approach to model validation and why it fails, before exploring the use of holdout sets and cross-validation for more robust model evaluation.
Model Validation the Wrong Way
Let’s start with the naive approach to validation using the Iris dataset, which we saw in the previous chapter. We will start by loading the data:
In[1]: from sklearn.datasets import load_iris
       iris = load_iris()
       X = iris.data
       y = iris.target
Next, we choose a model and hyperparameters. Here we'll use a k-nearest neighbors classifier with n_neighbors=1; that is, the label of an unknown point is taken to be the label of its single closest training point.
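As a minimal sketch of how this naive evaluation might proceed (the variable names are illustrative, and X and y are assumed to come from the cell above), we can instantiate the classifier, fit it to the full dataset, and then measure its accuracy on that same data:

from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Choose the model class and hyperparameter: a 1-nearest-neighbor classifier
model = KNeighborsClassifier(n_neighbors=1)

# Fit the model, then predict labels for the very same data it was trained on
model.fit(X, y)
y_pred = model.predict(X)

# For a 1-NN model each training point is its own nearest neighbor,
# so this reports an essentially perfect score (1.0 on the Iris data)
accuracy_score(y, y_pred)

This apparently perfect score is misleading: the model has simply memorized the training data, which is exactly the pitfall that holdout sets and cross-validation are designed to expose.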