Python Data Science Handbook, 2nd Edition

by Jake VanderPlas

December 2022

Beginner to intermediate

588 pages

13h 43m

English

O'Reilly Media, Inc.

What Is Data Science?; Who Is This Book For?; Why Python?; Outline of the Book; Installation Considerations; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us
Launching the IPython Shell; Launching the Jupyter Notebook; Help and Documentation in IPython; Accessing Documentation with ?; Accessing Source Code with ??; Exploring Modules with Tab Completion; Keyboard Shortcuts in the IPython Shell; Navigation Shortcuts; Text Entry Shortcuts; Command History Shortcuts; Miscellaneous Shortcuts
IPython Magic Commands; Running External Code: %run; Timing Code Execution: %timeit; Help on Magic Functions: ?, %magic, and %lsmagic; Input and Output History; IPython’s In and Out Objects; Underscore Shortcuts and Previous Outputs; Suppressing Output; Related Magic Commands; IPython and Shell Commands; Quick Introduction to the Shell; Shell Commands in IPython; Passing Values to and from the Shell; Shell-Related Magic Commands
Errors and Debugging; Controlling Exceptions: %xmode; Debugging: When Reading Tracebacks Is Not Enough; Profiling and Timing Code; Timing Code Snippets: %timeit and %time; Profiling Full Scripts: %prun; Line-by-Line Profiling with %lprun; Profiling Memory Use: %memit and %mprun; More IPython Resources; Web Resources; Books
A Python Integer Is More Than Just an Integer; A Python List Is More Than Just a List; Fixed-Type Arrays in Python; Creating Arrays from Python Lists; Creating Arrays from Scratch; NumPy Standard Data Types
NumPy Array Attributes; Array Indexing: Accessing Single Elements; Array Slicing: Accessing Subarrays; One-Dimensional Subarrays; Multidimensional Subarrays; Subarrays as No-Copy Views; Creating Copies of Arrays; Reshaping of Arrays; Array Concatenation and Splitting; Concatenation of Arrays; Splitting of Arrays
The Slowness of Loops; Introducing Ufuncs; Exploring NumPy’s Ufuncs; Array Arithmetic; Absolute Value; Trigonometric Functions; Exponents and Logarithms; Specialized Ufuncs; Advanced Ufunc Features; Specifying Output; Aggregations; Outer Products; Ufuncs: Learning More
Summing the Values in an Array; Minimum and Maximum; Multidimensional Aggregates; Other Aggregation Functions; Example: What Is the Average Height of US Presidents?

Introducing Broadcasting; Rules of Broadcasting; Broadcasting Example 1; Broadcasting Example 2; Broadcasting Example 3; Broadcasting in Practice; Centering an Array; Plotting a Two-Dimensional Function
Example: Counting Rainy Days; Comparison Operators as Ufuncs; Working with Boolean Arrays; Counting Entries; Boolean Operators; Boolean Arrays as Masks; Using the Keywords and/or Versus the Operators &/|
Exploring Fancy Indexing; Combined Indexing; Example: Selecting Random Points; Modifying Values with Fancy Indexing; Example: Binning Data
Fast Sorting in NumPy: np.sort and np.argsort; Sorting Along Rows or Columns; Partial Sorts: Partitioning; Example: k-Nearest Neighbors
Exploring Structured Array Creation; More Advanced Compound Types; Record Arrays: Structured Arrays with a Twist; On to Pandas
The Pandas Series Object; Series as Generalized NumPy Array; Series as Specialized Dictionary; Constructing Series Objects; The Pandas DataFrame Object; DataFrame as Generalized NumPy Array; DataFrame as Specialized Dictionary; Constructing DataFrame Objects; The Pandas Index Object; Index as Immutable Array; Index as Ordered Set
Data Selection in Series; Series as Dictionary; Series as One-Dimensional Array; Indexers: loc and iloc; Data Selection in DataFrames; DataFrame as Dictionary; DataFrame as Two-Dimensional Array; Additional Indexing Conventions
Ufuncs: Index Preservation; Ufuncs: Index Alignment; Index Alignment in Series; Index Alignment in DataFrames; Ufuncs: Operations Between DataFrames and Series
Trade-offs in Missing Data Conventions; Missing Data in Pandas; None as a Sentinel Value; NaN: Missing Numerical Data; NaN and None in Pandas; Pandas Nullable Dtypes; Operating on Null Values; Detecting Null Values; Dropping Null Values; Filling Null Values
A Multiply Indexed Series; The Bad Way; The Better Way: The Pandas MultiIndex; MultiIndex as Extra Dimension; Methods of MultiIndex Creation; Explicit MultiIndex Constructors; MultiIndex Level Names; MultiIndex for Columns; Indexing and Slicing a MultiIndex; Multiply Indexed Series; Multiply Indexed DataFrames; Rearranging Multi-Indexes; Sorted and Unsorted Indices; Stacking and Unstacking Indices; Index Setting and Resetting
Recall: Concatenation of NumPy Arrays; Simple Concatenation with pd.concat; Duplicate Indices; Concatenation with Joins; The append Method
Relational Algebra; Categories of Joins; One-to-One Joins; Many-to-One Joins; Many-to-Many Joins; Specification of the Merge Key; The on Keyword; The left_on and right_on Keywords; The left_index and right_index Keywords; Specifying Set Arithmetic for Joins; Overlapping Column Names: The suffixes Keyword; Example: US States Data
Planets Data; Simple Aggregation in Pandas; groupby: Split, Apply, Combine; Split, Apply, Combine; The GroupBy Object; Aggregate, Filter, Transform, Apply; Specifying the Split Key; Grouping Example
Motivating Pivot Tables; Pivot Tables by Hand; Pivot Table Syntax; Multilevel Pivot Tables; Additional Pivot Table Options; Example: Birthrate Data
Introducing Pandas String Operations; Tables of Pandas String Methods; Methods Similar to Python String Methods; Methods Using Regular Expressions; Miscellaneous Methods; Example: Recipe Database; A Simple Recipe Recommender; Going Further with Recipes
Dates and Times in Python; Native Python Dates and Times: datetime and dateutil; Typed Arrays of Times: NumPy’s datetime64; Dates and Times in Pandas: The Best of Both Worlds; Pandas Time Series: Indexing by Time; Pandas Time Series Data Structures; Regular Sequences: pd.date_range; Frequencies and Offsets; Resampling, Shifting, and Windowing; Resampling and Converting Frequencies; Time Shifts; Rolling Windows; Example: Visualizing Seattle Bicycle Counts; Visualizing the Data; Digging into the Data
Motivating query and eval: Compound Expressions; pandas.eval for Efficient Operations; DataFrame.eval for Column-Wise Operations; Assignment in DataFrame.eval; Local Variables in DataFrame.eval; The DataFrame.query Method; Performance: When to Use These Functions; Further Resources
Importing Matplotlib; Setting Styles; show or No show? How to Display Your Plots; Plotting from a Script; Plotting from an IPython Shell; Plotting from a Jupyter Notebook; Saving Figures to File; Two Interfaces for the Price of One
Adjusting the Plot: Line Colors and Styles; Adjusting the Plot: Axes Limits; Labeling Plots; Matplotlib Gotchas
Scatter Plots with plt.plot; Scatter Plots with plt.scatter; plot Versus scatter: A Note on Efficiency; Visualizing Uncertainties; Basic Errorbars; Continuous Errors
Visualizing a Three-Dimensional Function; Histograms, Binnings, and Density; Two-Dimensional Histograms and Binnings; plt.hist2d: Two-Dimensional Histogram; plt.hexbin: Hexagonal Binnings; Kernel Density Estimation
Choosing Elements for the Legend; Legend for Size of Points; Multiple Legends
Customizing Colorbars; Choosing the Colormap; Color Limits and Extensions; Discrete Colorbars; Example: Handwritten Digits
plt.axes: Subplots by Hand; plt.subplot: Simple Grids of Subplots; plt.subplots: The Whole Grid in One Go; plt.GridSpec: More Complicated Arrangements
Example: Effect of Holidays on US Births; Transforms and Text Position; Arrows and Annotation
Major and Minor Ticks; Hiding Ticks or Labels; Reducing or Increasing the Number of Ticks; Fancy Tick Formats; Summary of Formatters and Locators
Plot Customization by Hand; Changing the Defaults: rcParams; Stylesheets; Default Style; FiveThirtyEight Style; ggplot Style; Bayesian Methods for Hackers Style; Dark Background Style; Grayscale Style; Seaborn Style
Three-Dimensional Points and Lines; Three-Dimensional Contour Plots; Wireframes and Surface Plots; Surface Triangulations; Example: Visualizing a Möbius Strip
Exploring Seaborn Plots; Histograms, KDE, and Densities; Pair Plots; Faceted Histograms; Categorical Plots; Joint Distributions; Bar Plots; Example: Exploring Marathon Finishing Times; Further Resources; Other Python Visualization Libraries
Categories of Machine Learning; Qualitative Examples of Machine Learning Applications; Classification: Predicting Discrete Labels; Regression: Predicting Continuous Labels; Clustering: Inferring Labels on Unlabeled Data; Dimensionality Reduction: Inferring Structure of Unlabeled Data; Summary
Data Representation in Scikit-Learn; The Features Matrix; The Target Array; The Estimator API; Basics of the API; Supervised Learning Example: Simple Linear Regression; Supervised Learning Example: Iris Classification; Unsupervised Learning Example: Iris Dimensionality; Unsupervised Learning Example: Iris Clustering; Application: Exploring Handwritten Digits; Loading and Visualizing the Digits Data; Unsupervised Learning Example: Dimensionality Reduction; Classification on Digits; Summary
Thinking About Model Validation; Model Validation the Wrong Way; Model Validation the Right Way: Holdout Sets; Model Validation via Cross-Validation; Selecting the Best Model; The Bias-Variance Trade-off; Validation Curves in Scikit-Learn; Learning Curves; Validation in Practice: Grid Search; Summary
Categorical Features; Text Features; Image Features; Derived Features; Imputation of Missing Data; Feature Pipelines
Bayesian Classification; Gaussian Naive Bayes; Multinomial Naive Bayes; Example: Classifying Text; When to Use Naive Bayes
Simple Linear Regression; Basis Function Regression; Polynomial Basis Functions; Gaussian Basis Functions; Regularization; Ridge Regression (L2 Regularization); Lasso Regression (L1 Regularization); Example: Predicting Bicycle Traffic
Motivating Support Vector Machines; Support Vector Machines: Maximizing the Margin; Fitting a Support Vector Machine; Beyond Linear Boundaries: Kernel SVM; Tuning the SVM: Softening Margins; Example: Face Recognition; Summary
Motivating Random Forests: Decision Trees; Creating a Decision Tree; Decision Trees and Overfitting; Ensembles of Estimators: Random Forests; Random Forest Regression; Example: Random Forest for Classifying Digits; Summary
Introducing Principal Component Analysis; PCA as Dimensionality Reduction; PCA for Visualization: Handwritten Digits; What Do the Components Mean?; Choosing the Number of Components; PCA as Noise Filtering; Example: Eigenfaces; Summary
Manifold Learning: “HELLO”; Multidimensional Scaling; MDS as Manifold Learning; Nonlinear Embeddings: Where MDS Fails; Nonlinear Manifolds: Locally Linear Embedding; Some Thoughts on Manifold Methods; Example: Isomap on Faces; Example: Visualizing Structure in Digits
Introducing k-Means; Expectation–Maximization; Examples; Example 1: k-Means on Digits; Example 2: k-Means for Color Compression
Motivating Gaussian Mixtures: Weaknesses of k-Means; Generalizing E–M: Gaussian Mixture Models; Choosing the Covariance Type; Gaussian Mixture Models as Density Estimation; Example: GMMs for Generating New Data
Motivating Kernel Density Estimation: Histograms; Kernel Density Estimation in Practice; Selecting the Bandwidth via Cross-Validation; Example: Not-so-Naive Bayes; Anatomy of a Custom Estimator; Using Our Custom Estimator
HOG Features; HOG in Action: A Simple Face Detector; 1. Obtain a Set of Positive Training Samples; 2. Obtain a Set of Negative Training Samples; 3. Combine Sets and Extract HOG Features; 4. Train a Support Vector Machine; 5. Find Faces in a New Image; Caveats and Improvements; Further Machine Learning Resources

Content preview from Python Data Science Handbook, 2nd Edition

Chapter 38. Introducing Scikit-Learn

Several Python libraries provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is straightforward.
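
As a rough illustration of that uniformity, the sketch below fits two unrelated regression models using exactly the same calls. The synthetic data, the choice of `LinearRegression` and `KNeighborsRegressor`, and the variable names are assumptions made for this example and are not taken from the chapter.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Synthetic data, made up for illustration: y is roughly 2x + 1 plus noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))  # 50 samples, 1 feature
y = 2 * X[:, 0] + 1 + rng.normal(scale=0.1, size=50)

# Two new inputs to predict for.
X_new = np.array([[2.0], [5.0]])

# Each estimator is constructed, fit, and queried through the same methods.
predictions = {}
for model in (LinearRegression(), KNeighborsRegressor(n_neighbors=3)):
    model.fit(X, y)
    predictions[type(model).__name__] = model.predict(X_new)

for name, y_pred in predictions.items():
    print(name, y_pred)
```

Only the line that constructs the model differs between the two cases; the `fit` and `predict` calls are unchanged.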

This chapter provides an overview of the Scikit-Learn API. A solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following chapters.

We will start by covering data representation in Scikit-Learn, then delve into the Estimator API, and finally go through a more interesting example of using these tools for exploring a set of images of handwritten digits.

Data Representation in Scikit-Learn

Machine learning is about creating models from data; for that reason, we’ll start by discussing how data can be represented. The best way to think about data within Scikit-Learn is in terms of tables.

A basic table is a two-dimensional grid of data, in which the rows represent individual elements of the dataset, and the columns represent quantities related to each of these elements. For example, consider the Iris dataset, famously ...
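
The table layout described above can be sketched with the copy of the Iris data that ships with Scikit-Learn. Using `sklearn.datasets.load_iris` here is an assumption made for illustration; it is one convenient loader and not necessarily the one the chapter goes on to use.

```python
from sklearn.datasets import load_iris

# Load the bundled Iris data (loader chosen for illustration only).
iris = load_iris()

X = iris.data    # 2D array: one row per flower, one column per measurement
y = iris.target  # 1D array: one species label per row

print(X.shape)             # (number of rows, number of columns)
print(iris.feature_names)  # names of the measured quantities (the columns)
print(X[:3])               # the first three rows of the table
```

Each row of `X` is one element of the dataset, and each column is one quantity recorded for that element.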