
Python Data Science Handbook

Name: Python Data Science Handbook
Author: Jake VanderPlas
ISBN: 9781491912133

by Jake VanderPlas

November 2016

Beginner to intermediate

548 pages

13h 58m

English

O'Reilly Media, Inc.


Preface
  What Is Data Science?
  Who Is This Book For?
  Why Python?
  Python 2 Versus Python 3
  Outline of This Book
  Using Code Examples
  Installation Considerations
  Conventions Used in This Book
  O’Reilly Safari
  How to Contact Us
1. IPython: Beyond Normal Python
  Shell or Notebook?
  Launching the IPython Shell
  Launching the Jupyter Notebook
  Help and Documentation in IPython
  Accessing Documentation with ?
  Accessing Source Code with ??
  Exploring Modules with Tab Completion
  Keyboard Shortcuts in the IPython Shell
  Navigation Shortcuts
  Text Entry Shortcuts
  Command History Shortcuts
  Miscellaneous Shortcuts
  IPython Magic Commands
  Pasting Code Blocks: %paste and %cpaste
  Running External Code: %run
  Timing Code Execution: %timeit
  Help on Magic Functions: ?, %magic, and %lsmagic
  Input and Output History
  IPython’s In and Out Objects
  Underscore Shortcuts and Previous Outputs
  Suppressing Output
  Related Magic Commands
  IPython and Shell Commands
  Quick Introduction to the Shell
  Shell Commands in IPython
  Passing Values to and from the Shell
  Shell-Related Magic Commands
  Errors and Debugging
  Controlling Exceptions: %xmode
  Debugging: When Reading Tracebacks Is Not Enough
  Profiling and Timing Code
  Timing Code Snippets: %timeit and %time
  Profiling Full Scripts: %prun
  Line-by-Line Profiling with %lprun
  Profiling Memory Use: %memit and %mprun
  More IPython Resources
  Web Resources
  Books
2. Introduction to NumPy
  Understanding Data Types in Python
  A Python Integer Is More Than Just an Integer
  A Python List Is More Than Just a List
  Fixed-Type Arrays in Python
  Creating Arrays from Python Lists
  Creating Arrays from Scratch
  NumPy Standard Data Types
  The Basics of NumPy Arrays
  NumPy Array Attributes
  Array Indexing: Accessing Single Elements
  Array Slicing: Accessing Subarrays
  Reshaping of Arrays
  Array Concatenation and Splitting
  Computation on NumPy Arrays: Universal Functions
  The Slowness of Loops
  Introducing UFuncs
  Exploring NumPy’s UFuncs
  Advanced Ufunc Features
  Ufuncs: Learning More
  Aggregations: Min, Max, and Everything in Between
  Summing the Values in an Array
  Minimum and Maximum
  Example: What Is the Average Height of US Presidents?
  Computation on Arrays: Broadcasting
  Introducing Broadcasting
  Rules of Broadcasting
  Broadcasting in Practice
  Comparisons, Masks, and Boolean Logic
  Example: Counting Rainy Days
  Comparison Operators as ufuncs
  Working with Boolean Arrays
  Boolean Arrays as Masks
  Fancy Indexing
  Exploring Fancy Indexing
  Combined Indexing
  Example: Selecting Random Points
  Modifying Values with Fancy Indexing
  Example: Binning Data
  Sorting Arrays
  Fast Sorting in NumPy: np.sort and np.argsort
  Partial Sorts: Partitioning
  Example: k-Nearest Neighbors
  Structured Data: NumPy’s Structured Arrays
  Creating Structured Arrays
  More Advanced Compound Types
  RecordArrays: Structured Arrays with a Twist
  On to Pandas
3. Data Manipulation with Pandas
  Installing and Using Pandas
  Introducing Pandas Objects
  The Pandas Series Object
  The Pandas DataFrame Object
  The Pandas Index Object
  Data Indexing and Selection
  Data Selection in Series
  Data Selection in DataFrame
  Operating on Data in Pandas
  Ufuncs: Index Preservation
  UFuncs: Index Alignment
  Ufuncs: Operations Between DataFrame and Series
  Handling Missing Data
  Trade-Offs in Missing Data Conventions
  Missing Data in Pandas
  Operating on Null Values
  Hierarchical Indexing
  A Multiply Indexed Series
  Methods of MultiIndex Creation
  Indexing and Slicing a MultiIndex
  Rearranging Multi-Indices
  Data Aggregations on Multi-Indices
  Combining Datasets: Concat and Append
  Recall: Concatenation of NumPy Arrays
  Simple Concatenation with pd.concat
  Combining Datasets: Merge and Join
  Relational Algebra
  Categories of Joins
  Specification of the Merge Key
  Specifying Set Arithmetic for Joins
  Overlapping Column Names: The suffixes Keyword
  Example: US States Data
  Aggregation and Grouping
  Planets Data
  Simple Aggregation in Pandas
  GroupBy: Split, Apply, Combine
  Pivot Tables
  Motivating Pivot Tables
  Pivot Tables by Hand
  Pivot Table Syntax
  Example: Birthrate Data
  Vectorized String Operations
  Introducing Pandas String Operations
  Tables of Pandas String Methods
  Example: Recipe Database
  Working with Time Series
  Dates and Times in Python
  Pandas Time Series: Indexing by Time
  Pandas Time Series Data Structures
  Frequencies and Offsets
  Resampling, Shifting, and Windowing
  Where to Learn More
  Example: Visualizing Seattle Bicycle Counts
  High-Performance Pandas: eval() and query()
  Motivating query() and eval(): Compound Expressions
  pandas.eval() for Efficient Operations
  DataFrame.eval() for Column-Wise Operations
  DataFrame.query() Method
  Performance: When to Use These Functions
  Further Resources
4. Visualization with Matplotlib
  General Matplotlib Tips
  Importing matplotlib
  Setting Styles
  show() or No show()? How to Display Your Plots
  Saving Figures to File
  Two Interfaces for the Price of One
  Simple Line Plots
  Adjusting the Plot: Line Colors and Styles
  Adjusting the Plot: Axes Limits
  Labeling Plots
  Simple Scatter Plots
  Scatter Plots with plt.plot
  Scatter Plots with plt.scatter
  plot Versus scatter: A Note on Efficiency
  Visualizing Errors
  Basic Errorbars
  Continuous Errors
  Density and Contour Plots
  Visualizing a Three-Dimensional Function
  Histograms, Binnings, and Density
  Two-Dimensional Histograms and Binnings
  Customizing Plot Legends
  Choosing Elements for the Legend
  Legend for Size of Points
  Multiple Legends
  Customizing Colorbars
  Customizing Colorbars
  Example: Handwritten Digits
  Multiple Subplots
  plt.axes: Subplots by Hand
  plt.subplot: Simple Grids of Subplots
  plt.subplots: The Whole Grid in One Go
  plt.GridSpec: More Complicated Arrangements
  Text and Annotation
  Example: Effect of Holidays on US Births
  Transforms and Text Position
  Arrows and Annotation
  Customizing Ticks
  Major and Minor Ticks
  Hiding Ticks or Labels
  Reducing or Increasing the Number of Ticks
  Fancy Tick Formats
  Summary of Formatters and Locators
  Customizing Matplotlib: Configurations and Stylesheets
  Plot Customization by Hand
  Changing the Defaults: rcParams
  Stylesheets
  Three-Dimensional Plotting in Matplotlib
  Three-Dimensional Points and Lines
  Three-Dimensional Contour Plots
  Wireframes and Surface Plots
  Surface Triangulations
  Geographic Data with Basemap
  Map Projections
  Drawing a Map Background
  Plotting Data on Maps
  Example: California Cities
  Example: Surface Temperature Data
  Visualization with Seaborn
  Seaborn Versus Matplotlib
  Exploring Seaborn Plots
  Example: Exploring Marathon Finishing Times
  Further Resources
  Matplotlib Resources
  Other Python Graphics Libraries
5. Machine Learning
  What Is Machine Learning?
  Categories of Machine Learning
  Qualitative Examples of Machine Learning Applications
  Summary
  Introducing Scikit-Learn
  Data Representation in Scikit-Learn
  Scikit-Learn’s Estimator API
  Application: Exploring Handwritten Digits
  Summary
  Hyperparameters and Model Validation
  Thinking About Model Validation
  Selecting the Best Model
  Learning Curves
  Validation in Practice: Grid Search
  Summary
  Feature Engineering
  Categorical Features
  Text Features
  Image Features
  Derived Features
  Imputation of Missing Data
  Feature Pipelines
  In Depth: Naive Bayes Classification
  Bayesian Classification
  Gaussian Naive Bayes
  Multinomial Naive Bayes
  When to Use Naive Bayes
  In Depth: Linear Regression
  Simple Linear Regression
  Basis Function Regression
  Regularization
  Example: Predicting Bicycle Traffic
  In-Depth: Support Vector Machines
  Motivating Support Vector Machines
  Support Vector Machines: Maximizing the Margin
  Example: Face Recognition
  Support Vector Machine Summary
  In-Depth: Decision Trees and Random Forests
  Motivating Random Forests: Decision Trees
  Ensembles of Estimators: Random Forests
  Random Forest Regression
  Example: Random Forest for Classifying Digits
  Summary of Random Forests
  In Depth: Principal Component Analysis
  Introducing Principal Component Analysis
  PCA as Noise Filtering
  Example: Eigenfaces
  Principal Component Analysis Summary
  In-Depth: Manifold Learning
  Manifold Learning: “HELLO”
  Multidimensional Scaling (MDS)
  MDS as Manifold Learning
  Nonlinear Embeddings: Where MDS Fails
  Nonlinear Manifolds: Locally Linear Embedding
  Some Thoughts on Manifold Methods
  Example: Isomap on Faces
  Example: Visualizing Structure in Digits
  In Depth: k-Means Clustering
  Introducing k-Means
  k-Means Algorithm: Expectation–Maximization
  Examples
  In Depth: Gaussian Mixture Models
  Motivating GMM: Weaknesses of k-Means
  Generalizing E–M: Gaussian Mixture Models
  GMM as Density Estimation
  Example: GMM for Generating New Data
  In-Depth: Kernel Density Estimation
  Motivating KDE: Histograms
  Kernel Density Estimation in Practice
  Example: KDE on a Sphere
  Example: Not-So-Naive Bayes
  Application: A Face Detection Pipeline
  HOG Features
  HOG in Action: A Simple Face Detector
  Caveats and Improvements
  Further Machine Learning Resources
  Machine Learning in Python
  General Machine Learning
Index

Content preview from Python Data Science Handbook

Chapter 3. Data Manipulation with Pandas

In the previous chapter, we dove into detail on NumPy and its ndarray object, which provides efficient storage and manipulation of dense typed arrays in Python. Here we’ll build on this knowledge by looking in detail at the data structures provided by the Pandas library. Pandas is a newer package built on top of NumPy, and provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.
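The description of a DataFrame above can be made concrete with a minimal sketch. The column names, row labels, and values below are made up purely for illustration and do not come from the book; the point is only to show row and column labels, mixed column types, and a missing value living in one object.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: labels and values are illustrative assumptions.
df = pd.DataFrame(
    {
        "name": ["alpha", "beta", "gamma"],   # text column
        "reading": [1.5, np.nan, 4.5],        # float column with a missing value
        "valid": [True, False, True],         # boolean column
    },
    index=["r1", "r2", "r3"],                 # explicit row labels
)

# Each column keeps its own dtype (heterogeneous types).
print(df.dtypes)

# Label-based access by row label and column label.
second_name = df.loc["r2", "name"]

# Missing data is tracked, and aggregations skip it by default.
n_missing = int(df["reading"].isna().sum())
mean_reading = df["reading"].mean()

print(second_name, n_missing, mean_reading)
```

Here `df.loc` addresses a cell by its labels rather than by integer position, and `mean()` ignores the `NaN` entry, so the average is taken over the two readings that are present.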

As we saw, NumPy’s ndarray data structure provides essential features for the type of clean, well-organized data typically seen in numerical computing tasks. While it serves this purpose very well, its limitations become clear when we need more flexibility (attaching labels to data, working with missing data, etc.) and when attempting operations that do not map well to element-wise broadcasting (groupings, pivots, etc.), each of which is an important piece of analyzing the less structured data available in many forms in the world around us. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to these sorts of “data munging” tasks that occupy much of a data scientist’s ...
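As a rough illustration of the kind of operation that does not map well to element-wise broadcasting, the sketch below computes a per-group mean in two ways: once with a Pandas `groupby`, and once by hand with NumPy boolean masks. The data and the names `group` and `value` are hypothetical, chosen only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a label column and one missing value.
df = pd.DataFrame(
    {
        "group": ["a", "b", "a", "b", "a"],
        "value": [1.0, 2.0, np.nan, 6.0, 5.0],
    }
)

# Pandas: split by label, apply mean, combine; NaN is skipped by default.
group_means = df.groupby("group")["value"].mean()
print(group_means)

# NumPy: the same result needs explicit masks and NaN-aware functions.
groups = df["group"].to_numpy()
values = df["value"].to_numpy()
manual_means = {
    str(g): float(np.nanmean(values[groups == g])) for g in np.unique(groups)
}
print(manual_means)
```

Both approaches give the same numbers, but the Pandas version keeps the group labels attached to the result and handles the missing value without extra code.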



Python Data Science Handbook, 2nd Edition

Publisher Resources

ISBN: 9781491912126

Errata Page

