Python for Data Analysis, 3rd Edition

by Wes McKinney

August 2022

Beginner to intermediate

582 pages

13h 6m

English

O'Reilly Media, Inc.

Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
In Memoriam: John D. Hunter (1968–2012)
Acknowledgments for the Third Edition (2022)
Acknowledgments for the Second Edition (2017)
Acknowledgments for the First Edition (2012)
1.1 What Is This Book About?
    What Kinds of Data?
1.2 Why Python for Data Analysis?
    Python as Glue; Solving the “Two-Language” Problem; Why Not Python?
1.3 Essential Python Libraries
    NumPy; pandas; matplotlib; IPython and Jupyter; SciPy; scikit-learn; statsmodels; Other Packages
1.4 Installation and Setup
    Miniconda on Windows; GNU/Linux; Miniconda on macOS; Installing Necessary Packages; Integrated Development Environments and Text Editors
1.5 Community and Conferences
1.6 Navigating This Book
    Code Examples; Data for Examples; Import Conventions
2.1 The Python Interpreter
2.2 IPython Basics
    Running the IPython Shell; Running the Jupyter Notebook; Tab Completion; Introspection
2.3 Python Language Basics
    Language Semantics; Scalar Types; Control Flow
2.4 Conclusion
3.1 Data Structures and Sequences
    Tuple; List; Dictionary; Set; Built-In Sequence Functions; List, Set, and Dictionary Comprehensions
3.2 Functions
    Namespaces, Scope, and Local Functions; Returning Multiple Values; Functions Are Objects; Anonymous (Lambda) Functions; Generators; Errors and Exception Handling
3.3 Files and the Operating System
    Bytes and Unicode with Files
3.4 Conclusion
4.1 The NumPy ndarray: A Multidimensional Array Object
    Creating ndarrays; Data Types for ndarrays; Arithmetic with NumPy Arrays; Basic Indexing and Slicing; Boolean Indexing; Fancy Indexing; Transposing Arrays and Swapping Axes
4.2 Pseudorandom Number Generation
4.3 Universal Functions: Fast Element-Wise Array Functions
4.4 Array-Oriented Programming with Arrays
    Expressing Conditional Logic as Array Operations; Mathematical and Statistical Methods; Methods for Boolean Arrays; Sorting; Unique and Other Set Logic
4.5 File Input and Output with Arrays
4.6 Linear Algebra
4.7 Example: Random Walks
    Simulating Many Random Walks at Once
4.8 Conclusion
5.1 Introduction to pandas Data Structures
    Series; DataFrame; Index Objects
5.2 Essential Functionality
    Reindexing; Dropping Entries from an Axis; Indexing, Selection, and Filtering; Arithmetic and Data Alignment; Function Application and Mapping; Sorting and Ranking; Axis Indexes with Duplicate Labels
5.3 Summarizing and Computing Descriptive Statistics
    Correlation and Covariance; Unique Values, Value Counts, and Membership
5.4 Conclusion
6.1 Reading and Writing Data in Text Format
    Reading Text Files in Pieces; Writing Data to Text Format; Working with Other Delimited Formats; JSON Data; XML and HTML: Web Scraping
6.2 Binary Data Formats
    Reading Microsoft Excel Files; Using HDF5 Format
6.3 Interacting with Web APIs
6.4 Interacting with Databases
6.5 Conclusion
7.1 Handling Missing Data
    Filtering Out Missing Data; Filling In Missing Data
7.2 Data Transformation
    Removing Duplicates; Transforming Data Using a Function or Mapping; Replacing Values; Renaming Axis Indexes; Discretization and Binning; Detecting and Filtering Outliers; Permutation and Random Sampling; Computing Indicator/Dummy Variables
7.3 Extension Data Types
7.4 String Manipulation
    Python Built-In String Object Methods; Regular Expressions; String Functions in pandas
7.5 Categorical Data
    Background and Motivation; Categorical Extension Type in pandas; Computations with Categoricals; Categorical Methods
7.6 Conclusion
8.1 Hierarchical Indexing
    Reordering and Sorting Levels; Summary Statistics by Level; Indexing with a DataFrame’s columns
8.2 Combining and Merging Datasets
    Database-Style DataFrame Joins; Merging on Index; Concatenating Along an Axis; Combining Data with Overlap
8.3 Reshaping and Pivoting
    Reshaping with Hierarchical Indexing; Pivoting “Long” to “Wide” Format; Pivoting “Wide” to “Long” Format
8.4 Conclusion
9.1 A Brief matplotlib API Primer
    Figures and Subplots; Colors, Markers, and Line Styles; Ticks, Labels, and Legends; Annotations and Drawing on a Subplot; Saving Plots to File; matplotlib Configuration
9.2 Plotting with pandas and seaborn
    Line Plots; Bar Plots; Histograms and Density Plots; Scatter or Point Plots; Facet Grids and Categorical Data
9.3 Other Python Visualization Tools
9.4 Conclusion

10.1 How to Think About Group Operations
    Iterating over Groups; Selecting a Column or Subset of Columns; Grouping with Dictionaries and Series; Grouping with Functions; Grouping by Index Levels
10.2 Data Aggregation
    Column-Wise and Multiple Function Application; Returning Aggregated Data Without Row Indexes
10.3 Apply: General split-apply-combine
    Suppressing the Group Keys; Quantile and Bucket Analysis; Example: Filling Missing Values with Group-Specific Values; Example: Random Sampling and Permutation; Example: Group Weighted Average and Correlation; Example: Group-Wise Linear Regression
10.4 Group Transforms and “Unwrapped” GroupBys
10.5 Pivot Tables and Cross-Tabulation
    Cross-Tabulations: Crosstab
10.6 Conclusion
11.1 Date and Time Data Types and Tools
    Converting Between String and Datetime
11.2 Time Series Basics
    Indexing, Selection, Subsetting; Time Series with Duplicate Indices
11.3 Date Ranges, Frequencies, and Shifting
    Generating Date Ranges; Frequencies and Date Offsets; Shifting (Leading and Lagging) Data
11.4 Time Zone Handling
    Time Zone Localization and Conversion; Operations with Time Zone-Aware Timestamp Objects; Operations Between Different Time Zones
11.5 Periods and Period Arithmetic
    Period Frequency Conversion; Quarterly Period Frequencies; Converting Timestamps to Periods (and Back); Creating a PeriodIndex from Arrays
11.6 Resampling and Frequency Conversion
    Downsampling; Upsampling and Interpolation; Resampling with Periods; Grouped Time Resampling
11.7 Moving Window Functions
    Exponentially Weighted Functions; Binary Moving Window Functions; User-Defined Moving Window Functions
11.8 Conclusion
12.1 Interfacing Between pandas and Model Code
12.2 Creating Model Descriptions with Patsy
    Data Transformations in Patsy Formulas; Categorical Data and Patsy
12.3 Introduction to statsmodels
    Estimating Linear Models; Estimating Time Series Processes
12.4 Introduction to scikit-learn
12.5 Conclusion
13.1 Bitly Data from 1.USA.gov
    Counting Time Zones in Pure Python; Counting Time Zones with pandas
13.2 MovieLens 1M Dataset
    Measuring Rating Disagreement
13.3 US Baby Names 1880–2010
    Analyzing Naming Trends
13.4 USDA Food Database
13.5 2012 Federal Election Commission Database
    Donation Statistics by Occupation and Employer; Bucketing Donation Amounts; Donation Statistics by State
13.6 Conclusion
A.1 ndarray Object Internals
    NumPy Data Type Hierarchy
A.2 Advanced Array Manipulation
    Reshaping Arrays; C Versus FORTRAN Order; Concatenating and Splitting Arrays; Repeating Elements: tile and repeat; Fancy Indexing Equivalents: take and put
A.3 Broadcasting
    Broadcasting over Other Axes; Setting Array Values by Broadcasting
A.4 Advanced ufunc Usage
    ufunc Instance Methods; Writing New ufuncs in Python
A.5 Structured and Record Arrays
    Nested Data Types and Multidimensional Fields; Why Use Structured Arrays?
A.6 More About Sorting
    Indirect Sorts: argsort and lexsort; Alternative Sort Algorithms; Partially Sorting Arrays; numpy.searchsorted: Finding Elements in a Sorted Array
A.7 Writing Fast NumPy Functions with Numba
    Creating Custom numpy.ufunc Objects with Numba
A.8 Advanced Array Input and Output
    Memory-Mapped Files; HDF5 and Other Array Storage Options
A.9 Performance Tips
    The Importance of Contiguous Memory
B.1 Terminal Keyboard Shortcuts
B.2 About Magic Commands
    The %run Command; Executing Code from the Clipboard
B.3 Using the Command History
    Searching and Reusing the Command History; Input and Output Variables
B.4 Interacting with the Operating System
    Shell Commands and Aliases; Directory Bookmark System
B.5 Software Development Tools
    Interactive Debugger; Timing Code: %time and %timeit; Basic Profiling: %prun and %run -p; Profiling a Function Line by Line
B.6 Tips for Productive Code Development Using IPython
    Reloading Module Dependencies; Code Design Tips
B.7 Advanced IPython Features
    Profiles and Configuration
B.8 Conclusion

Content preview from Python for Data Analysis, 3rd Edition

Chapter 4. NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Many computational packages that provide scientific functionality use NumPy’s array objects as a standard lingua franca for data exchange. Much of the knowledge about NumPy that I cover is transferable to pandas as well.

Here are some of the things you’ll find in NumPy:

ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities
Mathematical functions for fast operations on entire arrays of data without having to write loops
Tools for reading/writing array data to disk and working with memory-mapped files
Linear algebra, random number generation, and Fourier transform capabilities
A C API for connecting NumPy with libraries written in C, C++, or FORTRAN
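
As a rough illustration of the features listed above, the following minimal sketch touches each Python-facing item in turn. The array values, the variable names, and the in-memory buffer used in place of a file on disk are assumptions made for this example and do not come from the book.

```python
import io

import numpy as np

# ndarray arithmetic is element-wise and needs no explicit loop.
data = np.array([[1.5, -0.1, 3.0], [0.0, -3.0, 6.5]])
scaled = data * 10       # every element multiplied by 10
doubled = data + data    # element-wise addition of two arrays

# Broadcasting: the 1-D row is applied to each row of the 2-D array.
row = np.array([1.0, 2.0, 3.0])
shifted = data + row

# Mathematical functions operate on whole arrays at once.
roots = np.sqrt(np.abs(data))

# Writing and reading array data; an in-memory buffer stands in for a file.
buffer = io.BytesIO()
np.save(buffer, data)
buffer.seek(0)
restored = np.load(buffer)

# Linear algebra, random number generation, and a Fourier transform.
gram = data @ data.T                        # 2x2 matrix product
rng = np.random.default_rng(seed=12345)     # seeded for reproducibility
samples = rng.standard_normal(4)
spectrum = np.fft.fft(np.array([1.0, 0.0, 0.0, 0.0]))
```

Each operation here works on the whole array in a single call, which is the array-oriented style the list describes.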

Because NumPy provides a comprehensive and well-documented C API, it is straightforward to pass data to external libraries written in a low-level language, and for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C, C++, or FORTRAN codebases and giving them a dynamic and accessible interface.
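
The C API itself is used from compiled code, so it cannot be shown directly in Python. What can be sketched from the Python side is the property that makes such exchange cheap: an array's data lives in one block of memory that other code can read or write without copying. The example below is only an illustration under that assumption; the `raw` bytearray is a hypothetical stand-in for memory owned by an external library.

```python
import numpy as np

arr = np.arange(6, dtype=np.float64).reshape(2, 3)

# The array's data is one contiguous block of 6 * 8 bytes.
is_contiguous = arr.flags["C_CONTIGUOUS"]
total_bytes = arr.nbytes

# Integer memory address of the first element; low-level code would
# receive a pointer to this location rather than a copy of the data.
address = arr.ctypes.data

# Going the other way: wrap an existing buffer as an array without copying.
# `raw` is a hypothetical stand-in for memory owned by an external library.
raw = bytearray(4 * 8)
view = np.frombuffer(raw, dtype=np.float64)
view[:] = [1.0, 2.0, 3.0, 4.0]   # writes go straight into `raw`
```

Because `view` shares memory with `raw`, the assignment changes the underlying bytes in place, which is the same zero-copy behavior that lets external libraries hand results back to Python as NumPy arrays.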

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array computing semantics, ...