
Python for Data Analysis

Name: Python for Data Analysis
Author: Wes McKinney
ISBN: 9781449319793

October 2012

Beginner to intermediate

463 pages

12h 53m

English

O'Reilly Media, Inc.


Python for Data Analysis
A Note Regarding Supplemental Files
Preface
Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us
1. Preliminaries
What Is This Book About?; Why Python for Data Analysis?; Python as Glue; Solving the “Two-Language” Problem; Why Not Python?; Essential Python Libraries; NumPy; pandas; matplotlib; IPython; SciPy; Installation and Setup; Windows; Apple OS X; GNU/Linux; Python 2 and Python 3; Integrated Development Environments (IDEs); Community and Conferences; Navigating This Book; Code Examples; Data for Examples; Import Conventions; Jargon; Acknowledgements
2. Introductory Examples
1.usa.gov data from bit.ly; Counting Time Zones in Pure Python; Counting Time Zones with pandas; MovieLens 1M Data Set; Measuring rating disagreement; US Baby Names 1880-2010; Analyzing Naming Trends; Measuring the increase in naming diversity; The “Last letter” Revolution; Boy names that became girl names (and vice versa); Conclusions and The Path Ahead
3. IPython: An Interactive Computing and Development Environment
IPython Basics; Tab Completion; Introspection; The %run Command; Interrupting running code; Executing Code from the Clipboard; IPython interaction with editors and IDEs; Keyboard Shortcuts; Exceptions and Tracebacks; Magic Commands; Qt-based Rich GUI Console; Matplotlib Integration and Pylab Mode; Using the Command History; Searching and Reusing the Command History; Input and Output Variables; Logging the Input and Output; Interacting with the Operating System; Shell Commands and Aliases; Directory Bookmark System; Software Development Tools; Interactive Debugger; Other ways to make use of the debugger; Timing Code: %time and %timeit; Basic Profiling: %prun and %run -p; Profiling a Function Line-by-Line; IPython HTML Notebook; Tips for Productive Code Development Using IPython; Reloading Module Dependencies; Code Design Tips; Keep relevant objects and data alive; Flat is better than nested; Overcome a fear of longer files; Advanced IPython Features; Making Your Own Classes IPython-friendly; Profiles and Configuration; Credits
4. NumPy Basics: Arrays and Vectorized Computation
The NumPy ndarray: A Multidimensional Array Object; Creating ndarrays; Data Types for ndarrays; Operations between Arrays and Scalars; Basic Indexing and Slicing; Indexing with slices; Boolean Indexing; Fancy Indexing; Transposing Arrays and Swapping Axes; Universal Functions: Fast Element-wise Array Functions; Data Processing Using Arrays; Expressing Conditional Logic as Array Operations; Mathematical and Statistical Methods; Methods for Boolean Arrays; Sorting; Unique and Other Set Logic; File Input and Output with Arrays; Storing Arrays on Disk in Binary Format; Saving and Loading Text Files; Linear Algebra; Random Number Generation; Example: Random Walks; Simulating Many Random Walks at Once
5. Getting Started with pandas
Introduction to pandas Data Structures; Series; DataFrame; Index Objects; Essential Functionality; Reindexing; Dropping entries from an axis; Indexing, selection, and filtering; Arithmetic and data alignment; Arithmetic methods with fill values; Operations between DataFrame and Series; Function application and mapping; Sorting and ranking; Axis indexes with duplicate values; Summarizing and Computing Descriptive Statistics; Correlation and Covariance; Unique Values, Value Counts, and Membership; Handling Missing Data; Filtering Out Missing Data; Filling in Missing Data; Hierarchical Indexing; Reordering and Sorting Levels; Summary Statistics by Level; Using a DataFrame’s Columns; Other pandas Topics; Integer Indexing; Panel Data
6. Data Loading, Storage, and File Formats
Reading and Writing Data in Text Format; Reading Text Files in Pieces; Writing Data Out to Text Format; Manually Working with Delimited Formats; JSON Data; XML and HTML: Web Scraping; Parsing XML with lxml.objectify; Binary Data Formats; Using HDF5 Format; Reading Microsoft Excel Files; Interacting with HTML and Web APIs; Interacting with Databases; Storing and Loading Data in MongoDB
7. Data Wrangling: Clean, Transform, Merge, Reshape
Combining and Merging Data Sets; Database-style DataFrame Merges; Merging on Index; Concatenating Along an Axis; Combining Data with Overlap; Reshaping and Pivoting; Reshaping with Hierarchical Indexing; Pivoting “long” to “wide” Format; Data Transformation; Removing Duplicates; Transforming Data Using a Function or Mapping; Replacing Values; Renaming Axis Indexes; Discretization and Binning; Detecting and Filtering Outliers; Permutation and Random Sampling; Computing Indicator/Dummy Variables; String Manipulation; String Object Methods; Regular expressions; Vectorized string functions in pandas; Example: USDA Food Database

8. Plotting and Visualization
A Brief matplotlib API Primer; Figures and Subplots; Adjusting the spacing around subplots; Colors, Markers, and Line Styles; Ticks, Labels, and Legends; Setting the title, axis labels, ticks, and ticklabels; Adding legends; Annotations and Drawing on a Subplot; Saving Plots to File; matplotlib Configuration; Plotting Functions in pandas; Line Plots; Bar Plots; Histograms and Density Plots; Scatter Plots; Plotting Maps: Visualizing Haiti Earthquake Crisis Data; Python Visualization Tool Ecosystem; Chaco; mayavi; Other Packages; The Future of Visualization Tools?
9. Data Aggregation and Group Operations
GroupBy Mechanics; Iterating Over Groups; Selecting a Column or Subset of Columns; Grouping with Dicts and Series; Grouping with Functions; Grouping by Index Levels; Data Aggregation; Column-wise and Multiple Function Application; Returning Aggregated Data in “unindexed” Form; Group-wise Operations and Transformations; Apply: General split-apply-combine; Suppressing the group keys; Quantile and Bucket Analysis; Example: Filling Missing Values with Group-specific Values; Example: Random Sampling and Permutation; Example: Group Weighted Average and Correlation; Example: Group-wise Linear Regression; Pivot Tables and Cross-Tabulation; Cross-Tabulations: Crosstab; Example: 2012 Federal Election Commission Database; Donation Statistics by Occupation and Employer; Bucketing Donation Amounts; Donation Statistics by State
10. Time Series
Date and Time Data Types and Tools; Converting between string and datetime; Time Series Basics; Indexing, Selection, Subsetting; Time Series with Duplicate Indices; Date Ranges, Frequencies, and Shifting; Generating Date Ranges; Frequencies and Date Offsets; Week of month dates; Shifting (Leading and Lagging) Data; Shifting dates with offsets; Time Zone Handling; Localization and Conversion; Operations with Time Zone-aware Timestamp Objects; Operations between Different Time Zones; Periods and Period Arithmetic; Period Frequency Conversion; Quarterly Period Frequencies; Converting Timestamps to Periods (and Back); Creating a PeriodIndex from Arrays; Resampling and Frequency Conversion; Downsampling; Open-High-Low-Close (OHLC) resampling; Resampling with GroupBy; Upsampling and Interpolation; Resampling with Periods; Time Series Plotting; Moving Window Functions; Exponentially-weighted functions; Binary Moving Window Functions; User-Defined Moving Window Functions; Performance and Memory Usage Notes
11. Financial and Economic Data Applications
Data Munging Topics; Time Series and Cross-Section Alignment; Operations with Time Series of Different Frequencies; Using periods instead of timestamps; Time of Day and “as of” Data Selection; Splicing Together Data Sources; Return Indexes and Cumulative Returns; Group Transforms and Analysis; Group Factor Exposures; Decile and Quartile Analysis; More Example Applications; Signal Frontier Analysis; Future Contract Rolling; Rolling Correlation and Linear Regression
12. Advanced NumPy
ndarray Object Internals; NumPy dtype Hierarchy; Advanced Array Manipulation; Reshaping Arrays; C versus Fortran Order; Concatenating and Splitting Arrays; Stacking helpers: r_ and c_; Repeating Elements: Tile and Repeat; Fancy Indexing Equivalents: Take and Put; Broadcasting; Broadcasting Over Other Axes; Setting Array Values by Broadcasting; Advanced ufunc Usage; ufunc Instance Methods; Custom ufuncs; Structured and Record Arrays; Nested dtypes and Multidimensional Fields; Why Use Structured Arrays?; Structured Array Manipulations: numpy.lib.recfunctions; More About Sorting; Indirect Sorts: argsort and lexsort; Alternate Sort Algorithms; numpy.searchsorted: Finding elements in a Sorted Array; NumPy Matrix Class; Advanced Array Input and Output; Memory-mapped Files; HDF5 and Other Array Storage Options; Performance Tips; The Importance of Contiguous Memory; Other Speed Options: Cython, f2py, C
A. Python Language Essentials
The Python Interpreter; The Basics; Language Semantics; Indentation, not braces; Everything is an object; Comments; Function and object method calls; Variables and pass-by-reference; Dynamic references, strong types; Attributes and methods; “Duck” typing; Imports; Binary operators and comparisons; Strictness versus laziness; Mutable and immutable objects; Scalar Types; Numeric types; Strings; Booleans; Type casting; None; Dates and times; Control Flow; if, elif, and else; for loops; while loops; pass; Exception handling; range and xrange; Ternary Expressions; Data Structures and Sequences; Tuple; Unpacking tuples; Tuple methods; List; Adding and removing elements; Concatenating and combining lists; Sorting; Binary search and maintaining a sorted list; Slicing; Built-in Sequence Functions; enumerate; sorted; zip; reversed; Dict; Creating dicts from sequences; Default values; Valid dict key types; Set; List, Set, and Dict Comprehensions; Nested list comprehensions; Functions; Namespaces, Scope, and Local Functions; Returning Multiple Values; Functions Are Objects; Anonymous (lambda) Functions; Closures: Functions that Return Functions; Extended Call Syntax with *args, **kwargs; Currying: Partial Argument Application; Generators; Generator expressions; itertools module; Files and the operating system
Index
About the Author
Colophon
Copyright

Content preview from Python for Data Analysis

Chapter 5. Getting Started with pandas

pandas will be the primary library of interest throughout much of the rest of the book. It contains high-level data structures and manipulation tools designed to make data analysis fast and easy in Python. pandas is built on top of NumPy and makes it easy to use in NumPy-centric applications.

As a bit of background, I started building pandas in early 2008 during my tenure at AQR, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well-addressed by any single tool at my disposal:

Data structures with labeled axes supporting automatic or explicit data alignment. This prevents common errors resulting from misaligned data and working with differently-indexed data coming from different sources.
Integrated time series functionality.
The same data structures handle both time series data and non-time series data.
Arithmetic operations and reductions (like summing across an axis) would pass on the metadata (axis labels).
Flexible handling of missing data.
Merge and other relational operations found in popular databases (SQL-based, for example).
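
The requirements listed above can be illustrated with a minimal sketch. It is not taken from the book: the labels, values, and variable names are hypothetical, and it assumes a reasonably recent pandas installation.

```python
import pandas as pd

# Hypothetical data: two Series whose labels overlap only partially
# and appear in different orders.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Labeled axes with automatic alignment: values are matched by label,
# not by position, and the result carries the axis labels along.
total = a + b

# Flexible handling of missing data: labels present on only one side
# become NaN, which can then be dropped or avoided with a fill value.
complete = total.dropna()
filled = a.add(b, fill_value=0.0)

# The same Series type also accepts a date-based index (time series).
ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range("2012-01-01", periods=3, freq="D"))

# Database-style merge on a shared key column (hypothetical frames).
left = pd.DataFrame({"key": ["x", "y", "z"], "a": [1.0, 2.0, 3.0]})
right = pd.DataFrame({"key": ["z", "y", "w"], "b": [10.0, 20.0, 30.0]})
merged = pd.merge(left, right, on="key", how="inner")

print(total)
print(filled)
print(merged)
```

In this sketch only the labels "y" and "z" exist in both Series, so they are the only ones that survive `dropna()` and the only rows kept by the inner merge.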

I wanted to be able to do all of these things in one place, preferably in a language well-suited to general purpose software development. Python was a good candidate language for this, but at that time there was not an integrated set of data structures and tools providing this functionality.

Over the last four years, pandas has matured into a quite large ...


Publisher Resources

ISBN: 9781449323592

