Learning Data Science

by Sam Lau, Joseph Gonzalez, Deborah Nolan

September 2023

Beginner

596 pages

15h 31m

English

O'Reilly Media, Inc.

Expected Background Knowledge; Organization of the Book; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
The Stages of the Lifecycle; Examples of the Lifecycle; Summary
Big Data and New Opportunities; Example: Google Flu Trends; Target Population, Access Frame, and Sample; Example: What Makes Members of an Online Community Active?; Example: Who Will Win the Election?; Example: How Do Environmental Hazards Relate to an Individual’s Health?; Instruments and Protocols; Measuring Natural Phenomena; Example: What Is the Level of CO2 in the Air?; Accuracy; Types of Bias; Types of Variation; Summary
The Urn Model; Sampling Designs; Sampling Distribution of a Statistic; Simulating the Sampling Distribution; Simulation with the Hypergeometric Distribution; Example: Simulating Election Poll Bias and Variance; The Pennsylvania Urn Model; An Urn Model with Bias; Conducting Larger Polls; Example: Simulating a Randomized Trial for a Vaccine; Scope; The Urn Model for Random Assignment; Example: Measuring Air Quality; Summary
The Constant Model; Minimizing Loss; Mean Absolute Error; Mean Squared Error; Choosing Loss Functions; Summary
Question and Scope; Data Wrangling; Exploring Bus Times; Modeling Wait Times; Summary
Subsetting; Data Scope and Question; Dataframes and Indices; Slicing; Filtering Rows; Example: How Recently Has Luna Become a Popular Name?; Aggregating; Basic Group-Aggregate; Grouping on Multiple Columns; Custom Aggregation Functions; Pivoting; Joining; Inner Joins; Left, Right, and Outer Joins; Example: Popularity of NYT Name Categories; Transforming; Apply; Example: Popularity of “L” Names; The Price of Apply; How Are Dataframes Different from Other Data Representations?; Dataframes and Spreadsheets; Dataframes and Matrices; Dataframes and Relations; Summary
Subsetting; SQL Basics: SELECT and FROM; What’s a Relation?; Slicing; Filtering Rows; Example: How Recently Has Luna Become a Popular Name?; Aggregating; Basic Group-Aggregate Using GROUP BY; Grouping on Multiple Columns; Other Aggregation Functions; Joining; Inner Joins; Left and Right Joins; Example: Popularity of NYT Name Categories; Transforming and Common Table Expressions; SQL Functions; Multistep Queries Using a WITH Clause; Example: Popularity of “L” Names; Summary

Data Source Examples; Drug Abuse Warning Network (DAWN) Survey; San Francisco Restaurant Food Safety; File Formats; Delimited Format; Fixed-Width Format; Hierarchical Formats; Loosely Formatted Text; File Encoding; File Size; The Shell and Command-Line Tools; Table Shape and Granularity; Granularity of Restaurant Inspections and Violations; DAWN Survey Shape and Granularity; Summary
Example: Wrangling CO2 Measurements from the Mauna Loa Observatory; Quality Checks; Addressing Missing Data; Reshaping the Data Table; Quality Checks; Quality Based on Scope; Quality of Measurements and Recorded Values; Quality Across Related Features; Quality for Analysis; Fixing the Data or Not; Missing Values and Records; Transformations and Timestamps; Transforming Timestamps; Piping for Transformations; Modifying Structure; Example: Wrangling Restaurant Safety Violations; Narrowing the Focus; Aggregating Violations; Extracting Information from Violation Descriptions; Summary
Feature Types; Example: Dog Breeds; Transforming Qualitative Features; The Importance of Feature Types; What to Look For in a Distribution; What to Look For in a Relationship; Two Quantitative Features; One Qualitative and One Quantitative Variable; Two Qualitative Features; Comparisons in Multivariate Settings; Guidelines for Exploration; Example: Sale Prices for Houses; Understanding Price; What Next?; Examining Other Features; Delving Deeper into Relationships; Fixing Location; EDA Discoveries; Summary
Choosing Scale to Reveal Structure; Filling the Data Region; Including Zero; Revealing Shape Through Transformations; Banking to Decipher Relationships; Revealing Relationships Through Straightening; Smoothing and Aggregating Data; Smoothing Techniques to Uncover Shape; Smoothing Techniques to Uncover Relationships and Trends; Smoothing Techniques Need Tuning; Reducing Distributions to Quantiles; When Not to Smooth; Facilitating Meaningful Comparisons; Emphasize the Important Difference; Ordering Groups; Avoid Stacking; Selecting a Color Palette; Guidelines for Comparisons in Plots; Incorporating the Data Design; Data Collected Over Time; Observational Studies; Unequal Sampling; Geographic Data; Adding Context; Example: 100m Sprint Times; Creating Plots Using plotly; Figure and Trace Objects; Modifying Layout; Plotting Functions; Annotations; Other Tools for Visualization; matplotlib; Grammar of Graphics; Summary
Question, Design, and Scope; Finding Collocated Sensors; Wrangling the List of AQS Sites; Wrangling the List of PurpleAir Sites; Matching AQS and PurpleAir Sensors; Wrangling and Cleaning AQS Sensor Data; Checking Granularity; Removing Unneeded Columns; Checking the Validity of Dates; Checking the Quality of PM2.5 Measurements; Wrangling PurpleAir Sensor Data; Checking the Granularity; Handling Missing Values; Exploring PurpleAir and AQS Measurements; Creating a Model to Correct PurpleAir Measurements; Summary
Examples of Text and Tasks; Convert Text into a Standard Format; Extract a Piece of Text to Create a Feature; Transform Text into Features; Text Analysis; String Manipulation; Converting Text to a Standard Format with Python String Methods; String Methods in pandas; Splitting Strings to Extract Pieces of Text; Regular Expressions; Concatenation of Literals; Quantifiers; Alternation and Grouping to Create Features; Reference Tables; Text Analysis; Summary
NetCDF Data; JSON Data; HTTP; REST; XML, HTML, and XPath; Example: Scraping Race Times from Wikipedia; XPath; Example: Accessing Exchange Rates from the ECB; Summary
Simple Linear Model; Example: A Simple Linear Model for Air Quality; Interpreting Linear Models; Assessing the Fit; Fitting the Simple Linear Model; Multiple Linear Model; Fitting the Multiple Linear Model; Example: Where Is the Land of Opportunity?; Explaining Upward Mobility Using Commute Time; Relating Upward Mobility Using Multiple Variables; Feature Engineering for Numeric Measurements; Feature Engineering for Categorical Measurements; Summary
Overfitting; Example: Energy Consumption; Train-Test Split; Cross-Validation; Regularization; Model Bias and Variance; Summary
Distributions: Population, Empirical, Sampling; Basics of Hypothesis Testing; Example: A Rank Test to Compare Productivity of Wikipedia Contributors; Example: A Test of Proportions for Vaccine Efficacy; Bootstrapping for Inference; Basics of Confidence Intervals; Basics of Prediction Intervals; Example: Predicting Bus Lateness; Example: Predicting Crab Size; Example: Predicting the Incremental Growth of a Crab; Probability for Inference and Prediction; Formalizing the Theory for Average Rank Statistics; General Properties of Random Variables; Probability Behind Testing and Intervals; Probability Behind Model Selection; Summary
Donkey Study Question and Scope; Wrangling and Transforming; Exploring; Modeling a Donkey’s Weight; A Loss Function for Prescribing Anesthetics; Fitting a Simple Linear Model; Fitting a Multiple Linear Model; Bringing Qualitative Features into the Model; Model Assessment; Summary
Example: Wind-Damaged Trees; Modeling and Classification; A Constant Model; Examining the Relationship Between Size and Windthrow; Modeling Proportions (and Probabilities); A Logistic Model; Log Odds; Using a Logistic Curve; A Loss Function for the Logistic Model; From Probabilities to Classification; The Confusion Matrix; Precision Versus Recall; Summary
Gradient Descent Basics; Minimizing Huber Loss; Convex and Differentiable Loss Functions; Variants of Gradient Descent; Stochastic Gradient Descent; Mini-Batch Gradient Descent; Newton’s Method; Summary
Question and Scope; Obtaining and Wrangling the Data; Exploring the Data; Exploring the Publishers; Exploring Publication Date; Exploring Words in Articles; Modeling; A Single-Word Model; Multiple-Word Model; Predicting with the tf-idf Transform; Summary

Content preview from Learning Data Science

Chapter 1. The Data Science Lifecycle

Data science is a rapidly evolving field. At the time of this writing, people are still trying to pin down exactly what data science is, what data scientists do, and what skills data scientists should have. What we do know, though, is that data science uses a combination of methods and principles from statistics and computer science to work with and draw insights from data. And learning computer science and statistics in combination makes us better data scientists. We also know that any insights we glean need to be interpreted in the context of the problem that we are working on.

This book covers fundamental principles and skills that data scientists need to help make all sorts of important decisions. With both technical skills and conceptual understanding, we can work on data-centric problems to, say, assess whether a vaccine works, filter out fake news automatically, calibrate air quality sensors, and advise analysts on policy changes.

To help you keep track of the bigger picture, we’ve organized topics around a workflow that we call the data science lifecycle. In this chapter, we introduce this lifecycle. Unlike other data science books, which tend to focus on one part of the lifecycle or address only computational or statistical topics, we cover the entire cycle from start to finish and consider both statistical and computational aspects together.