Practical Statistics for Data Scientists, 2nd Edition

by Peter Bruce, Andrew Bruce, Peter Gedeck

May 2020

Beginner

360 pages

9h 16m

English

O'Reilly Media, Inc.

Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
Elements of Structured Data; Further Reading; Rectangular Data; Data Frames and Indexes; Nonrectangular Data Structures; Further Reading; Estimates of Location; Mean; Median and Robust Estimates; Example: Location Estimates of Population and Murder Rates; Further Reading; Estimates of Variability; Standard Deviation and Related Estimates; Estimates Based on Percentiles; Example: Variability Estimates of State Population; Further Reading; Exploring the Data Distribution; Percentiles and Boxplots; Frequency Tables and Histograms; Density Plots and Estimates; Further Reading; Exploring Binary and Categorical Data; Mode; Expected Value; Probability; Further Reading; Correlation; Scatterplots; Further Reading; Exploring Two or More Variables; Hexagonal Binning and Contours (Plotting Numeric Versus Numeric Data); Two Categorical Variables; Categorical and Numeric Data; Visualizing Multiple Variables; Further Reading; Summary
Random Sampling and Sample Bias; Bias; Random Selection; Size Versus Quality: When Does Size Matter?; Sample Mean Versus Population Mean; Further Reading; Selection Bias; Regression to the Mean; Further Reading; Sampling Distribution of a Statistic; Central Limit Theorem; Standard Error; Further Reading; The Bootstrap; Resampling Versus Bootstrapping; Further Reading; Confidence Intervals; Further Reading; Normal Distribution; Standard Normal and QQ-Plots; Long-Tailed Distributions; Further Reading; Student’s t-Distribution; Further Reading; Binomial Distribution; Further Reading; Chi-Square Distribution; Further Reading; F-Distribution; Further Reading; Poisson and Related Distributions; Poisson Distributions; Exponential Distribution; Estimating the Failure Rate; Weibull Distribution; Further Reading; Summary
A/B Testing; Why Have a Control Group?; Why Just A/B? Why Not C, D,…?; Further Reading; Hypothesis Tests; The Null Hypothesis; Alternative Hypothesis; One-Way Versus Two-Way Hypothesis Tests; Further Reading; Resampling; Permutation Test; Example: Web Stickiness; Exhaustive and Bootstrap Permutation Tests; Permutation Tests: The Bottom Line for Data Science; Further Reading; Statistical Significance and p-Values; p-Value; Alpha; Type 1 and Type 2 Errors; Data Science and p-Values; Further Reading; t-Tests; Further Reading; Multiple Testing; Further Reading; Degrees of Freedom; Further Reading; ANOVA; F-Statistic; Two-Way ANOVA; Further Reading; Chi-Square Test; Chi-Square Test: A Resampling Approach; Chi-Square Test: Statistical Theory; Fisher’s Exact Test; Relevance for Data Science; Further Reading; Multi-Arm Bandit Algorithm; Further Reading; Power and Sample Size; Sample Size; Further Reading; Summary
Simple Linear Regression; The Regression Equation; Fitted Values and Residuals; Least Squares; Prediction Versus Explanation (Profiling); Further Reading; Multiple Linear Regression; Example: King County Housing Data; Assessing the Model; Cross-Validation; Model Selection and Stepwise Regression; Weighted Regression; Further Reading; Prediction Using Regression; The Dangers of Extrapolation; Confidence and Prediction Intervals; Factor Variables in Regression; Dummy Variables Representation; Factor Variables with Many Levels; Ordered Factor Variables; Interpreting the Regression Equation; Correlated Predictors; Multicollinearity; Confounding Variables; Interactions and Main Effects; Regression Diagnostics; Outliers; Influential Values; Heteroskedasticity, Non-Normality, and Correlated Errors; Partial Residual Plots and Nonlinearity; Polynomial and Spline Regression; Polynomial; Splines; Generalized Additive Models; Further Reading; Summary
Naive Bayes; Why Exact Bayesian Classification Is Impractical; The Naive Solution; Numeric Predictor Variables; Further Reading; Discriminant Analysis; Covariance Matrix; Fisher’s Linear Discriminant; A Simple Example; Further Reading; Logistic Regression; Logistic Response Function and Logit; Logistic Regression and the GLM; Generalized Linear Models; Predicted Values from Logistic Regression; Interpreting the Coefficients and Odds Ratios; Linear and Logistic Regression: Similarities and Differences; Assessing the Model; Further Reading; Evaluating Classification Models; Confusion Matrix; The Rare Class Problem; Precision, Recall, and Specificity; ROC Curve; AUC; Lift; Further Reading; Strategies for Imbalanced Data; Undersampling; Oversampling and Up/Down Weighting; Data Generation; Cost-Based Classification; Exploring the Predictions; Further Reading; Summary
K-Nearest Neighbors; A Small Example: Predicting Loan Default; Distance Metrics; One Hot Encoder; Standardization (Normalization, z-Scores); Choosing K; KNN as a Feature Engine; Tree Models; A Simple Example; The Recursive Partitioning Algorithm; Measuring Homogeneity or Impurity; Stopping the Tree from Growing; Predicting a Continuous Value; How Trees Are Used; Further Reading; Bagging and the Random Forest; Bagging; Random Forest; Variable Importance; Hyperparameters; Boosting; The Boosting Algorithm; XGBoost; Regularization: Avoiding Overfitting; Hyperparameters and Cross-Validation; Summary
Principal Components Analysis; A Simple Example; Computing the Principal Components; Interpreting Principal Components; Correspondence Analysis; Further Reading; K-Means Clustering; A Simple Example; K-Means Algorithm; Interpreting the Clusters; Selecting the Number of Clusters; Hierarchical Clustering; A Simple Example; The Dendrogram; The Agglomerative Algorithm; Measures of Dissimilarity; Model-Based Clustering; Multivariate Normal Distribution; Mixtures of Normals; Selecting the Number of Clusters; Further Reading; Scaling and Categorical Variables; Scaling the Variables; Dominant Variables; Categorical Data and Gower’s Distance; Problems with Clustering Mixed Data; Summary

Content preview from Practical Statistics for Data Scientists, 2nd Edition

Chapter 2. Data and Sampling Distributions

A popular misconception holds that the era of big data means the end of a need for sampling. In fact, the proliferation of data of varying quality and relevance reinforces the need for sampling as a tool to work efficiently with a variety of data and to minimize bias. Even in a big data project, predictive models are typically developed and piloted with samples. Samples are also used in tests of various sorts (e.g., comparing the effect of web page designs on clicks).

Figure 2-1 shows a schematic that underpins the concepts we will discuss in this chapter—data and sampling distributions. The lefthand side represents a population that, in statistics, is assumed to follow an underlying but unknown distribution. All that is available is the sample data and its empirical distribution, shown on the righthand side. To get from the lefthand side to the righthand side, a sampling procedure is used (represented by an arrow). Traditional statistics focused very much on the lefthand side, using theory based on strong assumptions about the population. Modern statistics has moved to the righthand side, where such assumptions are not needed.
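
The relationship between the two sides of the schematic can be illustrated with a small simulation. The following is a minimal sketch, not code from the book: the population values, the exponential distribution used to generate them, the sample size, and the helper name `draw_sample` are all assumptions chosen for illustration. In practice the population distribution is unknown, and only the sample on the righthand side is observed.

```python
import random
import statistics


def draw_sample(population, n, seed=0):
    """Draw a simple random sample of size n without replacement.

    Hypothetical helper standing in for the sampling procedure
    (the arrow in the schematic).
    """
    rng = random.Random(seed)
    return rng.sample(population, n)


# Lefthand side: a hypothetical population. We generate it from a
# right-skewed distribution only so there is something to sample from;
# a real analysis would not know this distribution.
population_rng = random.Random(42)
population = [population_rng.expovariate(1 / 50) for _ in range(100_000)]

# Righthand side: the sample data, which is all an analyst gets to see.
sample = draw_sample(population, n=1_000, seed=1)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

print(f"population mean (normally unknown): {population_mean:.2f}")
print(f"sample mean (observed):             {sample_mean:.2f}")
```

The sample mean typically lands close to the population mean without any assumption about the population's shape, which is the sense in which working from the sample side requires fewer assumptions.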

In general, data scientists need not worry about the theoretical nature of the lefthand side and instead should focus on the sampling procedures and the data at hand. There are some notable exceptions. Sometimes data is generated from a physical process that can be modeled. The simplest example is flipping a coin: ...

ISBN: 9781492072935
