Statistics for Data Science and Analytics

ISBN: 9781394253807

by Peter C. Bruce, Peter Gedeck, Janet Dobbins

September 2024

Intermediate to advanced

384 pages

9h 30m

English

Wiley

Cover
Table of Contents
Title Page
Copyright
Dedication
About the Authors
Acknowledgments
About the Companion Website
Introduction
Statistics and Data Science
Accompanying Web Resources
Python
Using Python with this Book
1 Statistics and Data Science
1.1 Big Data: Predicting Pregnancy
1.2 Phantom Protection from Vitamin E
1.3 Statistician, Heal Thyself
1.4 Identifying Terrorists in Airports
1.5 Looking Ahead
1.6 Big Data and Statisticians

2 Designing and Carrying Out a Statistical Study
2.1 Statistical Science
2.2 Big Data
2.3 Data Science
2.4 Example: Hospital Errors
2.5 Experiment
2.6 Designing an Experiment
2.7 The Data
2.8 Variables and Their Flavors
2.9 Python: Data Structures and Operations
2.10 Are We Sure We Made a Difference?
2.11 Is Chance Responsible? The Foundation of Hypothesis Testing
2.12 Probability
2.13 Significance or Alpha Level
2.14 Other Kinds of Studies
2.15 When to Use Hypothesis Tests
2.16 Experiments Falling Short of the Gold Standard
2.17 Summary
2.18 Python: Iterations and Conditional Execution
2.19 Python: Numpy, scipy, and pandas—The Workhorses of Data Science
Exercises
Notes
3 Exploring and Displaying the Data
3.1 Exploratory Data Analysis
3.2 What to Measure—Central Location
3.3 What to Measure—Variability
3.4 What to Measure—Distance (Nearness)
3.5 Test Statistic
3.6 Examining and Displaying the Data
3.7 Python: Exploratory Data Analysis/Data Visualization
Exercises
Notes
4 Accounting for Chance—Statistical Inference
4.1 Avoid Being Fooled by Chance
4.2 The Null Hypothesis
4.3 Repeating the Experiment
4.4 Statistical Significance
4.5 Power
4.6 The Normal Distribution
4.7 Summary
4.8 Python: Random Numbers
Exercises
Notes
5 Probability
5.1 What Is Probability
5.2 Simple Probability
5.3 Probability Distributions
5.4 From Binomial to Normal Distribution
5.5 Appendix: Binomial Formula and Normal Approximation
5.6 Python: Probability
Exercises
6 Categorical Variables
6.1 Two-way Tables
6.2 Conditional Probability
6.3 Bayesian Estimates
6.4 Independence
6.5 Multiplication Rule
6.6 Simpson’s Paradox
6.7 Python: Counting and Contingency Tables
Exercises
Notes
7 Surveys and Sampling
7.1 Literary Digest—Sampling Trumps “All Data”
7.2 Simple Random Samples
7.3 Margin of Error: Sampling Distribution for a Proportion
7.4 Sampling Distribution for a Mean
7.5 The Bootstrap
7.6 Rationale for the Bootstrap
7.7 Standard Error
7.8 Other Sampling Methods
7.9 Absolute vs. Relative Sample Size
7.10 Python: Random Sampling Strategies
Exercises
Notes
8 More than Two Samples or Categories
8.1 Count Data—R × C Tables
8.2 The Role of Experiments (Many Are Costly)
8.3 Chi-Square Test
8.4 Single Sample—Goodness-of-Fit
8.5 Numeric Data: ANOVA
8.6 Components of Variance
8.7 Factorial Design
8.8 The Problem of Multiple Inference
8.9 Continuous Testing
8.10 Bandit Algorithms
8.11 Appendix: ANOVA, the Factor Diagram, and the F-Statistic
8.12 More than One Factor or Variable—From ANOVA to Statistical Models
8.13 Python: Contingency Tables and Chi-square Test
8.14 Python: ANOVA
Exercises
Notes
9 Correlation
9.1 Example: Delta Wire
9.2 Example: Cotton Dust and Lung Disease
9.3 The Vector Product Sum Test
9.4 Correlation Coefficient
9.5 Correlation is not Causation
9.6 Other Forms of Association
9.7 Python: Correlation
Exercises
Notes
10 Regression
10.1 Finding the Regression Line by Eye
10.2 Finding the Regression Line by Minimizing Residuals
10.3 Linear Relationships
10.4 Prediction vs. Explanation
10.5 Python: Linear Regression
Exercises
Note
11 Multiple Linear Regression
11.1 Terminology
11.2 Example—Housing Prices
11.3 Interaction
11.4 Regression Assumptions
11.5 Assessing Explanatory Regression Models
11.6 Assessing Regression for Prediction
11.7 Python: Multiple Linear Regression
Exercises
Note
12 Predicting Binary Outcomes
12.1 K-Nearest-Neighbors
12.2 Python: Classification
12.3 Exercises
Note
Index
End User License Agreement

Content preview from Statistics for Data Science and Analytics

Introduction

Statistics and Data Science

As of the writing of this book, the fields of statistics and data science are evolving rapidly to meet the changing needs of business, government, and research organizations. It is an oversimplification, but still useful, to think of two distinct communities as you proceed:

The traditional academic and medical research communities that typically conduct extended research projects adhering to rigorous regulatory or publication standards, and
Businesses and large organizations that use statistical methods to extract value from their data, often on the fly. Reliability and value are more important than academic rigor to this data science community.

Most users of statistical methods now fall in the second category, as those methods are a basic component of what is now called artificial intelligence (AI). However, most of the specific techniques, as well as the language of statistics, had their origin in the first group. As a result, there is a certain amount of “baggage” that is not truly relevant to the data science community. That baggage can sometimes be obscure or confusing and, in this book, we provide guidance on what is or is not important to data science. Another feature of this book is the use of resampling/simulation methods to develop the underpinnings of statistical inference (the most difficult topic in an introductory course) in a transparent and understandable fashion.
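As a rough illustration of the resampling idea mentioned above, the following minimal sketch estimates the standard error of a sample mean with the bootstrap. It is not taken from the book: the function name `bootstrap_means`, the sample values, the number of resamples, and the seed are all assumptions chosen for illustration, and only the Python standard library is used.

```python
import random
import statistics


def bootstrap_means(sample, n_resamples=1000, seed=0):
    """Return the means of n_resamples bootstrap resamples of `sample`.

    Hypothetical helper for illustration; each resample draws
    len(sample) values from `sample` with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(sample, k=n)  # sampling with replacement
        means.append(statistics.mean(resample))
    return means


# Made-up measurements, used only to exercise the function.
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7]

means = bootstrap_means(sample)

# The spread of the resampled means approximates the standard error
# of the sample mean, with no formula required.
std_error = statistics.stdev(means)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"bootstrap standard error: {std_error:.2f}")
```

The appeal of this approach is that the variability of the estimate is observed directly by repeating the sampling process, rather than derived from a distributional formula.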

We start off with some examples of statistics in action ...
