Improve your data quality using sampling distribution

Book description

Statistical methods are central to the techniques involved in capturing value from data. While many resources teach basic statistics, few approach the subject from a data science perspective. In this lesson, you’ll learn about key statistical concepts (sampling distribution and variability, bootstrapping, and confidence intervals) as they relate directly to data science.

What you’ll learn—and how you can apply it

You'll learn how the sampling distribution and sampling variability affect the results of statistical and machine learning models, and with them your data quality. You’ll also learn the bootstrap procedure, an easy and effective way to estimate the sampling distribution of a statistic or of model parameters. Finally, you'll discover how to use confidence intervals, which present an estimate as an interval range, to communicate the potential error in a sample estimate and to judge whether you need a larger sample of data.
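
As a rough illustration of the bootstrap and confidence-interval ideas named above, here is a minimal sketch of a percentile bootstrap interval for a sample mean. It is written in Python with only the standard library, although the lesson itself assumes R. The function name `bootstrap_ci`, its parameters, and the data values are hypothetical and made up for this example; they are not taken from the lesson.

```python
import random
import statistics


def bootstrap_ci(sample, stat=statistics.mean, n_resamples=2000,
                 level=0.90, seed=0):
    """Percentile bootstrap confidence interval for stat(sample)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Draw n values with replacement, recompute the statistic, repeat.
    # The sorted estimates approximate the statistic's sampling distribution.
    estimates = sorted(
        stat(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # Trim (1 - level) / 2 of the resampled estimates from each tail.
    tail = (1 - level) / 2
    lo_idx = round(tail * n_resamples)
    hi_idx = round((1 - tail) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical measurements, invented for illustration only.
sample = [12.1, 9.8, 11.4, 10.7, 13.2, 8.9, 10.1, 11.9, 12.6, 9.5]
lower, upper = bootstrap_ci(sample)
print(f"mean = {statistics.mean(sample):.2f}, "
      f"90% interval = ({lower:.2f}, {upper:.2f})")
```

A wide interval relative to the precision you need suggests that a larger sample may be worth collecting.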

This lesson is for you because:

You're a data scientist or analyst who wants beginner-level knowledge of key statistical concepts to improve your data models and data quality.

Prerequisites:

  • Basic familiarity with coding in R

Materials or downloads needed:

  • None

Product information

  • Title: Improve your data quality using sampling distribution
  • Author(s): Andrew Bruce, Peter Bruce
  • Release date: December 2016
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781491978320