Foundational data science with R
Published by O'Reilly Media, Inc.
Mastering basic statistical analysis, summary, and visualization using R
This course provides a firm foundation in the fundamentals of data science using R, with a focus on key statistical methods, exploratory data analysis, and visualization.
Before worrying about advanced analytics and neural nets, it is important to master the core skills. While this is certainly not a mathematical course, we won’t shy away from giving insight into the underlying mathematical theory. This invaluable online course will give you a solid grounding in the fundamental data science skills you need.
What you’ll learn and how you can apply it
At the end of this live, online training, you’ll understand:
- How to summarize data sets with key statistics
- Which statistics are optimal for large data sets
- The trade-offs between different summary measures
- The importance of color, transparency, and shape in data visualizations
- Mathematical distributions, and how they relate to “real” data
- How key algorithms work
And you’ll be able to:
- Summarize data sets
- Graphically describe data
- Compare groups of data using principled statistical techniques
- Describe relationships among data sets with correlation and regression models
- Use insight to predict future values
This live event is for you because...
You are a:
- Programmer who is interested in data science but has little or no statistical or mathematical background.
- Manager who wants to summarize data sets.
- Someone who uses data, but doesn’t have the necessary training to analyze and summarize it.
Prerequisites
No experience with R is necessary, but participants are expected to understand basic programming in another language, e.g. Python, MATLAB, C, or Java. The course will be taught using R, but the focus is on the methods rather than on programming.
Schedule
The time frames are only estimates and may vary according to how the class is progressing.
DAY 1
Introduction and course overview (20 minutes)
- Introduction
- Course overview
Condensing data with numerical summaries (90 minutes)
Measures of location
- Mean, median, mode
- Example
- Exercise / Q&A (25 minutes)
Measures of spread
- Variance, standard deviation, quartiles, range
- Example
- Exercise / Q&A (25 minutes)
Streaming data
- Mean vs median
- Variance vs quartiles
- Example
- Exercise / Q&A (10 minutes)
Break (5 min)
What, why, and how of visualization (90 minutes)
Scatter plot
- Colors
- Number of points: should you summarize?
- Transparency
- Log scales
- Examples
- Exercise/Q&A (25 minutes)
Histogram
- How to determine the number of bins
- Examples
Barplot
- Ordinal data
- Examples
Boxplot
- Great for comparison
- Examples
- Exercises/Q&A (25 minutes)
Wrap up
DAY 2
The normal distribution: what’s the point? (30 minutes)
- Where does the normal distribution come from?
- Shape: the famous bell-shaped curve
- Key parameters
- The 2 standard deviations rule
- Scaling data: (data - mean) / sd
- Example
- Exercise/Q&A (10 minutes)
Break (5 min)
How to compare groups (60 minutes)
- The t-test
- The t-distribution
- Assumptions: normality, independence
- Example: OkCupid data. Are the heights of the “daters” different from those of the standard population?
- The central limit theorem (basically, don’t worry about normality too much if your data set is big enough)
- Confidence intervals:
- Standard error vs standard deviation
- Example
- Exercise/Q&A
Break (5 min)
Capturing relationships with linear regression (90 minutes)
- Correlation: linear relationship between two variables
- Examples
- Exercise/Q&A
- Simple linear regression
- Assumptions
- Residuals: observed - expected
- Examples
- Exercise/Q&A
Wrap-up (5 min)
Your Instructor
Colin Gillespie
Colin Gillespie is a Senior Lecturer in Statistics at Newcastle University, UK, and the co-author of Efficient R Programming, published by O’Reilly. His research interests are high-performance statistical computing and Bayesian statistics. He is regularly employed as a consultant by Jumping Rivers and has been teaching R since 2005 at a variety of levels, ranging from beginner to advanced programming.