Beginning Machine Learning with scikit-learn
Understanding the fundamental concepts of Machine Learning
“Machine learning” is simply what we call the algorithmic extraction of knowledge from data. The ability to perform complex analysis of data, going beyond the basic tools of statistics, has been steadily refined over the last two decades. Over a similar period, Python has grown to be the premier language for data science, and scikit-learn has grown to be the main toolkit used within Python for general-purpose machine learning.
This course introduces a range of fundamental concepts and techniques used throughout machine learning, using scikit-learn as the concrete library and API in which these are illustrated. This first course focuses most heavily on what is called “supervised machine learning” but also introduces a number of concepts required to understand unsupervised learning.
What you'll learn and how you can apply it
- Data cleanup and examining data's “shape”
- Classification vs. Regression vs. Clustering
- The scikit-learn model APIs
- Evaluation and scoring of models
This training course is for you because...
- You are an aspiring or beginning data scientist.
- You have a comfortable intermediate-level knowledge of Python and a very basic familiarity with statistics and linear algebra.
- You are a working programmer or student who is motivated to expand your skills to include machine learning with Python.
Prerequisites
- A first course in Python and/or working experience as a programmer
- College-level basic mathematics
Students should have a system with Jupyter notebooks, a recent version of scikit-learn, pandas, NumPy, matplotlib, and the rest of the general scientific Python tool stack installed.
Before attending this course, please configure the environment you will need. Within the repository, use the file requirements.txt to install software with pip, or the file environment.yml to install it with conda.
This training material is available under a CC BY-NC-SA 4.0 license. You can find it at: https://github.com/DavidMertz/ML-Live-Beginner
- (video) Python Programming Language LiveLessons by David Beazley
- (video) Modern Python LiveLessons: Big Ideas and Little Code in Python by Raymond Hettinger
- (video) Using Jupyter Notebooks for Data Science Analysis in Python LiveLessons by Jamie Whitacre
- (video) Pandas Data Analysis with Python Fundamentals by Daniel Y. Chen
- (Live Online Training) Intermediate Machine Learning with scikit-learn by David Mertz - dates vary; search Safari to register
- (book) Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron
- (book) Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido
About your instructor
David Mertz is a data scientist, trainer, and erstwhile startup CTO, who is currently writing the Addison-Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped create the world's fastest (and highly specialized) supercomputer for performing molecular dynamics.
The timeframes are only estimates and may vary according to how the class is progressing.
Lesson 1: What is Machine Learning? (1 hour)
1.1 Difference between "Deep Learning" and other ML techniques
1.2 Overview of techniques used in Machine Learning
1.2.1 Classification
1.2.2 Regression
1.2.3 Clustering
1.2.4 Dimensionality Reduction
1.2.5 Feature Engineering
1.2.6 Feature Selection
1.2.7 Categorical vs. Ordinal vs. Continuous variables
1.2.8 One-hot encoding
1.2.9 Hyperparameters
1.2.10 Grid Search
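As a rough illustration of the variable types and one-hot encoding listed above, here is a minimal sketch using pandas. It is not taken from the course materials; the toy data, the column names, and the `size_order` mapping are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "color" is categorical (no order),
# "size" is ordinal (ordered labels), "weight" is continuous.
df = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["small", "large", "medium", "small"],
    "weight": [1.2, 3.4, 2.2, 0.9],
})

# One-hot encode the categorical column: one indicator column per category.
encoded = pd.get_dummies(df, columns=["color"])

# Map the ordinal column to integers that preserve its order.
size_order = {"small": 0, "medium": 1, "large": 2}
encoded["size"] = encoded["size"].map(size_order)

print(encoded)
```

The categorical column becomes several indicator columns because its values have no natural order, while the ordinal column can reasonably stay a single numeric column.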
Lesson 2: Exploring a data set (30 minutes)
2.1 Looking for anomalies and data integrity problems
2.2 Cleaning data
2.3 Massaging data format to be model-ready
2.4 Choosing features and a target
2.5 Train/test split
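For orientation, choosing features and a target and then making a train/test split might look like the sketch below. It assumes the iris sample data set bundled with scikit-learn, which may not be the data set used in class; the 25% test size and the `random_state` value are arbitrary choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# X holds the features (one row per observation); y holds the target labels.
X, y = load_iris(return_X_y=True)

# Hold out 25% of the rows for testing. stratify=y keeps the class
# proportions similar in both parts; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print("train:", X_train.shape, "test:", X_test.shape)
```

The model is fit only on the training part, so the test part gives an estimate of how it behaves on data it has not seen.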
Lesson 3: Classification (30 minutes)
3.1 Choosing a model
3.2 Feature importances
3.3 Cut points in a decision tree
3.4 Comparing multiple classifiers
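The topics in this lesson can be sketched roughly as follows. This is an illustrative example rather than the course notebook: the iris data set and the three classifiers are assumptions chosen because they ship with scikit-learn.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=0, stratify=data.target
)

# Every scikit-learn classifier shares the same fit/score interface,
# so comparing several models is a simple loop.
classifiers = {
    "decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
    "k-nearest neighbors": KNeighborsClassifier(n_neighbors=5),
    "logistic regression": LogisticRegression(max_iter=1000),
}
scores = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    scores[name] = clf.score(X_test, y_test)  # mean accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")

# Feature importances of the fitted tree (they sum to 1).
tree = classifiers["decision tree"]
for feature, importance in zip(data.feature_names, tree.feature_importances_):
    print(f"{feature}: {importance:.3f}")

# Text rendering of the tree, showing the cut point chosen at each split.
print(export_text(tree, feature_names=list(data.feature_names)))
```

Accuracy on a single small split is a noisy measure, so small differences between the classifiers here should not be over-interpreted.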
Lesson 4: Regression (30 minutes)
4.1 Sample data sets in scikit-learn
4.2 Linear regressors
4.3 Probabilistic regressors
4.4 Other regressors
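A regression workflow along these lines could look like the following sketch. The diabetes sample data set and the specific models are assumptions: `BayesianRidge` stands in here as one example of a probabilistic regressor, and `KNeighborsRegressor` as one example of the "other" family; the course may use different ones.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import BayesianRidge, LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# One of the sample data sets bundled with scikit-learn.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

regressors = {
    "linear regression": LinearRegression(),
    "Bayesian ridge": BayesianRidge(),
    "k-nearest neighbors": KNeighborsRegressor(n_neighbors=10),
}
scores = {}
for name, reg in regressors.items():
    reg.fit(X_train, y_train)
    scores[name] = reg.score(X_test, y_test)  # R^2 on held-out data
    print(f"{name}: R^2 = {scores[name]:.3f}")

# A probabilistic regressor can also report uncertainty for each prediction.
mean, std = regressors["Bayesian ridge"].predict(X_test[:3], return_std=True)
print("predicted mean:", mean)
print("predicted std: ", std)
```

For regressors, `score` returns the coefficient of determination rather than accuracy, so the numbers are not directly comparable with classifier scores.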
Lesson 5: Hyperparameters (30 minutes)
5.1 Understanding hyperparameters
5.2 Manual search of parameter space
5.4 Attributes of grid search and wrapped model
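To make the hyperparameter topics concrete, here is a minimal sketch of a manual search followed by a grid search. The choice of `KNeighborsClassifier`, the iris data, and the candidate values are assumptions for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Manual search: loop over one hyperparameter and compare
# the mean 5-fold cross-validated accuracy for each value.
manual_scores = {}
for k in (1, 3, 5, 7, 9):
    model = KNeighborsClassifier(n_neighbors=k)
    manual_scores[k] = cross_val_score(model, X, y, cv=5).mean()
best_k = max(manual_scores, key=manual_scores.get)
print("manual best n_neighbors:", best_k)

# Grid search: try every combination in param_grid with cross-validation.
param_grid = {
    "n_neighbors": [1, 3, 5, 7, 9],
    "weights": ["uniform", "distance"],
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

# Attributes of the fitted search object and the model it wraps.
print("best parameters:", search.best_params_)
print("best CV score:  ", search.best_score_)
print("best estimator: ", search.best_estimator_)
print("combinations tried:", len(search.cv_results_["params"]))
```

After fitting, the search object can be used like a model: by default it refits the best estimator on all the data passed to `fit`, and `search.predict` delegates to it.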