Intermediate Machine Learning with scikit-learn
Using scikit-learn effectively and performantly
‘Machine learning’ is simply what we call the algorithmic extraction of knowledge from data. The ability to perform complex analysis of data, moving beyond the basic tools of statistics, has been steadily refined over the last two decades. Over a similar period, Python has grown to be the premier language for data science, and scikit-learn has become the main toolkit used within Python for general-purpose machine learning.
This course moves beyond the topics covered in Beginning Machine Learning with scikit-learn. A recap of a few essential concepts is given for students starting here. We first discuss unsupervised machine learning techniques, then look at the data preparation and “massaging” that is always needed for robust models. Finally, we address best practices for the robust and generalizable modeling techniques needed for real-world data science.
What you'll learn and how you can apply it
- Recap: Classification vs. Regression vs. Clustering
- Unsupervised machine learning
- Feature engineering and feature selection
- Better train/test splits
This training course is for you because...
- You are an aspiring or beginning data scientist.
- You have a comfortable intermediate-level knowledge of Python and a very basic familiarity with statistics and linear algebra.
- You are a working programmer or student who is motivated to expand your skills to include machine learning with Python.
- You have some familiarity with the fundamentals of machine learning or have taken the Beginning Machine Learning with scikit-learn live training class.
- A first course in Python and/or working experience as a programmer
- College level basic mathematics
- Recommended: Attend or view Beginning Machine Learning with scikit-learn
- IMPORTANT: Complete the course setup instructions at the GitHub repo indicated below.
- Before class begins: Follow the setup instructions at: https://github.com/DavidMertz/ML-Webinar.
- Students should have a system with Jupyter notebooks installed, a recent version of scikit-learn, along with Pandas, NumPy, and matplotlib, and the general scientific Python tool stack.
These resources are optional, but helpful if you need a refresher on Python, Jupyter Notebooks, or Pandas:
- (Live Online Training) Beginning Machine Learning with scikit-learn by David Mertz
- (video) Python Programming Language LiveLessons by David Beazley
- (video) Modern Python LiveLessons: Big Ideas and Little Code in Python by Raymond Hettinger
- (video) Using Jupyter Notebooks for Data Science Analysis in Python LiveLessons by Jamie Whitacre
- (video) Pandas Data Analysis with Python Fundamentals by Daniel Y. Chen
- (book) Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems by Aurélien Géron
- (book) Introduction to Machine Learning with Python by Sarah Guido, Andreas C. Müller
About your instructor
David Mertz is a data scientist, trainer, and erstwhile startup CTO, who is currently writing the Addison Wesley title Cleaning Data for Successful Data Science: Doing the other 80% of the work. He created the training program for Anaconda, Inc. He was a Director of the Python Software Foundation for six years and remains chair of a few PSF committees. For nine years, David helped with creating the world's fastest—highly-specialized—supercomputer for performing molecular dynamics.
The timeframes are only estimates and may vary according to how the class is progressing.
Lesson 1: Recap: What is Machine Learning? (30 minutes)
1.1 Overview of techniques used in Machine Learning - 1.1.1 Classification, Regression, Clustering - 1.1.2 Dimensionality Reduction, Feature Engineering, Feature Selection
1.2.3 Categorical vs. Ordinal vs. Continuous variables
1.2.4 Results of Classification and Regression in earlier session
1.2.5 Metrics [BREAK]
Lesson 2: Clustering (45 minutes)
2.1 Overview of (some) clustering algorithms - 2.1.1 Kmeans - 2.1.2 Agglomerative - 2.1.3 Density based clustering - 2.1.3.1 DBScan - 2.1.3.2 HDBScan
2.2 n_clusters, labels, and predictions
2.3 Visualizing results [BREAK]
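As a rough preview of the ideas in Lesson 2, the sketch below contrasts three of the clusterers named above on synthetic data. It is not taken from the course materials: the blob centers, `eps`, `min_samples`, and the variable names are illustrative assumptions chosen so the clusters are easy to separate.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): three well-separated blobs.
X, _ = make_blobs(
    n_samples=150,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)

# KMeans needs n_clusters up front. After fitting, labels_ holds the cluster
# assigned to each training point, and predict() assigns new points to the
# nearest learned centroid.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
new_points = np.array([[-5.0, -5.0], [5.0, -5.0]])
kmeans_new = kmeans.predict(new_points)

# AgglomerativeClustering also takes n_clusters, but it has no predict():
# it only labels the data it was fitted on.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# DBSCAN takes no n_clusters; the number of clusters follows from density.
# Points labelled -1 are treated as noise rather than cluster members.
dbscan_labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
n_dbscan_clusters = len(set(dbscan_labels) - {-1})

print("KMeans clusters:", len(set(kmeans.labels_)))
print("Agglomerative clusters:", len(set(agglo_labels)))
print("DBSCAN clusters (excluding noise):", n_dbscan_clusters)
```

The practical difference to notice is which estimators can label unseen data (`predict`) and which only describe the data they were fitted on.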
Lesson 3: Feature engineering and feature selection (45 minutes)
3.1 Dimensionality reduction - 3.1.1 Principal Component Analysis (PCA) - 3.1.2 Non-Negative Matrix Factorization (NMF) - 3.1.3 Latent Dirichlet Allocation (LDA) - 3.1.4 Independent component analysis (ICA) - 3.1.5 SelectKBest
3.2 Dimensionality expansion - 3.2.1 Polynomial Features - 3.2.2 One-Hot Encoding
3.3 Scaling - 3.3.1 StandardScaler, RobustScaler, MinMaxScaler, Normalizer - 3.3.2 Quantiles, binarize [BREAK]
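The three themes of Lesson 3 (reduction, expansion, scaling) can be sketched in a few lines. This is only a hedged illustration, not the course notebook: the iris dataset, the choice of `k=2` and `n_components=2`, and the toy `colors` column are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.preprocessing import (
    MinMaxScaler,
    OneHotEncoder,
    PolynomialFeatures,
    StandardScaler,
)

X, y = load_iris(return_X_y=True)  # 150 samples, 4 numeric features

# Dimensionality reduction: PCA builds 2 new synthetic axes, while
# SelectKBest keeps the 2 original columns that score best against y.
X_pca = PCA(n_components=2).fit_transform(X)
X_best = SelectKBest(score_func=f_classif, k=2).fit_transform(X, y)

# Dimensionality expansion: degree-2 polynomial terms of 4 features give
# 1 bias + 4 linear + 10 quadratic/interaction columns = 15 columns.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)

# One-hot encoding turns one categorical column into one column per category.
colors = np.array([["red"], ["green"], ["blue"], ["green"]])
onehot = OneHotEncoder().fit_transform(colors).toarray()

# Scaling: StandardScaler gives each column zero mean and unit variance;
# MinMaxScaler maps each column onto the range [0, 1].
X_std = StandardScaler().fit_transform(X)
X_minmax = MinMaxScaler().fit_transform(X)

print(X.shape, "->", X_pca.shape, X_best.shape, X_poly.shape)
print(onehot.shape)
```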
Lesson 4: Pipelines (30 minutes)
4.1 Feature selection and engineering
4.2 Grid search
4.3 Model [BREAK]
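One way the pieces of Lesson 4 might fit together is shown below: feature engineering steps and a model chained in a `Pipeline`, with a grid search over parameters of both. The step names (`scale`, `reduce`, `model`), the choice of `LogisticRegression`, and the parameter grid are assumptions for this sketch, not the configuration used in class.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Each step is (name, estimator); every step but the last must be a transformer.
pipe = Pipeline(
    [
        ("scale", StandardScaler()),
        ("reduce", PCA()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# Parameters of a step are addressed as "<step name>__<parameter>".
param_grid = {
    "reduce__n_components": [2, 3, 4],
    "model__C": [0.1, 1.0, 10.0],
}

# The whole pipeline is refitted inside each cross-validation fold, so the
# scaler and PCA never see the validation portion of that fold.
search = GridSearchCV(pipe, param_grid, cv=5)
search.fit(X_train, y_train)
test_accuracy = search.score(X_test, y_test)

print("Best parameters:", search.best_params_)
print("Held-out accuracy:", round(test_accuracy, 3))
```

Searching over the pipeline rather than over the bare model is the design choice worth noting: it keeps preprocessing inside the cross-validation loop and so avoids leaking information from validation data.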
Lesson 5: Robust Train/test splits (30 minutes)
5.3 KFold, RepeatedKFold, LeaveOneOut, LeavePOut, StratifiedKFold
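To make the differences between these splitters concrete, the following sketch counts how many train/test splits each one produces on a tiny dataset. The ten-row array, the imbalanced labels, and the specific `n_splits`, `n_repeats`, and `p` values are assumptions picked only to keep the numbers small.

```python
import numpy as np
from sklearn.model_selection import (
    KFold,
    LeaveOneOut,
    LeavePOut,
    RepeatedKFold,
    StratifiedKFold,
)

# Toy data (illustrative): 10 samples, with an imbalanced binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)

splitters = {
    "KFold": KFold(n_splits=5),
    "RepeatedKFold": RepeatedKFold(n_splits=5, n_repeats=2, random_state=0),
    "LeaveOneOut": LeaveOneOut(),
    "LeavePOut": LeavePOut(p=2),
    "StratifiedKFold": StratifiedKFold(n_splits=2),
}

# Number of train/test splits each strategy generates for this data.
n_splits = {name: cv.get_n_splits(X, y) for name, cv in splitters.items()}

# StratifiedKFold preserves class proportions, so each test fold receives
# one of the two minority-class samples.
minority_per_fold = [
    int(y[test_idx].sum())
    for _, test_idx in splitters["StratifiedKFold"].split(X, y)
]

for name, count in n_splits.items():
    print(f"{name}: {count} splits")
print("Minority samples per stratified test fold:", minority_per_fold)
```

Any of these objects can be passed as the `cv` argument to helpers such as `cross_val_score` or `GridSearchCV`; the exhaustive strategies (LeaveOneOut, LeavePOut) grow expensive quickly as the dataset grows.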