Thoughtful Machine Learning

Book Description

Learn how to apply test-driven development (TDD) to machine-learning algorithms and catch mistakes that could sink your analysis. In this practical guide, author Matthew Kirk takes you through the principles of TDD and machine learning, and shows you how to apply TDD to several machine-learning algorithms, including naive Bayesian classifiers and neural networks.

Machine-learning algorithms often have tests baked in, but they can’t account for human errors in coding. Rather than blindly relying on machine-learning results, as many researchers have, you can mitigate the risk of errors with TDD and write clean, stable machine-learning code. If you’re familiar with Ruby 2.1, you’re ready to start.

  • Apply TDD to write and run tests before you start coding
  • Learn the best uses and tradeoffs of eight machine-learning algorithms
  • Use real-world examples to test each algorithm through engaging, hands-on exercises
  • Understand the similarities between TDD and the scientific method for validating solutions
  • Be aware of the risks of machine learning, such as underfitting and overfitting data
  • Explore techniques for improving your machine-learning models or data extraction

Table of Contents

  1. Preface
    1. What to Expect from This Book
    2. How to Read This Book
    3. Who This Book Is For
    4. How to Contact Me
    5. Conventions Used in This Book
    6. Using Code Examples
    7. Safari® Books Online
    8. How to Contact Us
    9. Acknowledgments
  2. Chapter 1. Test-Driven Machine Learning
    1. History of Test-Driven Development
    2. TDD and the Scientific Method
      1. TDD Makes a Logical Proposition of Validity
      2. TDD Involves Writing Your Assumptions Down on Paper or in Code
      3. TDD and Scientific Method Work in Feedback Loops
    3. Risks with Machine Learning
      1. Unstable Data
      2. Underfitting
      3. Overfitting
      4. Unpredictable Future
    4. What to Test for to Reduce Risks
      1. Mitigate Unstable Data with Seam Testing
      2. Check Fit by Cross-Validating
      3. Reduce Overfitting Risk by Testing the Speed of Training
      4. Monitor for Future Shifts with Precision and Recall
    5. Conclusion
  3. Chapter 2. A Quick Introduction to Machine Learning
    1. What Is Machine Learning?
      1. Supervised Learning
      2. Unsupervised Learning
      3. Reinforcement Learning
    2. What Can Machine Learning Accomplish?
    3. Mathematical Notation Used Throughout the Book
    4. Conclusion
  4. Chapter 3. K-Nearest Neighbors Classification
    1. History of K-Nearest Neighbors Classification
    2. House Happiness Based on a Neighborhood
    3. How Do You Pick K?
      1. Guessing K
      2. Heuristics for Picking K
      3. Algorithms for Picking K
    4. What Makes a Neighbor “Near”?
      1. Minkowski Distance
      2. Mahalanobis Distance
    5. Determining Classes
    6. Beard and Glasses Detection Using KNN and OpenCV
      1. The Class Diagram
      2. Raw Image to Avatar
      3. The Face Class
      4. The Neighborhood Class
    7. Conclusion
  5. Chapter 4. Naive Bayesian Classification
    1. Using Bayes’ Theorem to Find Fraudulent Orders
      1. Conditional Probabilities
      2. Inverse Conditional Probability (aka Bayes’ Theorem)
    2. Naive Bayesian Classifier
      1. The Chain Rule
      2. Naivety in Bayesian Reasoning
      3. Pseudocount
    3. Spam Filter
      1. The Class Diagram
      2. Data Source
      3. Email Class
      4. Tokenization and Context
      5. The SpamTrainer
      6. Error Minimization Through Cross-Validation
    4. Conclusion
  6. Chapter 5. Hidden Markov Models
    1. Tracking User Behavior Using State Machines
      1. Emissions/Observations of Underlying States
      2. Simplification through the Markov Assumption
      3. Using Markov Chains Instead of a Finite State Machine
      4. Hidden Markov Model
    2. Evaluation: Forward-Backward Algorithm
      1. Using User Behavior
    3. The Decoding Problem through the Viterbi Algorithm
    4. The Learning Problem
    5. Part-of-Speech Tagging with the Brown Corpus
      1. The Seam of Our Part-of-Speech Tagger: CorpusParser
      2. Writing the Part-of-Speech Tagger
      3. Cross-Validating to Get Confidence in the Model
      4. How to Make This Model Better
    6. Conclusion
  7. Chapter 6. Support Vector Machines
    1. Solving the Loyalty Mapping Problem
    2. Derivation of SVM
    3. Nonlinear Data
      1. The Kernel Trick
      2. Soft Margins
    4. Using SVM to Determine Sentiment
      1. The Class Diagram
      2. Corpus Class
      3. Return a Unique Set of Words from the Corpus
      4. The CorpusSet Class
      5. The SentimentClassifier Class
      6. Improving Results Over Time
    5. Conclusion
  8. Chapter 7. Neural Networks
    1. History of Neural Networks
    2. What Is an Artificial Neural Network?
      1. Input Layer
      2. Hidden Layers
      3. Neurons
      4. Output Layer
      5. Training Algorithms
    3. Building Neural Networks
      1. How Many Hidden Layers?
      2. How Many Neurons for Each Layer?
      3. Tolerance for Error and Max Epochs
    4. Using a Neural Network to Classify a Language
      1. Writing the Seam Test for Language
      2. Cross-Validating Our Way to a Network Class
      3. Tuning the Neural Network
      4. Convergence Testing
      5. Precision and Recall for Neural Networks
      6. Wrap-Up of Example
    5. Conclusion
  9. Chapter 8. Clustering
    1. User Cohorts
    2. K-Means Clustering
      1. The K-Means Algorithm
      2. The Downside of K-Means Clustering
    3. Expectation Maximization (EM) Clustering
    4. The Impossibility Theorem
    5. Categorizing Music
      1. Gathering the Data
      2. Analyzing the Data with K-Means
      3. EM Clustering
      4. EM Jazz Clustering Results
    6. Conclusion
  10. Chapter 9. Kernel Ridge Regression
    1. Collaborative Filtering
    2. Linear Regression Applied to Collaborative Filtering
    3. Introducing Regularization, or Ridge Regression
    4. Kernel Ridge Regression
    5. Wrap-Up of Theory
    6. Collaborative Filtering with Beer Styles
      1. Data Set
      2. The Tools We Will Need
      3. Reviewer
      4. Writing the Code to Figure Out Someone’s Preference
      5. Collaborative Filtering with User Preferences
    7. Conclusion
  11. Chapter 10. Improving Models and Data Extraction
    1. The Problem with the Curse of Dimensionality
    2. Feature Selection
    3. Feature Transformation
    4. Principal Component Analysis (PCA)
    5. Independent Component Analysis (ICA)
    6. Monitoring Machine Learning Algorithms
      1. Precision and Recall: Spam Filter
      2. The Confusion Matrix
    7. Mean Squared Error
    8. The Wilds of Production Environments
    9. Conclusion
  12. Chapter 11. Putting It All Together
    1. Machine Learning Algorithms Revisited
    2. How to Use This Information for Solving Problems
    3. What’s Next for You?
  13. Index

Product Information

  • Title: Thoughtful Machine Learning
  • Author(s): Matthew Kirk
  • Release date: October 2014
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781449374068