Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications

Book description

The typical data science task in industry starts with an “ask” from the business. But few data scientists have been taught what to do with that ask. This book shows them how to assess it in the context of the business’s goals, reframe it to work optimally for both the data scientist and the employer, and then execute on it. Written by two of the experts who’ve achieved breakthrough optimizations at BuzzFeed, it’s packed with real-world examples that take you from start to finish: from ask to actionable insight.

Andrew Kelleher and Adam Kelleher walk you through well-formed, concrete principles for approaching common data science problems, giving you an easy-to-use checklist for effective execution. Using their principles and techniques, you’ll gain a deeper understanding of your data, learn how to analyze noise and confounding variables so they don’t compromise your analysis, and save weeks of iterative improvement by planning your projects more effectively up front.

Once you’ve mastered their principles, you’ll put them to work in two realistic, beginning-to-end site optimization tasks. These extended examples come complete with reusable code and recommended open-source solutions designed for easy adaptation to your everyday challenges. They will be especially valuable for anyone seeking their first data science job, and for everyone who’s found that job and wants to succeed in it.

Table of contents

  1. Cover
  2. About This E-Book
  3. Title Page
  4. Copyright Page
  5. Dedication
  6. Contents
  7. Foreword
  8. Preface
    1. Who This Book Is For
    2. What This Book Covers
    3. Going Forward
  9. About the Authors
  10. I: Principles of Framing
    1. 1. The Role of the Data Scientist
      1. 1.1 Introduction
      2. 1.2 The Role of the Data Scientist
      3. 1.3 Conclusion
    2. 2. Project Workflow
      1. 2.1 Introduction
      2. 2.2 The Data Team Context
      3. 2.3 Agile Development and the Product Focus
      4. 2.4 Conclusion
    3. 3. Quantifying Error
      1. 3.1 Introduction
      2. 3.2 Quantifying Error in Measured Values
      3. 3.3 Sampling Error
      4. 3.4 Error Propagation
      5. 3.5 Conclusion
    4. 4. Data Encoding and Preprocessing
      1. 4.1 Introduction
      2. 4.2 Simple Text Preprocessing
      3. 4.3 Information Loss
      4. 4.4 Conclusion
    5. 5. Hypothesis Testing
      1. 5.1 Introduction
      2. 5.2 What Is a Hypothesis?
      3. 5.3 Types of Errors
      4. 5.4 P-values and Confidence Intervals
      5. 5.5 Multiple Testing and “P-hacking”
      6. 5.6 An Example
      7. 5.7 Planning and Context
      8. 5.8 Conclusion
    6. 6. Data Visualization
      1. 6.1 Introduction
      2. 6.2 Distributions and Summary Statistics
      3. 6.3 Time-Series Plots
      4. 6.4 Graph Visualization
      5. 6.5 Conclusion
  11. II: Algorithms and Architectures
    1. 7. Introduction to Algorithms and Architectures
      1. 7.1 Introduction
      2. 7.2 Architectures
      3. 7.3 Models
      4. 7.4 Conclusion
    2. 8. Comparison
      1. 8.1 Introduction
      2. 8.2 Jaccard Distance
      3. 8.3 MinHash
      4. 8.4 Cosine Similarity
      5. 8.5 Mahalanobis Distance
      6. 8.6 Conclusion
    3. 9. Regression
      1. 9.1 Introduction
      2. 9.2 Linear Least Squares
      3. 9.3 Nonlinear Regression with Linear Regression
      4. 9.4 Random Forest
      5. 9.5 Conclusion
    4. 10. Classification and Clustering
      1. 10.1 Introduction
      2. 10.2 Logistic Regression
      3. 10.3 Bayesian Inference, Naive Bayes
      4. 10.4 K-Means
      5. 10.5 Leading Eigenvalue
      6. 10.6 Greedy Louvain
      7. 10.7 Nearest Neighbors
      8. 10.8 Conclusion
    5. 11. Bayesian Networks
      1. 11.1 Introduction
      2. 11.2 Causal Graphs, Conditional Independence, and Markovity
      3. 11.3 D-separation and the Markov Property
      4. 11.4 Causal Graphs as Bayesian Networks
      5. 11.5 Fitting Models
      6. 11.6 Conclusion
    6. 12. Dimensional Reduction and Latent Variable Models
      1. 12.1 Introduction
      2. 12.2 Priors
      3. 12.3 Factor Analysis
      4. 12.4 Principal Components Analysis
      5. 12.5 Independent Component Analysis
      6. 12.6 Latent Dirichlet Allocation
      7. 12.7 Conclusion
    7. 13. Causal Inference
      1. 13.1 Introduction
      2. 13.2 Experiments
      3. 13.3 Observation: An Example
      4. 13.4 Controlling to Block Non-causal Paths
      5. 13.5 Machine-Learning Estimators
      6. 13.6 Conclusion
    8. 14. Advanced Machine Learning
      1. 14.1 Introduction
      2. 14.2 Optimization
      3. 14.3 Neural Networks
      4. 14.4 Conclusion
  12. III: Bottlenecks and Optimizations
    1. 15. Hardware Fundamentals
      1. 15.1 Introduction
      2. 15.2 Random Access Memory
      3. 15.3 Nonvolatile/Persistent Storage
      4. 15.4 Throughput
      5. 15.5 Processors
      6. 15.6 Conclusion
    2. 16. Software Fundamentals
      1. 16.1 Introduction
      2. 16.2 Paging
      3. 16.3 Indexing
      4. 16.4 Granularity
      5. 16.5 Robustness
      6. 16.6 Extract, Transfer/Transform, Load
      7. 16.7 Conclusion
    3. 17. Software Architecture
      1. 17.1 Introduction
      2. 17.2 Client-Server Architecture
      3. 17.3 N-tier/Service-Oriented Architecture
      4. 17.4 Microservices
      5. 17.5 Monolith
      6. 17.6 Practical Cases (Mix-and-Match Architectures)
      7. 17.7 Conclusion
    4. 18. The CAP Theorem
      1. 18.1 Introduction
      2. 18.2 Consistency/Concurrency
      3. 18.3 Availability
      4. 18.4 Partition Tolerance
      5. 18.5 Conclusion
    5. 19. Logical Network Topological Nodes
      1. 19.1 Introduction
      2. 19.2 Network Diagrams
      3. 19.3 Load Balancing
      4. 19.4 Caches
      5. 19.5 Databases
      6. 19.6 Queues
      7. 19.7 Conclusion
  13. Bibliography
  14. Index
  15. Credits
  16. Code Snippets

Product information

  • Title: Machine Learning in Production: Developing and Optimizing Data Science Workflows and Applications
  • Author(s): Andrew Kelleher, Adam Kelleher
  • Release date: May 2019
  • Publisher(s): Addison-Wesley Professional
  • ISBN: 9780134116556