Machine Learning Infrastructure and Best Practices for Software Engineers

Book description

Efficiently transform your initial designs into large-scale systems by learning the foundations of infrastructure, algorithms, and ethical considerations for modern software products

Key Features

  • Learn how to scale up your machine learning software to a professional level
  • Secure the quality of your machine learning pipeline at runtime
  • Apply your knowledge to natural languages, programming languages, and images

Book Description

Although creating a machine learning pipeline, or developing a working prototype of a software system from that pipeline, is straightforward nowadays, the journey toward a professional software system is still extensive. This book will help you get to grips with best practices and recipes for transforming prototype pipelines into complete software products.

The book begins by introducing the main concepts of professional software systems that leverage machine learning at their core. As you progress, you’ll explore the differences between traditional, non-ML software and machine learning software. The initial best practices will guide you in determining the type of software you need for your product. Subsequently, you will delve into algorithms, covering their selection, development, and testing. You will then explore the intricacies of the infrastructure for machine learning systems by defining best practices for identifying the right data source and ensuring its quality.

Towards the end, you’ll address the most challenging aspect of large-scale machine learning systems – ethics. By exploring and defining best practices for assessing ethical risks and strategies for mitigation, you will conclude the book where it all began – large-scale machine learning software.

What you will learn

  • Identify which machine learning software best suits your needs
  • Work with scalable machine learning pipelines
  • Scale up pipelines from prototypes to fully fledged software
  • Choose suitable data sources and processing methods for your product
  • Differentiate raw data from complex processing, noting their advantages
  • Track and mitigate important ethical risks in machine learning software
  • Work with testing and validation for machine learning systems

Who this book is for

If you’re a machine learning engineer, this book will help you design more robust software and understand which scaling-up challenges you need to address and why. Software engineers will benefit from best practices that will make their products robust, reliable, and innovative. Decision makers will also find lots of useful information in this book, including guidance on what to look for in a well-designed machine learning software product.

Table of contents

  1. Machine Learning Infrastructure and Best Practices for Software Engineers
  2. Contributors
  3. About the author
  4. About the reviewers
  5. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Download the example code files
    5. Conventions used
    6. Get in touch
    7. Share Your Thoughts
    8. Download a free PDF copy of this book
  6. Part 1: Machine Learning Landscape in Software Engineering
  7. Machine Learning Compared to Traditional Software
    1. Machine learning is not traditional software
      1. Supervised, unsupervised, and reinforcement learning – it is just the beginning
      2. An example of traditional and machine learning software
    2. Probability and software – how well they go together
    3. Testing and evaluation – the same but different
    4. Summary
    5. References
  8. Elements of a Machine Learning System
    1. Elements of a production machine learning system
    2. Data and algorithms
    3. Data collection
      1. Feature extraction
      2. Data validation
    4. Configuration and monitoring
      1. Configuration
      2. Monitoring
    5. Infrastructure and resource management
      1. Data serving infrastructure
      2. Computational infrastructure
    6. How this all comes together – machine learning pipelines
    7. References
  9. Data in Software Systems – Text, Images, Code, and Their Annotations
    1. Raw data and features – what are the differences?
      1. Images
      2. Text
      3. Visualization of output from more advanced text processing
      4. Structured text – source code of programs
    2. Every data has its purpose – annotations and tasks
    3. Annotating text for intent recognition
    4. Where different types of data can be used together – an outlook on multi-modal data models
    5. References
  10. Data Acquisition, Data Quality, and Noise
    1. Sources of data and what we can do with them
    2. Extracting data from software engineering tools – Gerrit and Jira
    3. Extracting data from product databases – GitHub and Git
    4. Data quality
    5. Noise
    6. Summary
    7. References
  11. Quantifying and Improving Data Properties
    1. Feature engineering – the basics
    2. Clean data
    3. Noise in data management
    4. Attribute noise
    5. Splitting data
    6. How ML models handle noise
    7. References
  12. Part 2: Data Acquisition and Management
  13. Processing Data in Machine Learning Systems
    1. Numerical data
      1. Summarizing the data
      2. Diving deeper into correlations
      3. Summarizing individual measures
      4. Reducing the number of measures – PCA
    2. Other types of data – images
    3. Text data
    4. Toward feature engineering
    5. References
  14. Feature Engineering for Numerical and Image Data
    1. Feature engineering
    2. Feature engineering for numerical data
      1. PCA
      2. t-SNE
      3. ICA
      4. Locally linear embedding
      5. Linear discriminant analysis
      6. Autoencoders
    3. Feature engineering for image data
    4. Summary
    5. References
  15. Feature Engineering for Natural Language Data
    1. Natural language data in software engineering and the rise of GitHub Copilot
    2. What a tokenizer is and what it does
    3. Bag-of-words and simple tokenizers
    4. WordPiece tokenizer
    5. BPE
    6. The SentencePiece tokenizer
    7. Word embeddings
    8. FastText
    9. From feature extraction to models
    10. References
  16. Part 3: Design and Development of ML Systems
  17. Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)
    1. Why do we need different types of models?
    2. Classical machine learning models
    3. Convolutional neural networks and image processing
    4. BERT and GPT models
    5. Using language models in software systems
    6. Summary
    7. References
  18. Training and Evaluating Classical Machine Learning Systems and Neural Networks
    1. Training and testing processes
    2. Training classical machine learning models
    3. Understanding the training process
    4. Random forest and opaque models
    5. Training deep learning models
    6. Misleading results – data leaking
    7. Summary
    8. References
  19. Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders
    1. From classical ML to GenAI
    2. The theory behind advanced models – AEs and transformers
      1. AEs
      2. Transformers
    3. Training and evaluation of a RoBERTa model
    4. Training and evaluation of an AE
    5. Developing safety cages to prevent models from breaking the entire system
    6. Summary
    7. References
  20. Designing Machine Learning Pipelines (MLOps) and Their Testing
    1. What ML pipelines are
      1. ML pipelines
      2. Elements of MLOps
    2. ML pipelines – how to use ML in the system in practice
      1. Deploying models to HuggingFace
      2. Downloading models from HuggingFace
    3. Raw data-based pipelines
      1. Pipelines for NLP-related tasks
      2. Pipelines for images
    4. Feature-based pipelines
    5. Testing of ML pipelines
    6. Monitoring ML systems at runtime
    7. Summary
    8. References
  21. Designing and Implementing Large-Scale, Robust ML Software
    1. ML is not alone
    2. The UI of an ML model
    3. Data storage
    4. Deploying an ML model for numerical data
    5. Deploying a generative ML model for images
    6. Deploying a code completion model as an extension
    7. Summary
    8. References
  22. Part 4: Ethical Aspects of Data Management and ML System Development
  23. Ethics in Data Acquisition and Management
    1. Ethics in computer science and software engineering
    2. Data is all around us, but can we really use it?
    3. Ethics behind data from open source systems
    4. Ethics behind data collected from humans
    5. Contracts and legal obligations
    6. References
  24. Ethics in Machine Learning Systems
    1. Bias and ML – is it possible to have an objective AI?
    2. Measuring and monitoring for bias
      1. Other metrics of bias
    3. Developing mechanisms to prevent ML bias from spreading throughout the system
    4. Summary
    5. References
  25. Integrating ML Systems in Ecosystems
    1. Ecosystems
    2. Creating web services over ML models using Flask
      1. Creating a web service using Flask
      2. Creating a web service that contains a pre-trained ML model
    3. Deploying ML models using Docker
    4. Combining web services into ecosystems
    5. Summary
    6. References
  26. Summary and Where to Go Next
    1. To know where we’re going, we need to know where we’ve been
    2. Best practices
    3. Current developments
    4. My view on the future
    5. Final remarks
    6. References
  27. Index
    1. Why subscribe?
  28. Other Books You May Enjoy
    1. Packt is searching for authors like you
    2. Share Your Thoughts
    3. Download a free PDF copy of this book

Product information

  • Title: Machine Learning Infrastructure and Best Practices for Software Engineers
  • Author(s): Miroslaw Staron
  • Release date: January 2024
  • Publisher(s): Packt Publishing
  • ISBN: 9781837634064