Machine Learning Infrastructure and Best Practices for Software Engineers

Name: Machine Learning Infrastructure and Best Practices for Software Engineers
Author: Miroslaw Staron
ISBN: 9781837634064

January 2024

Intermediate to advanced

346 pages

9h 9m

English

Packt Publishing

Machine Learning Infrastructure and Best Practices for Software Engineers
  Contributors
  About the author
  About the reviewers
Preface
  Who this book is for
  What this book covers
  To get the most out of this book
  Download the example code files
  Conventions used
  Get in touch
  Share Your Thoughts
  Download a free PDF copy of this book
Part 1: Machine Learning Landscape in Software Engineering
Machine Learning Compared to Traditional Software
  Machine learning is not traditional software
  Supervised, unsupervised, and reinforcement learning – it is just the beginning
  An example of traditional and machine learning software
  Probability and software – how well they go together
  Testing and evaluation – the same but different
  Summary
  References
Elements of a Machine Learning System
  Elements of a production machine learning system
  Data and algorithms
  Data collection
  Feature extraction
  Data validation
  Configuration and monitoring
  Configuration
  Monitoring
  Infrastructure and resource management
  Data serving infrastructure
  Computational infrastructure
  How this all comes together – machine learning pipelines
  References
Data in Software Systems – Text, Images, Code, and Their Annotations
  Raw data and features – what are the differences?
  Images
  Text
  Visualization of output from more advanced text processing
  Structured text – source code of programs
  Every data has its purpose – annotations and tasks
  Annotating text for intent recognition
  Where different types of data can be used together – an outlook on multi-modal data models
  References
Data Acquisition, Data Quality, and Noise
  Sources of data and what we can do with them
  Extracting data from software engineering tools – Gerrit and Jira
  Extracting data from product databases – GitHub and Git
  Data quality
  Noise
  Summary
  References
Quantifying and Improving Data Properties
  Feature engineering – the basics
  Clean data
  Noise in data management
  Attribute noise
  Splitting data
  How ML models handle noise
  References
Part 2: Data Acquisition and Management
Processing Data in Machine Learning Systems
  Numerical data
  Summarizing the data
  Diving deeper into correlations
  Summarizing individual measures
  Reducing the number of measures – PCA
  Other types of data – images
  Text data
  Toward feature engineering
  References
Feature Engineering for Numerical and Image Data
  Feature engineering
  Feature engineering for numerical data
  PCA
  t-SNE
  ICA
  Locally linear embedding
  Linear discriminant analysis
  Autoencoders
  Feature engineering for image data
  Summary
  References
Feature Engineering for Natural Language Data
  Natural language data in software engineering and the rise of GitHub Copilot
  What a tokenizer is and what it does
  Bag-of-words and simple tokenizers
  WordPiece tokenizer
  BPE
  The SentencePiece tokenizer
  Word embeddings
  FastText
  From feature extraction to models
  References
Part 3: Design and Development of ML Systems
Types of Machine Learning Systems – Feature-Based and Raw Data-Based (Deep Learning)
  Why do we need different types of models?
  Classical machine learning models
  Convolutional neural networks and image processing
  BERT and GPT models
  Using language models in software systems
  Summary
  References
Training and Evaluating Classical Machine Learning Systems and Neural Networks
  Training and testing processes
  Training classical machine learning models
  Understanding the training process
  Random forest and opaque models
  Training deep learning models
  Misleading results – data leaking
  Summary
  References
Training and Evaluation of Advanced ML Algorithms – GPT and Autoencoders
  From classical ML to GenAI
  The theory behind advanced models – AEs and transformers
  AEs
  Transformers
  Training and evaluation of a RoBERTa model
  Training and evaluation of an AE
  Developing safety cages to prevent models from breaking the entire system
  Summary
  References
Designing Machine Learning Pipelines (MLOps) and Their Testing
  What ML pipelines are
  ML pipelines
  Elements of MLOps
  ML pipelines – how to use ML in the system in practice
  Deploying models to HuggingFace
  Downloading models from HuggingFace
  Raw data-based pipelines
  Pipelines for NLP-related tasks
  Pipelines for images
  Feature-based pipelines
  Testing of ML pipelines
  Monitoring ML systems at runtime
  Summary
  References
Designing and Implementing Large-Scale, Robust ML Software
  ML is not alone
  The UI of an ML model
  Data storage
  Deploying an ML model for numerical data
  Deploying a generative ML model for images
  Deploying a code completion model as an extension
  Summary
  References
Part 4: Ethical Aspects of Data Management and ML System Development
Ethics in Data Acquisition and Management
  Ethics in computer science and software engineering
  Data is all around us, but can we really use it?
  Ethics behind data from open source systems
  Ethics behind data collected from humans
  Contracts and legal obligations
  References
Ethics in Machine Learning Systems
  Bias and ML – is it possible to have an objective AI?
  Measuring and monitoring for bias
  Other metrics of bias
  Developing mechanisms to prevent ML bias from spreading throughout the system
  Summary
  References
Integrating ML Systems in Ecosystems
  Ecosystems
  Creating web services over ML models using Flask
  Creating a web service using Flask
  Creating a web service that contains a pre-trained ML model
  Deploying ML models using Docker
  Combining web services into ecosystems
  Summary
  References
Summary and Where to Go Next
  To know where we’re going, we need to know where we’ve been
  Best practices
  Current developments
  My view on the future
  Final remarks
  References
Index
Why subscribe?
Other Books You May Enjoy
  Packt is searching for authors like you
  Share Your Thoughts
  Download a free PDF copy of this book

Overview

Machine Learning Infrastructure and Best Practices for Software Engineers equips readers with a comprehensive understanding of best practices to transform machine learning prototypes into robust, scalable software systems. This book covers designing pipelines, scaling them up, ensuring data quality, and addressing the ethical dimensions of machine learning.

What this book will help me do

Transform machine learning prototypes into fully operational software systems.
Design scalable machine learning pipelines suitable for production environments.
Ensure quality and reliability in the data acquisition and processing stages.
Implement effective testing and validation strategies for machine learning systems.
Assess and mitigate ethical risks in large-scale machine learning implementations.

Author(s)

Miroslaw Staron, the author, is an experienced software engineer and academic with a deep focus on machine learning systems. He combines practical experience from the industry with insights from his extensive research to provide actionable and relevant guidance. His writing integrates theoretical concepts with practical applications to bridge the gap between research and implementation in machine learning software.

Who is it for?

This book is ideal for software engineers looking to improve their expertise in scaling machine learning prototypes, machine learning engineers aiming to understand the challenges in production-level systems, and decision-makers seeking to grasp the essential aspects of creating robust machine learning-driven solutions. Regardless of your current skill level, this book provides insights and practices to guide you in developing complete machine learning software systems.
