Chapter 1. Introduction

As machine learning (ML) becomes increasingly widespread, the number of topics that an ML practitioner needs to know about increases as well. Often, security isn’t a high-priority topic: “My company has a whole security team, so why should I worry about it?” But as I’ll discuss in this report, security for ML systems is different from security for traditional software. The rapid advances in ML come with a whole new set of security risks. With the speed of innovation in this area, is building secure ML possible?

In this report, I’ll seek to answer this question. I’ll discuss why security is particularly important for ML and review the known security risks for ML systems. I’ll also explain techniques to mitigate these attacks, enhance security, and increase privacy. I’ll answer the question of whether secure ML is possible by defining what is meant by “secure,” and discussing whether the techniques we have today are sufficient to achieve it.

As with many topics in machine learning, there has been a great deal of research into security, but most of it does not make it into the standard workflow of a data scientist or machine learning engineer. Researchers at Microsoft interviewed 28 companies on their preparedness around ML and security, and published the results in a paper that included the following quote:1

Industry practitioners are not equipped with tactical and strategic tools to protect, detect and respond to attacks on their Machine Learning systems.

This report aims to help bridge the gap between industry and academia, and highlight the important security topics that should be considered when building a machine learning system.

Who Is This Report For?

This report is for ML engineers, data scientists, managers of ML teams, and other professionals using ML. It’s for people for whom security is not the core focus, but who want to learn more about how an ML system can be attacked, and what they can do about it. This is not an introduction to machine learning; if ML is new to you, I recommend starting with one of the many excellent introductory books on the subject.

This report will teach you about some of the major security threats to ML systems, and some strategies that can help mitigate these threats. It won’t tell you everything you need to know to build a secure ML system, but it will help you start asking the right questions.

What Do We Mean by “Secure” ML?

Security is the practice of protecting systems from theft, damage, disruption, or unwanted information disclosure. It’s about reducing the risk of theft of information or intellectual property, and reducing the risk of unwanted intrusions into a system. In a machine learning system, this means protecting training data from theft, or protecting a model’s predictions from disruption.

This is not the same as privacy. We’re not considering how much personal data is shared with a company. A company can hold a person’s data securely, but the amount of data that it holds could be an invasion of that person’s privacy. That said, a security breach that exposes personal data is also a breach of privacy.

Security also doesn’t include ethics, a separate huge topic in ML. We’re not considering the morality of the system, whether the model should exist or not, or whether the model is biased. These are extremely important topics, but they are outside the scope of this report. The O’Reilly report Ethics and Data Science by Mike Loukides, Hilary Mason, and DJ Patil is a great place to get started on these topics.

What Security Standards and Regulations Apply to ML?

In the software industry, a company’s security is evaluated relative to external standards. A company may comply with general security standards such as ISO 27001, an international standard for information security management, or follow frameworks from the US National Institute of Standards and Technology (NIST). Privacy regulations such as the EU’s General Data Protection Regulation (GDPR) also include provisions on security. According to the GDPR, data must be:2

processed in a manner that ensures appropriate security of the personal data, including protection against unauthorised or unlawful processing and against accidental loss, destruction or damage, using appropriate technical or organisational measures.

At the time of writing, these standards don’t specifically call out machine learning, but NIST has published a draft report, A Taxonomy and Terminology of Adversarial Machine Learning, that is a step toward future standards. It’s best to consult with your company’s security team to find out what standards and regulations you should comply with, and how these might affect the way you treat your data and models.

Why Is Security for ML Different?

Security for machine learning is, in some ways, similar to security for traditional software. ML depends, of course, on the data it is trained on, and this input data provides a means of attacking a model. Without proper safeguards, an attacker can extract information from an exposed model endpoint. And the normal security standards around encrypting data in transit, protecting passwords, and physical security still apply.

However, some things are different. Unlike traditional software, we can’t specify in advance what a model’s output will be for every possible input. As I will discuss in Chapter 2, this means that a carefully crafted image fed to an ML system can trigger a response that we don’t want. ML is also heavily dependent on open source data and models, which provide another avenue of attack.
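
To make this concrete, here is a minimal sketch of one classic attack of this kind, the fast gradient sign method (FGSM). The names here are assumptions for illustration: a PyTorch classifier model and a correctly labeled, batched image tensor image. The attack nudges every pixel in whichever direction increases the model’s loss:

    import torch
    import torch.nn.functional as F

    def fgsm_attack(model, image, label, epsilon=0.03):
        """Return a perturbed copy of `image` that aims to change the prediction."""
        image = image.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(image), label)  # assumes batched input and label
        loss.backward()
        # Step each pixel by epsilon in the direction of the loss gradient.
        perturbed = image + epsilon * image.grad.sign()
        return perturbed.clamp(0, 1).detach()  # keep pixel values in [0, 1]

To a human, the perturbed image looks essentially identical to the original, but the model’s prediction can change completely.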

Note

Attacks on ML systems are often divided into “open-box” and “closed-box” attacks. In an open-box attack, the attacker has information on the ML algorithm or the weights of the trained model. In a closed-box attack, the attacker can only query the model and record the response.
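In code, the closed-box setting can be as simple as the sketch below, which assumes a hypothetical HTTP prediction endpoint; everything the attacker learns comes from the inputs they send and the responses they receive. In the open-box setting, by contrast, the attacker holds the weights themselves and can, for example, compute gradients as in the FGSM sketch above.

    import requests  # third-party HTTP client: pip install requests

    def query_model(image_bytes: bytes) -> dict:
        """Send one input to the model endpoint and record its response."""
        response = requests.post(
            "https://example.com/v1/predict",  # hypothetical endpoint URL
            files={"image": image_bytes},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()  # e.g. {"label": "cat", "score": 0.97}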

When Should You Think About Security?

There are a few key points in a machine learning lifecycle where security becomes particularly important. Assuming that you have some secure environment for training models (such as a cloud computing environment), points to watch include any steps where data or models enter or leave your environment. For example:

  • Deploying your model to a production system and exposing an endpoint

  • Downloading the weights of a pretrained model (one safeguard for this step is sketched after this list)

  • Setting up an automated training loop
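
As a sketch of one safeguard at the second of these points, you can verify a downloaded checkpoint against a checksum published by the model’s maintainers before loading it. The file name and expected digest below are placeholders:

    import hashlib

    WEIGHTS_PATH = "pretrained_weights.bin"  # placeholder file name
    EXPECTED_SHA256 = "<digest published by the model's maintainers>"

    def sha256_of(path, chunk_size=1 << 20):
        """Compute the SHA-256 digest of a file, reading it in 1 MB chunks."""
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    if sha256_of(WEIGHTS_PATH) != EXPECTED_SHA256:
        raise RuntimeError("Checksum mismatch: refusing to load these weights.")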

In the next chapter, I’ll explain more about the risks at each of these points.

It’s also important to weigh the potential harms that your ML system could cause. Security deserves particular attention if you are training models on personal or sensitive data, if an incorrect prediction could harm your users, or if your company’s reputation could suffer through the actions of your model.

1 Ram Shankar Siva Kumar et al., “Adversarial Machine Learning - Industry Perspectives,” March 19, 2021, https://arxiv.org/pdf/2002.05646.pdf.

2 General Data Protection Regulation 2016/679, Article 5, https://gdpr-info.eu/art-5-gdpr.
