Proposals for model vulnerability and security

Apply fair and private models, white-hat and forensic model debugging, and common sense to protect machine learning models from malicious actors.

By Patrick Hall
March 20, 2019
Cyber crime Cyber crime (source: Pixabay)

Like many others, I’ve known for some time that machine learning models themselves could pose security risks. A recent flourish of posts and papers has outlined the broader topic, listed attack vectors and vulnerabilities, started to propose defensive solutions, and provided the necessary framework for this post. The objective here is to brainstorm on potential security vulnerabilities and defenses in the context of popular, traditional predictive modeling systems, such as linear and tree-based models trained on static data sets. While I’m no security expert, I have been following the areas of machine learning debugging, explanations, fairness, interpretability, and privacy very closely, and I think many of these techniques can be applied to attack and defend predictive modeling systems.

In hopes of furthering discussions between actual security experts and practitioners in the applied machine learning community (like me), this post will put forward several plausible attack vectors for a typical machine learning system at a typical organization, propose tentative defensive solutions, and discuss a few general concerns and potential best practices.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

1. Data poisoning attacks

Data poisoning refers to someone systematically changing your training data to manipulate your model’s predictions. (Data poisoning attacks have also been called “causative” attacks.) To poison data, an attacker must have access to some or all of your training data. And at many companies, many different employees, consultants, and contractors have just that—and with little oversight. It’s also possible a malicious external actor could acquire unauthorized access to some or all of your training data and poison it. A very direct kind of data poisoning attack might involve altering the labels of a training data set. So, whatever the commercial application of your model is, the attacker could dependably benefit from your model’s predictions—for example, by altering labels so your model learns to award large loans, large discounts, or small insurance premiums to people like themselves. (Forcing your model to make a false prediction for the attacker’s benefit is sometimes called a violation of your model’s “integrity”.) It’s also possible that a malicious actor could use data poisoning to train your model to intentionally discriminate against a group of people, depriving them the big loan, big discount, or low premiums they rightfully deserve. This is like a denial-of-service (DOS) attack on your model itself. (Forcing your model to make a false prediction to hurt others is sometimes called a violation of your model’s “availability”.) While it might be simpler to think of data poisoning as changing the values in the existing rows of a data set, data poisoning can also be conducted by adding seemingly harmless or superfluous columns onto a data set. Altered values in these columns could then trigger altered model predictions.

Now, let’s discuss some potential defensive and forensic solutions for data poisoning:

  • Disparate impact analysis: Many banks already undertake disparate impact analysis for fair lending purposes to determine if their model is treating different types of people in a discriminatory manner. Many other organizations, however, aren’t yet so evolved. Disparate impact analysis could potentially discover intentional discrimination in model predictions. There are several great open source tools for detecting discrimination and disparate impact analysis, such as Aequitas, Themis, and AIF360.
  • Fair or private models: Models such as learning fair representations (LFR) and private aggregation of teacher ensembles (PATE) try to focus less on individual demographic traits to make predictions. These models may also be less susceptible to discriminatory data poisoning attacks.
  • Reject on Negative Impact (RONI): RONI is a technique that removes rows of data from the training data set that decrease prediction accuracy. See “The Security of Machine Learning” in section 9 for more information on RONI.
  • Residual analysis: Look for strange, prominent patterns in the residuals of your model predictions, especially for employees, consultants, or contractors.
  • Self-reflection: Score your models on your employees, consultants, and contractors and look for anomalously beneficial predictions.

Disparate impact analysis, residual analysis, and self-reflection can be conducted at training time and as part of real-time model monitoring activities.

2. Watermark attacks

Watermarking is a term borrowed from the deep learning security literature that often refers to putting special pixels into an image to trigger a desired outcome from your model. It seems entirely possible to do the same with customer or transactional data. Consider a scenario where an employee, consultant, contractor, or malicious external actor has access to your model’s production code—that makes real-time predictions. Such an individual could change that code to recognize a strange, or unlikely, combination of input variable values to trigger a desired prediction outcome. Like data poisoning, watermark attacks can be used to attack your model’s integrity or availability. For instance, to attack your model’s integrity, a malicious insider could insert a payload into your model’s production scoring code that recognizes the combination of age of 0 and years at an address of 99 to trigger some kind of positive prediction outcome for themselves or their associates. To deny model availability, an attacker could insert an artificial, discriminatory rule into your model’s scoring code that prevents your model from producing positive outcomes for a certain group of people.

Defensive and forensic approaches for watermark attacks might include:

  • Anomaly detection: Autoencoders are a fraud detection model that can identify input data that is strange or unlike other input data, but in complex ways. Autoencoders could potentially catch any watermarks used to trigger malicious mechanisms.
  • Data integrity constraints: Many databases don’t allow for strange or unrealistic combinations of input variables and this could potentially thwart watermarking attacks. Applying data integrity constraints on live, incoming data streams could have the same benefits.
  • Disparate impact analysis: see section 1.
  • Version control: Production model scoring code should be managed and version-controlled—just like any other mission-critical software asset.

Anomaly detection, data integrity constraints, and disparate impact analysis can be used at training time and as part of real-time model monitoring activities.

3. Inversion by surrogate models

Inversion basically refers to getting unauthorized information out of your model—as opposed to putting information into your model. Inversion can also be an example of an “exploratory reverse-engineering” attack. If an attacker can receive many predictions from your model API or other endpoint (website, app, etc.), they can train their own surrogate model. In short, that’s a simulation of your very own predictive model! An attacker could conceivably train a surrogate model between the inputs they used to generate the received predictions and the received predictions themselves. Depending on the number of predictions they can receive, the surrogate model could become quite an accurate simulation of your model. Once the surrogate model is trained, then the attacker has a sandbox from which to plan impersonation (i.e., “mimicry”) or adversarial example attacks against your model’s integrity, or the potential ability to start reconstructing aspects of your sensitive training data. Surrogate models can also be trained using external data sources that can be somehow matched to your predictions, as ProPublica famously did with the proprietary COMPAS recidivism model.

To protect your model against inversion by surrogate model, consider the following approaches:

  • Authorized access: Require additional authentication (e.g., 2FA) to receive a prediction.
  • Throttle predictions: Restrict high numbers of rapid predictions from single users; consider artificially increasing prediction latency.
  • White-hat surrogate models: As a white-hat hacking exercise, try this: train your own surrogate models between your inputs and the predictions of your production model and carefully observe:

    • the accuracy bounds of different types of white-hat surrogate models; try to understand the extent to which a surrogate model can really be used to learn unfavorable knowledge about your model.
    • the types of data trends that can be learned from your white-hat surrogate model, like linear trends represented by linear model coefficients.
    • the types of segments or demographic distributions that can be learned by analyzing the number of individuals assigned to certain white-hat surrogate decision tree nodes.
    • the rules that can be learned from a white-hat surrogate decision tree—for example, how to reliably impersonate an individual who would receive a beneficial prediction.

4. Membership inference by surrogate model

In an attack similar to inversion, also carried out by surrogate models, a malicious actor can determine whether a given person or product is in your model’s training data. Called a “membership inference” attack, this hack is executed with two layers of surrogate models. First an attacker passes data into a public prediction API or other endpoint, receives predictions back, and then trains a surrogate model or models between the passed data and the predictions. Once a surrogate model or models has been trained to replicate your model, the attacker then trains a second layer classifier that can differentiate between data that was used to train the surrogate(s) and data that was not used to train the surrogate(s). When this second model is used to attack your model, it can give a solid indication as to whether any given row (or rows) of data is in your training data.

Membership in a training data set can be sensitive when the model and data are related to undesirable outcomes such as bankruptcy or disease, or desirable outcomes like high income or net worth. Moreover, if the relationship between a single row and the target of your model can be easily generalized by an attacker, such as an obvious relationship between race, gender, age and some undesirable outcome, this attack can violate the privacy of an entire group of people. Frighteningly, when carried out to it’s fullest extent, a membership inference attack could also allow a malicious actor, with access only to a public prediction API or other model endpoint, to reconstruct portions of a sensitive or valuable data set.

While many of the defenses against surrogate model inversion also apply to membership inference attacks, other more specialized defenses may be in order:

  • Authorized access: see section 3.
  • Disparate impact analysis: see section 1.
  • Fair or private models: see section 1.
  • Monitor for development data: Monitor for data that is within a certain range of any individual used to develop the model. Real-time scoring of rows that are extremely similar or identical to data used in training, validation, or testing should be recorded and investigated.
  • Throttle predictions: see section 3.
  • White-hat surrogate models: see section 3.

5. Adversarial example attacks

A motivated attacker could theoretically learn, say by trial and error (i.e., “exploration” or “sensitivity analysis”), surrogate model inversion, or by social engineering, how to game your model to receive their desired prediction outcome or to avoid an undesirable prediction. Carrying out an attack by specifically engineering a row of data for such purposes is referred to as an adversarial example attack. (Sometimes also known as an “exploratory integrity” attack.) An attacker could use an adversarial example attack to grant themselves a large loan or a low insurance premium or to avoid denial of parole based on a high criminal risk score. Some people might call using adversarial examples to avoid an undesirable outcome from your model prediction “evasion.”

Try out the techniques outlined below to defend against or to confirm an adversarial example attack:

  • Activation analysis: Activation analysis requires benchmarking internal mechanisms of your predictive models, such as the average activation of neurons in your neural network or the proportion of observations assigned to each leaf node in your random forest. You then compare that information against your model’s behavior on incoming, real-world data streams. As one of my colleagues put it, “this is like seeing one leaf node in a random forest correspond to 0.1% of the training data but hit for 75% of the production scoring rows in an hour.” Patterns like this could be evidence of an adversarial example attack.
  • Anomaly detection: see section 2.
  • Authorized access: see section 3.
  • Benchmark models: Use a highly transparent benchmark model when scoring new data in addition to your more complex model. Interpretable models could be seen as harder to hack because their mechanisms are directly transparent. When scoring new data, compare your new fancy machine learning model against a trusted, transparent model or a model trained on a trusted data source and pipeline. If the difference between your more complex and opaque machine learning model and your interpretable or trusted model is too great, fall back to the predictions of the conservative model or send the row of data for manual processing. Also record the incident. It could be an adversarial example attack.
  • Throttle predictions: see section 3.
  • White-hat sensitivity analysis: Use sensitivity analysis to conduct your own exploratory attacks to understand what variable values (or combinations thereof) can cause large swings in predictions. Screen for these values, or combinations of values, when scoring new data. You may find the open source package cleverhans helpful for any white-hat exploratory analyses you conduct.
  • White-hat surrogate models: see section 3.

Activation analysis and benchmark models can be used at training time and as part of real-time model monitoring activities.

6. Impersonation

A motivated attacker can learn—say, again, by trial and error, surrogate model inversion, or social engineering—what type of input or individual receives a desired prediction outcome. The attacker can then impersonate this input or individual to receive their desired prediction outcome from your model. (Impersonation attacks are sometimes also known as “mimicry” attacks and resemble identity theft from the model’s perspective.) Like an adversarial example attack, an impersonation attack involves artificially changing the input data values to your model. Unlike an adversarial example attack, where a potentially random-looking combination of input data values could be used to trick your model, impersonation implies using the information associated with another modeled entity (i.e., convict, customer, employee, financial transaction, patient, product, etc.) to receive the prediction your model associates with that type of entity. For example, an attacker could learn what characteristics your model associates with awarding large discounts, like comping a room at a casino for a big spender, and then falsify their information to receive the same discount. They could also share their strategy with others, potentially leading to large losses for your company.

If you are using a two-stage model, be aware of an “allergy” attack. This is where a malicious actor may impersonate a normal row of input data for the first stage of your model in order to attack the second stage of your model.

Defensive and forensic approaches for impersonation attacks may include:

  • Activation analysis: see section 5.
  • Authorized access: see section 3.
  • Screening for duplicates: At scoring time track the number of similar records your model is exposed to, potentially in a reduced-dimensional space using autoencoders, multidimensional scaling (MDS), or similar dimension reduction techniques. If too many similar rows are encountered during some time span, take corrective action.
  • Security-aware features: Keep a feature in your pipeline, say num_similar_queries, that may be useless when your model is first trained or deployed but could be populated at scoring time (or during future model retrainings) to make your model or your pipeline security-aware. For instance, if at scoring time the value of num_similar_queries is greater than zero, the scoring request could be sent for human oversight. In the future, when you retrain your model, you could teach it to give input data rows with high num_similar_queries values negative prediction outcomes.

Activation analysis, screening for duplicates, and security-aware features can be used at training time and as part of real-time model monitoring activities.

7. General concerns

Several common machine learning usage patterns also present more general security concerns.

Blackboxes and unnecessary complexity: Although recent developments in interpretable models and model explanations have provided the opportunity to use accurate and also transparent nonlinear classifiers and regressors, many machine learning workflows are still centered around blackbox models. Such blackbox models are only one type of often unnecessary complexity in a typical commercial machine learning workflow. Other examples of potentially harmful complexity could be overly exotic feature engineering or large numbers of package dependencies. Such complexity can be problematic for at least two reasons:

  1. A dedicated, motivated attacker can, over time, learn more about your overly complex blackbox modeling system than you or your team knows about your own model. (Especially in today’s overheated and turnover-prone data “science” market.) To do so, they can use many newly available model-agnostic explanation techniques and old-school sensitivity analysis, among many other more common hacking tools. This knowledge imbalance can potentially be exploited to conduct the attacks described in sections 1 – 6 or for other yet unknown types of attacks.
  2. Machine learning in the research and development environment is highly dependent on a diverse ecosystem of open source software packages. Some of these packages have many, many contributors and users. Some are highly specific and only meaningful to a small number of researchers or practitioners. It’s well understood that many packages are maintained by brilliant statisticians and machine learning researchers whose primary focus is mathematics or algorithms, not software engineering, and certainly not security. It’s not uncommon for a machine learning pipeline to be dependent on dozens or even hundreds of external packages, any one of which could be hacked to conceal an attack payload.

Distributed systems and models: For better or worse, we live in the age of big data. Many organizations are now using distributed data processing and machine learning systems. Distributed computing can provide a broad attack surface for a malicious internal or external actor in the context of machine learning. Data could be poisoned on only one or a few worker nodes of a large distributed data storage or processing system. A back door for watermarking could be coded into just one model of a large ensemble. Instead of debugging one simple data set or model, now practitioners must examine data or models distributed across large computing clusters.

Distributed denial of service (DDOS) attacks: If a predictive modeling service is central to your organization’s mission, ensure you have at least considered more conventional distributed denial of service attacks, where attackers hit the public-facing prediction service with an incredibly high volume of requests to delay or stop predictions for legitimate users.

8. General solutions

Several older and newer general best practices can be employed to decrease your security vulnerabilities and to increase fairness, accountability, transparency, and trust in machine learning systems.

Authorized access and prediction throttling: Standard safeguards such as additional authentication and throttling may be highly effective at stymieing a number of the attack vectors described in sections 1–6.

Benchmark models: An older or trusted interpretable modeling pipeline, or other highly transparent predictor, can be used as a benchmark model from which to measure whether a prediction was manipulated by any number of means. This could include data poisoning, watermark attacks, or adversarial example attacks. If the difference between your trusted model’s prediction and your more complex and opaque model’s predictions are too large, record these instances. Refer them to human analysts or take other appropriate forensic or remediation steps. (Of course, serious precautions must be taken to ensure your benchmark model and pipeline remains secure and unchanged from its original, trusted state.)

Interpretable, fair, or private models: The techniques now exist (e.g., monotonic GBMs (M-GBM), scalable Bayesian rule lists (SBRL), eXplainable Neural Networks (XNN)), that can allow for both accuracy and interpretability. These accurate and interpretable models are easier to document and debug than classic machine learning blackboxes. Newer types of fair and private models (e.g., LFR, PATE) can also be trained to essentially care less about outward visible, demographic characteristics that can be observed, socially engineered into an adversarial example attack, or impersonated. Are you considering creating a new machine learning workflow in the future? Think about basing it on lower-risk, interpretable, private, or fair models. Models like this are more easily debugged and potentially robust to changes in an individual entity’s characteristics.

Model debugging for security: The newer field of model debugging is focused on discovering errors in machine learning model mechanisms and predictions, and remediating those errors. Debugging tools such a surrogate models, residual analysis, and sensitivity analysis can be used in white-hat exercises to understand your own vulnerabilities or for forensic exercises to find any potential attacks that may have occurred or be occurring.

Model documentation and explanation techniques: Model documentation is a risk-mitigation strategy that has been used for decades in banking. It allows knowledge about complex modeling systems to be preserved and transferred as teams of model owners change over time. Model documentation has been traditionally applied to highly transparent linear models. But with the advent of powerful, accurate explanatory tools (such as tree SHAP and derivative-based local feature attributions for neural networks), pre-existing blackbox model workflows can be at least somewhat explained, debugged, and documented. Documentation should obviously now include all security goals, including any known, remediated, or anticipated security vulnerabilities.

Model monitoring and management explicitly for security: Serious practitioners understand most models are trained on static snapshots of reality represented by training data and that their prediction accuracy degrades in real time as present realities drift away from the past information captured in the training data. Today, most model monitoring is aimed at discovering this drift in input variable distributions that will eventually lead to accuracy decay. Model monitoring should now likely be designed to monitor for the attacks described in sections 1 – 6 and any other potential threats your white-hat model debugging exercises uncover. (While not always directly related to security, my opinion is that models should also be evaluated for disparate impact in real time as well.) Along with model documentation, all modeling artifacts, source code, and associated metadata need to be managed, versioned, and audited for security like the valuable commercial assets they are.

Random data detection: Because random data could be used to generate surrogate models, rules for generating effective adversaries, or for boundary checks, autoencoder models could be used to detect random data in real time and prevent the issuance of predictions on random data.

Security-aware features: Features, rules, and pre- or post-processing steps can be included in your models or pipelines that are security-aware, such as the number of similar rows seen by the model, whether the current row represents an employee, contractor, or consultant, or whether the values in the current row are similar to those found in white-hat adversarial example attacks. These features may or may not be useful when a model is first trained. But keeping a placeholder for them when scoring new data, or when retraining future iterations of your model, may come in very handy one day.

Systemic anomaly detection: Train an autoencoder–based anomaly detection metamodel on your entire predictive modeling system’s operating statistics—the number of predictions in some time period, latency, CPU, memory, and disk loads, the number of concurrent users, and everything else you can get your hands on—and then closely monitor this metamodel for anomalies. An anomaly could tip you off that something is generally not right in your predictive modeling system. Subsequent investigation or specific mechanisms would be needed to trace down the exact problem.

9. References and further reading

A lot of the contemporary academic machine learning security literature focuses on adaptive learning, deep learning, and encryption. However, I don’t know many practitioners who are actually doing these things yet. So, in addition to recently published articles and blogs, I found papers from the 1990s and early 2000s about network intrusion, virus detection, spam filtering, and related topics to be helpful resources as well. If you’d like to learn more about the fascinating subject of securing machine learning models, here are the main references—past and present—that I used for this post. I’d recommend them for further reading, too.


I care very much about the science and practice of machine learning, and I am now concerned that the threat of a terrible machine learning hack, combined with growing concerns about privacy violations and algorithmic discrimination, could increase burgeoning public and political skepticism about machine learning and AI. We should all be mindful of AI winters in the not-so-distant past. Security vulnerabilities, privacy violations, and algorithmic discrimination could all potentially combine to lead to decreased funding for machine learning research or draconian over-regulation of the field. Let’s continue discussing and addressing these important problems to preemptively prevent a crisis, as opposed to having to reactively respond to one.

Post topics: Artificial Intelligence