Proposals for model vulnerability and security
Apply fair and private models, white-hat and forensic model debugging, and common sense to protect machine learning models from malicious actors.
Like many others, I’ve known for some time that machine learning models themselves could pose security risks. A recent flourish of posts and papers has outlined the broader topic, listed attack vectors and vulnerabilities, started to propose defensive solutions, and provided the necessary framework for this post. The objective here is to brainstorm on potential security vulnerabilities and defenses in the context of popular, traditional predictive modeling systems, such as linear and tree-based models trained on static data sets. While I’m no security expert, I have been following the areas of machine learning debugging, explanations, fairness, interpretability, and privacy very closely, and I think many of these techniques can be applied to attack and defend predictive modeling systems.
In hopes of furthering discussions between actual security experts and practitioners in the applied machine learning community (like me), this post will put forward several plausible attack vectors for a typical machine learning system at a typical organization, propose tentative defensive solutions, and discuss a few general concerns and potential best practices.
Data poisoning refers to someone systematically changing your training data to manipulate your model’s predictions. (Data poisoning attacks have also been called “causative” attacks.) To poison data, an attacker must have access to some or all of your training data. At many companies, many different employees, consultants, and contractors have just that, often with little oversight. It’s also possible that a malicious external actor could acquire unauthorized access to some or all of your training data and poison it. A very direct kind of data poisoning attack might involve altering the labels of a training data set. Whatever the commercial application of your model, the attacker could then dependably benefit from its predictions, for example by altering labels so your model learns to award large loans, large discounts, or small insurance premiums to people like themselves. (Forcing your model to make a false prediction for the attacker’s benefit is sometimes called a violation of your model’s “integrity”.) It’s also possible that a malicious actor could use data poisoning to train your model to intentionally discriminate against a group of people, depriving them of the big loan, big discount, or low premiums they rightfully deserve. This is like a denial-of-service (DoS) attack on your model itself. (Forcing your model to make a false prediction to hurt others is sometimes called a violation of your model’s “availability”.) While it might be simpler to think of data poisoning as changing the values in the existing rows of a data set, data poisoning can also be conducted by adding seemingly harmless or superfluous columns onto a data set. Altered values in these columns could then trigger altered model predictions.
Now, let’s discuss some potential defensive and forensic solutions for data poisoning:
Disparate impact analysis, residual analysis, and self-reflection can be conducted at training time and as part of real-time model monitoring activities.
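As a rough illustration of the disparate impact analysis mentioned above, the sketch below computes each group's favorable-outcome rate relative to a reference group. The function name, the toy data, and the 0.8 cutoff (the common "four-fifths" rule of thumb) are assumptions chosen for illustration, not a prescribed method.

```python
from collections import defaultdict

def adverse_impact_ratios(groups, outcomes, reference_group):
    """Return each group's favorable-outcome rate divided by the reference group's rate."""
    totals = defaultdict(int)
    favorable = defaultdict(int)
    for group, outcome in zip(groups, outcomes):
        totals[group] += 1
        favorable[group] += int(outcome)
    rates = {g: favorable[g] / totals[g] for g in totals}
    reference_rate = rates[reference_group]
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return {g: rate / reference_rate for g, rate in rates.items()}

# Hypothetical model decisions (1 = favorable outcome) for two groups.
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
outcomes = [1, 1, 1, 0, 1, 0, 0, 0]

ratios = adverse_impact_ratios(groups, outcomes, reference_group="a")
# Assumed threshold: ratios below 0.8 are flagged for human review.
flagged = {g: r for g, r in ratios.items() if r < 0.8}
print(ratios, flagged)
```

Running the same check on training labels and again on live predictions is one way to notice a shift that poisoned data might have introduced.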
Watermarking is a term borrowed from the deep learning security literature that often refers to putting special pixels into an image to trigger a desired outcome from your model. It seems entirely possible to do the same with customer or transactional data. Consider a scenario where an employee, consultant, contractor, or malicious external actor has access to the production code that makes your model’s real-time predictions. Such an individual could change that code to recognize a strange, or unlikely, combination of input variable values to trigger a desired prediction outcome. Like data poisoning, watermark attacks can be used to attack your model’s integrity or availability. For instance, to attack your model’s integrity, a malicious insider could insert a payload into your model’s production scoring code that recognizes the combination of age of 0 and years at an address of 99 to trigger some kind of positive prediction outcome for themselves or their associates. To deny model availability, an attacker could insert an artificial, discriminatory rule into your model’s scoring code that prevents your model from producing positive outcomes for a certain group of people.
Defensive and forensic approaches for watermark attacks might include:
Anomaly detection, data integrity constraints, and disparate impact analysis can be used at training time and as part of real-time model monitoring activities.
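A data integrity constraint can be as simple as rejecting rows that could not describe a real person. The sketch below is one hypothetical way to do that for the age and years-at-address example from the text. The field names and the plausible ranges are assumptions, and a check like this only catches the trigger in the input data. It does not detect tampered scoring code.

```python
def violates_constraints(row):
    """Return a list of human-readable reasons the row looks implausible."""
    reasons = []
    age = row["age"]
    years = row["years_at_address"]
    if not 18 <= age <= 120:  # assumed plausible range for an adult applicant
        reasons.append("age outside plausible range")
    if years < 0:
        reasons.append("negative years at address")
    if years > age:
        reasons.append("years at address exceeds age")
    return reasons

rows = [
    {"age": 42, "years_at_address": 7},
    {"age": 0, "years_at_address": 99},  # the watermark trigger described above
]
for row in rows:
    print(row, violates_constraints(row))
```

Rows that fail a constraint could be withheld from scoring and logged for forensic review.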
Inversion basically refers to getting unauthorized information out of your model—as opposed to putting information into your model. Inversion can also be an example of an “exploratory reverse-engineering” attack. If an attacker can receive many predictions from your model API or other endpoint (website, app, etc.), they can train their own surrogate model. In short, that’s a simulation of your very own predictive model! An attacker could conceivably train a surrogate model between the inputs they used to generate the received predictions and the received predictions themselves. Depending on the number of predictions they can receive, the surrogate model could become quite an accurate simulation of your model. Once the surrogate model is trained, then the attacker has a sandbox from which to plan impersonation (i.e., “mimicry”) or adversarial example attacks against your model’s integrity, or the potential ability to start reconstructing aspects of your sensitive training data. Surrogate models can also be trained using external data sources that can be somehow matched to your predictions, as ProPublica famously did with the proprietary COMPAS recidivism model.
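To get a feel for how little an attacker needs, a white-hat team can run the same exercise against its own endpoint. In the sketch below, black_box is a hypothetical stand-in for a deployed prediction API, and the surrogate is a plain one-variable least-squares line. Real models and surrogates would be more complex, but the workflow is the same: query, record, fit.

```python
import random

def black_box(income):
    """Hypothetical stand-in for a deployed model's prediction endpoint."""
    return 0.002 * income + 5.0

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# The "attacker" sends 200 queries and records the responses.
random.seed(0)
queries = [random.uniform(10_000, 200_000) for _ in range(200)]
responses = [black_box(q) for q in queries]

slope, intercept = fit_line(queries, responses)

def surrogate(income):
    """Simulation of the black box, learned only from its inputs and outputs."""
    return slope * income + intercept

print(surrogate(50_000), black_box(50_000))
```

Tracking how surrogate accuracy grows with the number of queries gives a rough sense of how many predictions any one consumer should be allowed to receive.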
To protect your model against inversion by surrogate model, consider the following approaches:
In an attack similar to inversion, also carried out by surrogate models, a malicious actor can determine whether a given person or product is in your model’s training data. Called a “membership inference” attack, this hack is executed with two layers of surrogate models. First, an attacker passes data into a public prediction API or other endpoint, receives predictions back, and then trains one or more surrogate models between the passed data and the predictions. Once the surrogate model or models have been trained to replicate your model, the attacker trains a second-layer classifier that can differentiate between data that was used to train the surrogate(s) and data that was not. When this second model is used to attack your model, it can give a solid indication as to whether any given row (or rows) of data is in your training data.
Membership in a training data set can be sensitive when the model and data are related to undesirable outcomes such as bankruptcy or disease, or desirable outcomes like high income or net worth. Moreover, if the relationship between a single row and the target of your model can be easily generalized by an attacker, such as an obvious relationship between race, gender, age and some undesirable outcome, this attack can violate the privacy of an entire group of people. Frighteningly, when carried out to its fullest extent, a membership inference attack could also allow a malicious actor, with access only to a public prediction API or other model endpoint, to reconstruct portions of a sensitive or valuable data set.
While many of the defenses against surrogate model inversion also apply to membership inference attacks, other more specialized defenses may be in order:
A motivated attacker could theoretically learn, say by trial and error (i.e., “exploration” or “sensitivity analysis”), surrogate model inversion, or social engineering, how to game your model to receive their desired prediction outcome or to avoid an undesirable prediction. Carrying out an attack by specifically engineering a row of data for such purposes is referred to as an adversarial example attack. (This is sometimes also known as an “exploratory integrity” attack.) An attacker could use an adversarial example attack to grant themselves a large loan or a low insurance premium or to avoid denial of parole based on a high criminal risk score. Some people might call using adversarial examples to avoid an undesirable outcome from your model prediction “evasion.”
Try out the techniques outlined below to defend against or to confirm an adversarial example attack:
Activation analysis and benchmark models can be used at training time and as part of real-time model monitoring activities.
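The trial-and-error search described above can be run as a white-hat sensitivity analysis before an attacker does it for you. The sketch below perturbs one feature at a time and records the values that flip the decision. The score function, its point values, and the feature names are all hypothetical stand-ins for a deployed model.

```python
def score(row):
    """Hypothetical scoring function standing in for a deployed model."""
    points = 0
    points += 50 if row["income"] >= 60_000 else 0
    points += 30 if row["years_employed"] >= 2 else 0
    points += 20 if row["num_late_payments"] == 0 else 0
    return points

def find_flips(row, feature, candidates, threshold=50):
    """Return candidate values of one feature that flip the decision for this row."""
    base_decision = score(row) >= threshold
    flips = []
    for value in candidates:
        perturbed = dict(row, **{feature: value})  # copy with one value changed
        if (score(perturbed) >= threshold) != base_decision:
            flips.append(value)
    return flips

applicant = {"income": 40_000, "years_employed": 3, "num_late_payments": 1}
late_payment_flips = find_flips(applicant, "num_late_payments", [0, 1, 2, 3])
income_flips = find_flips(applicant, "income", [50_000, 60_000, 80_000])
print(late_payment_flips, income_flips)
```

Features whose small, hard-to-verify changes flip the outcome are natural candidates for extra validation or monitoring.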
A motivated attacker can learn—say, again, by trial and error, surrogate model inversion, or social engineering—what type of input or individual receives a desired prediction outcome. The attacker can then impersonate this input or individual to receive their desired prediction outcome from your model. (Impersonation attacks are sometimes also known as “mimicry” attacks and resemble identity theft from the model’s perspective.) Like an adversarial example attack, an impersonation attack involves artificially changing the input data values to your model. Unlike an adversarial example attack, where a potentially random-looking combination of input data values could be used to trick your model, impersonation implies using the information associated with another modeled entity (i.e., convict, customer, employee, financial transaction, patient, product, etc.) to receive the prediction your model associates with that type of entity. For example, an attacker could learn what characteristics your model associates with awarding large discounts, like comping a room at a casino for a big spender, and then falsify their information to receive the same discount. They could also share their strategy with others, potentially leading to large losses for your company.
If you are using a two-stage model, be aware of an “allergy” attack. This is where a malicious actor may impersonate a normal row of input data for the first stage of your model in order to attack the second stage of your model.
Defensive and forensic approaches for impersonation attacks may include:
Security-aware features such as num_similar_queries may be useless when your model is first trained or deployed but could be populated at scoring time (or during future model retrainings) to make your model or your pipeline security-aware. For instance, if at scoring time the value of num_similar_queries is greater than zero, the scoring request could be sent for human oversight. In the future, when you retrain your model, you could teach it to give input data rows with high num_similar_queries values negative prediction outcomes.
Activation analysis, screening for duplicates, and security-aware features can be used at training time and as part of real-time model monitoring activities.
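One possible way to populate num_similar_queries at scoring time is to bucket numeric inputs coarsely and count how many earlier requests fell into the same buckets. The class name, the bucket widths, and the field names below are assumptions for illustration. A production version would also need persistent storage and some expiry policy.

```python
from collections import Counter

class SimilarQueryCounter:
    """Counts previously seen scoring requests that land in the same coarse buckets."""

    def __init__(self, bucket_widths):
        # Maps a numeric feature name to its bucket width; other features match exactly.
        self.bucket_widths = bucket_widths
        self.seen = Counter()

    def _key(self, row):
        key = []
        for name in sorted(row):
            value = row[name]
            width = self.bucket_widths.get(name)
            key.append((name, value // width if width else value))
        return tuple(key)

    def num_similar_queries(self, row):
        """Return how many similar rows were seen before this one, then record it."""
        key = self._key(row)
        count = self.seen[key]
        self.seen[key] += 1
        return count

counter = SimilarQueryCounter({"income": 1000, "age": 5})
requests = [
    {"income": 50_200, "age": 36, "zip": "12345"},
    {"income": 50_900, "age": 38, "zip": "12345"},  # near-duplicate of the first
    {"income": 91_000, "age": 52, "zip": "54321"},
]
counts = [counter.num_similar_queries(r) for r in requests]
needs_review = [c > 0 for c in counts]
print(counts, needs_review)
```

Requests where the count is greater than zero could be routed to human oversight, as suggested above.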
Several common machine learning usage patterns also present more general security concerns.
Blackboxes and unnecessary complexity: Although recent developments in interpretable models and model explanations have provided the opportunity to use accurate and also transparent nonlinear classifiers and regressors, many machine learning workflows are still centered around blackbox models. Such blackbox models are only one type of often unnecessary complexity in a typical commercial machine learning workflow. Other examples of potentially harmful complexity could be overly exotic feature engineering or large numbers of package dependencies. Such complexity can be problematic for at least two reasons:
Distributed systems and models: For better or worse, we live in the age of big data. Many organizations are now using distributed data processing and machine learning systems. Distributed computing can provide a broad attack surface for a malicious internal or external actor in the context of machine learning. Data could be poisoned on only one or a few worker nodes of a large distributed data storage or processing system. A back door for watermarking could be coded into just one model of a large ensemble. Instead of debugging one simple data set or model, now practitioners must examine data or models distributed across large computing clusters.
Distributed denial-of-service (DDoS) attacks: If a predictive modeling service is central to your organization’s mission, ensure you have at least considered more conventional distributed denial-of-service attacks, where attackers hit the public-facing prediction service with an incredibly high volume of requests to delay or stop predictions for legitimate users.
Several older and newer general best practices can be employed to decrease your security vulnerabilities and to increase fairness, accountability, transparency, and trust in machine learning systems.
Authorized access and prediction throttling: Standard safeguards such as additional authentication and throttling may be highly effective at stymieing a number of the attack vectors described in sections 1–6.
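Prediction throttling can be sketched as a per-client sliding-window limiter. The class below is a minimal illustration, not a production design. The class name, the limits, and the client identifier are assumptions, and the current time is passed in explicitly so the behavior is easy to test.

```python
from collections import defaultdict, deque

class PredictionThrottle:
    """Allows at most max_requests per client within a sliding time window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.history = defaultdict(deque)  # client_id -> timestamps of allowed requests

    def allow(self, client_id, now):
        """Return True and record the request if the client is under its limit."""
        timestamps = self.history[client_id]
        # Drop timestamps that have aged out of the window.
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_requests:
            return False
        timestamps.append(now)
        return True

throttle = PredictionThrottle(max_requests=3, window_seconds=60)
decisions = [throttle.allow("client-1", t) for t in (0, 10, 20, 30, 61)]
print(decisions)
```

In a live service, the now argument would typically come from a monotonic clock, and refused requests would be logged, since a client that repeatedly hits the limit may be harvesting predictions to train a surrogate model.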
Benchmark models: An older or trusted interpretable modeling pipeline, or other highly transparent predictor, can be used as a benchmark model from which to measure whether a prediction was manipulated by any number of means. This could include data poisoning, watermark attacks, or adversarial example attacks. If the difference between your trusted model’s prediction and your more complex and opaque model’s prediction is too large, record the instance. Refer such instances to human analysts or take other appropriate forensic or remediation steps. (Of course, serious precautions must be taken to ensure your benchmark model and pipeline remain secure and unchanged from their original, trusted state.)
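The benchmark comparison can be sketched as follows. Both models here are hypothetical toy functions, the complex one has been given the age-0, years-at-address-99 watermark from earlier in the text, and the tolerance of 0.2 is an arbitrary assumption that would need tuning on real data.

```python
def flag_disagreements(rows, trusted_model, complex_model, tolerance):
    """Return (row, trusted, candidate) where the two predictions differ by more than tolerance."""
    flagged = []
    for row in rows:
        trusted = trusted_model(row)
        candidate = complex_model(row)
        if abs(trusted - candidate) > tolerance:
            flagged.append((row, trusted, candidate))
    return flagged

def trusted_model(row):
    """Hypothetical transparent benchmark: a single income rule."""
    return 0.9 if row["income"] >= 60_000 else 0.2

def complex_model(row):
    """Hypothetical opaque model carrying a watermark payload."""
    if row["age"] == 0 and row["years_at_address"] == 99:
        return 0.99  # tampered outcome triggered by the watermark
    return 0.85 if row["income"] >= 60_000 else 0.25

rows = [
    {"income": 80_000, "age": 40, "years_at_address": 5},
    {"income": 20_000, "age": 0, "years_at_address": 99},
    {"income": 30_000, "age": 30, "years_at_address": 2},
]
flagged = flag_disagreements(rows, trusted_model, complex_model, tolerance=0.2)
print(flagged)
```

Only the watermarked row is flagged, because it is the only one where the opaque model departs sharply from the trusted benchmark.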
Interpretable, fair, or private models: Techniques now exist (e.g., monotonic GBMs (M-GBM), scalable Bayesian rule lists (SBRL), eXplainable Neural Networks (XNN)) that can allow for both accuracy and interpretability. These accurate and interpretable models are easier to document and debug than classic machine learning blackboxes. Newer types of fair and private models (e.g., LFR, PATE) can also be trained to essentially care less about outwardly visible demographic characteristics that can be observed, socially engineered into an adversarial example attack, or impersonated. Are you considering creating a new machine learning workflow in the future? Think about basing it on lower-risk, interpretable, private, or fair models. Models like this are more easily debugged and potentially robust to changes in an individual entity’s characteristics.
Model debugging for security: The newer field of model debugging is focused on discovering errors in machine learning model mechanisms and predictions, and remediating those errors. Debugging tools such as surrogate models, residual analysis, and sensitivity analysis can be used in white-hat exercises to understand your own vulnerabilities or in forensic exercises to find any potential attacks that may have occurred or be occurring.
Model documentation and explanation techniques: Model documentation is a risk-mitigation strategy that has been used for decades in banking. It allows knowledge about complex modeling systems to be preserved and transferred as teams of model owners change over time. Model documentation has been traditionally applied to highly transparent linear models. But with the advent of powerful, accurate explanatory tools (such as tree SHAP and derivative-based local feature attributions for neural networks), pre-existing blackbox model workflows can be at least somewhat explained, debugged, and documented. Documentation should obviously now include all security goals, including any known, remediated, or anticipated security vulnerabilities.
Model monitoring and management explicitly for security: Serious practitioners understand most models are trained on static snapshots of reality represented by training data and that their prediction accuracy degrades in real time as present realities drift away from the past information captured in the training data. Today, most model monitoring is aimed at discovering this drift in input variable distributions that will eventually lead to accuracy decay. Model monitoring should now likely be designed to monitor for the attacks described in sections 1–6 and any other potential threats your white-hat model debugging exercises uncover. (While not always directly related to security, my opinion is that models should also be evaluated for disparate impact in real time.) Along with model documentation, all modeling artifacts, source code, and associated metadata need to be managed, versioned, and audited for security like the valuable commercial assets they are.
Random data detection: Because random data could be used to train surrogate models, to derive rules for generating effective adversarial examples, or to run boundary checks, autoencoder models could be used to detect random data in real time and prevent the issuance of predictions on random data.
Security-aware features: Features, rules, and pre- or post-processing steps can be included in your models or pipelines that are security-aware, such as the number of similar rows seen by the model, whether the current row represents an employee, contractor, or consultant, or whether the values in the current row are similar to those found in white-hat adversarial example attacks. These features may or may not be useful when a model is first trained. But keeping a placeholder for them when scoring new data, or when retraining future iterations of your model, may come in very handy one day.
Systemic anomaly detection: Train an autoencoder–based anomaly detection metamodel on your entire predictive modeling system’s operating statistics—the number of predictions in some time period, latency, CPU, memory, and disk loads, the number of concurrent users, and everything else you can get your hands on—and then closely monitor this metamodel for anomalies. An anomaly could tip you off that something is generally not right in your predictive modeling system. Subsequent investigation or specific mechanisms would be needed to trace down the exact problem.
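The sketch below illustrates the monitoring idea with a much simpler stand-in than an autoencoder: a per-metric z-score against a baseline of normal operation. It is not the autoencoder metamodel described above. The metric names, the toy history, and the threshold of 3 standard deviations are assumptions for illustration.

```python
from statistics import mean, stdev

def fit_baseline(history):
    """Compute per-metric mean and standard deviation from normal operating history."""
    baseline = {}
    for metric in history[0]:
        values = [record[metric] for record in history]
        baseline[metric] = (mean(values), stdev(values))
    return baseline

def anomalous_metrics(baseline, observation, z_threshold=3.0):
    """Return metrics whose z-score against the baseline exceeds the threshold."""
    flagged = {}
    for metric, (mu, sigma) in baseline.items():
        if sigma == 0:
            continue  # no variation in the history to scale by; skipped in this sketch
        z = (observation[metric] - mu) / sigma
        if abs(z) > z_threshold:
            flagged[metric] = z
    return flagged

# Hypothetical operating statistics sampled during normal operation.
history = [
    {"predictions_per_min": 100, "latency_ms": 20},
    {"predictions_per_min": 110, "latency_ms": 22},
    {"predictions_per_min": 95, "latency_ms": 19},
    {"predictions_per_min": 105, "latency_ms": 21},
    {"predictions_per_min": 90, "latency_ms": 18},
]
baseline = fit_baseline(history)
flagged = anomalous_metrics(baseline, {"predictions_per_min": 900, "latency_ms": 21})
print(flagged)
```

A sudden jump in prediction volume like this one would not say what is wrong, only that the system deserves a closer look.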
A lot of the contemporary academic machine learning security literature focuses on adaptive learning, deep learning, and encryption. However, I don’t know many practitioners who are actually doing these things yet. So, in addition to recently published articles and blogs, I found papers from the 1990s and early 2000s about network intrusion, virus detection, spam filtering, and related topics to be helpful resources as well. If you’d like to learn more about the fascinating subject of securing machine learning models, here are the main references—past and present—that I used for this post. I’d recommend them for further reading, too.
I care very much about the science and practice of machine learning, and I am now concerned that the threat of a terrible machine learning hack, combined with growing concerns about privacy violations and algorithmic discrimination, could increase burgeoning public and political skepticism about machine learning and AI. We should all be mindful of AI winters in the not-so-distant past. Security vulnerabilities, privacy violations, and algorithmic discrimination could all potentially combine to lead to decreased funding for machine learning research or draconian over-regulation of the field. Let’s continue discussing and addressing these important problems to preemptively prevent a crisis, as opposed to having to reactively respond to one.