Risk Management for AI Chatbots

Services like ChatGPT may be the future, but they also bring risks.

By Q McCallum
June 27, 2023
(Photo by Jon Tyson on Unsplash)

Does your company plan to release an AI chatbot, similar to OpenAI’s ChatGPT or Google’s Bard? Doing so means giving the general public a freeform text box for interacting with your AI model.

That doesn’t sound so bad, right? Here’s the catch: for every one of your users who has read a “Here’s how ChatGPT and Midjourney can do half of my job” article, there may be at least one who has read one offering “Here’s how to get AI chatbots to do something nefarious.” They’re posting screencaps as trophies on social media; you’re left scrambling to close the loophole they exploited.

Learn faster. Dig deeper. See farther.

Join the O'Reilly online learning platform. Get a free trial today and find answers on the fly, or master something new and useful.

Learn more

Welcome to your company’s new AI risk management nightmare.

So, what do you do? I’ll share some ideas for mitigation. But first, let’s dig deeper into the problem.

Old Problems Are New Again

The text-box-and-submit-button combo exists on pretty much every website. It’s been that way since the web form was created roughly thirty years ago. So what’s so scary about putting up a text box so people can engage with your chatbot?

Those 1990s web forms demonstrate the problem all too well. When a person clicked “submit,” the website would pass that form data through some backend code to process it—thereby sending an e-mail, creating an order, or storing a record in a database. That code was too trusting, though. Malicious actors determined that they could craft clever inputs to trick it into doing something unintended, like exposing sensitive database records or deleting information. (The most popular attacks were cross-site scripting and SQL injection, the latter of which is best explained in the story of “Little Bobby Tables.”)

With a chatbot, the web form passes an end-user’s freeform text input—a “prompt,” or a request to act—to a generative AI model. That model creates the response images or text by interpreting the prompt and then replaying (a probabilistic variation of) the patterns it uncovered in its training data.

That leads to three problems:

  1. By default, that underlying model will respond to any prompt.  Which means your chatbot is effectively a naive person who has access to all of the information from the training dataset. A rather juicy target, really. In the same way that bad actors will use social engineering to fool humans guarding secrets, clever prompts are a form of  social engineering for your chatbot. This kind of prompt injection can get it to say nasty things. Or reveal a recipe for napalm. Or divulge sensitive details. It’s up to you to filter the bot’s inputs, then.
  2. The range of potentially unsafe chatbot inputs amounts to “any stream of human language.” It just so happens, this also describes all possible chatbot inputs. With a SQL injection attack, you can “escape” certain characters so that the database doesn’t give them special treatment. There’s currently no equivalent, straightforward way to render a chatbot’s input safe. (Ask anyone who’s done content moderation for social media platforms: filtering specific terms will only get you so far, and will also lead to a lot of false positives.)
  3. The model is not deterministic. Each invocation of an AI chatbot is a probabilistic journey through its training data. One prompt may return different answers each time it is used. The same idea, worded differently, may take the bot down a completely different road. The right prompt can get the chatbot to reveal information you didn’t even know was in there. And when that happens, you can’t really explain how it reached that conclusion.

Why haven’t we seen these problems with other kinds of AI models, then? Because most of those have been deployed in such a way that they are only communicating with trusted internal systems. Or their inputs pass through layers of indirection that structure and limit their shape. Models that accept numeric inputs, for example, might sit behind a filter that only permits the range of values observed in the training data.

What Can You Do?

Before you give up on your dreams of releasing an AI chatbot, remember: no risk, no reward.

The core idea of risk management is that you don’t win by saying “no” to everything. You win by understanding the potential problems ahead, then figure out how to steer clear of them. This approach reduces your chances of downside loss while leaving you open to the potential upside gain.

I’ve already described the risks of your company deploying an AI chatbot. The rewards include improvements to your products and services, or streamlined customer service, or the like. You may even get a publicity boost, because just about every other article these days is about how companies are using chatbots.

So let’s talk about some ways to manage that risk and position you for a reward. (Or, at least, position you to limit your losses.)

Spread the word: The first thing you’ll want to do is let people in the company know what you’re doing. It’s tempting to keep your plans under wraps—nobody likes being told to slow down or change course on their special project—but there are several people in your company who can help you steer clear of trouble. And they can do so much more for you if they know about the chatbot long before it is released.

Your company’s Chief Information Security Officer (CISO) and Chief Risk Officer will certainly have ideas. As will your legal team. And maybe even your Chief Financial Officer, PR team, and head of HR, if they have sailed rough seas in the past.

Define a clear terms of service (TOS) and acceptable use policy (AUP): What do you do with the prompts that people type into that text box? Do you ever provide them to law enforcement or other parties for analysis, or feed it back into your model for updates? What guarantees do you make or not make about the quality of the outputs and how people use them? Putting your chatbot’s TOS front-and-center will let people know what to expect before they enter sensitive personal details or even confidential company information. Similarly, an AUP will explain what kinds of prompts are permitted.

(Mind you, these documents will spare you in a court of law in the event something goes wrong. They may not hold up as well in the court of public opinion, as people will accuse you of having buried the important details in the fine print. You’ll want to include plain-language warnings in your sign-up and around the prompt’s entry box so that people can know what to expect.)

Prepare to invest in defense: You’ve allocated a budget to train and deploy the chatbot, sure. How much have you set aside to keep attackers at bay? If the answer is anywhere close to “zero”—that is, if you assume that no one will try to do you harm—you’re setting yourself up for a nasty surprise. At a bare minimum, you will need additional team members to establish defenses between the text box where people enter prompts and the chatbot’s generative AI model. That leads us to the next step.

Keep an eye on the model: Longtime readers will be familiar with my catchphrase, “Never let the machines run unattended.” An AI model is not self-aware, so it doesn’t know when it’s operating out of its depth. It’s up to you to filter out bad inputs before they induce the model to misbehave.

You’ll also need to review samples of the prompts supplied by end-users (there’s your TOS calling) and the results returned by the backing AI model. This is one way to catch the small cracks before the dam bursts. A spike in a certain prompt, for example, could imply that someone has found a weakness and they’ve shared it with others.

Be your own adversary: Since outside actors will try to break the chatbot, why not give some insiders a try? Red-team exercises can uncover weaknesses in the system while it’s still under development.

This may seem like an invitation for your teammates to attack your work. That’s because it is. Better to have a “friendly” attacker uncover problems before an outsider does, no?

Narrow the scope of audience: A chatbot that’s open to a very specific set of users—say, “licensed medical practitioners who must prove their identity to sign up and who use 2FA to login to the service”—will be tougher for random attackers to access. (Not impossible, but definitely tougher.) It should also see fewer hack attempts by the registered users because they’re not looking for a joyride; they’re using the tool to complete a specific job.

Build the model from scratch (to narrow the scope of training data): You may be able to extend an existing, general-purpose AI model with your own data (through an ML technique called transfer learning). This approach will shorten your time-to-market, but also leave you to question what went into the original training data. Building your own model from scratch gives you complete control over the training data, and therefore, additional influence (though, not “control”) over the chatbot’s outputs.

This highlights an added value in training on a domain-specific dataset: it’s unlikely that anyone would, say, trick the finance-themed chatbot BloombergGPT into revealing the secret recipe for Coca-Cola or instructions for acquiring illicit substances. The model can’t reveal what it doesn’t know.

Training your own model from scratch is, admittedly, an extreme option. Right now this approach requires a combination of technical expertise and compute resources that are out of most companies’ reach. But if you want to deploy a custom chatbot and are highly sensitive to reputation risk, this option is worth a look.

Slow down: Companies are caving to pressure from boards, shareholders, and sometimes internal stakeholders to release an AI chatbot. This is the time to remind them that a broken chatbot released this morning can be a PR nightmare before lunchtime. Why not take the extra time to test for problems?


Thanks to its freeform input and output, an AI-based chatbot exposes you to additional risks above and beyond using other kinds of AI models. People who are bored, mischievous, or looking for fame will try to break your chatbot just to see whether they can. (Chatbots are extra tempting right now because they are novel, and “corporate chatbot says weird things” makes for a particularly humorous trophy to share on social media.)

By assessing the risks and proactively developing mitigation strategies, you can reduce the chances that attackers will convince your chatbot to give them bragging rights.

I emphasize the term “reduce” here. As your CISO will tell you, there’s no such thing as a “100% secure” system. What you want to do is close off the easy access for the amateurs, and at least give the hardened professionals a challenge.

Many thanks to Chris Butler and Michael S. Manley for reviewing (and dramatically improving) early drafts of this article. Any rough edges that remain are mine.

Post topics: AI & ML, Artificial Intelligence
Post tags: Deep Dive

Get the O’Reilly Radar Trends to Watch newsletter