Chapter 6. Adversarial Attacks and Defenses
In the previous chapter, you explored the secure deployment of large language models (LLMs) from both engineering and organizational perspectives. You examined infrastructure considerations, API design patterns, and access control mechanisms that help safeguard these models in production environments. However, even the most carefully deployed system remains vulnerable if the underlying model itself can be manipulated.
This chapter shifts the focus to the cat-and-mouse game between attackers and defenders in the LLM landscape. You’ll first don the hat of an adversary to understand how these models can be attacked, then pivot to examine the defensive measures that can protect them. Like other deep learning systems, LLMs are vulnerable to adversarial attacks: carefully crafted inputs designed to manipulate the model’s behavior in unintended and potentially harmful ways.
The stakes are significant. As LLMs become increasingly integrated into critical applications, from financial services and healthcare to content moderation and security systems, their vulnerabilities can lead to severe consequences. An attacker who successfully manipulates an LLM might bypass content filters to generate harmful content, extract private information from the training data, or even compromise downstream systems that rely on the model’s outputs.
In this chapter, you’ll explore four key aspects of LLM security. First, ...