Chapter 5. Can Your LLM Know Too Much?

In 2023, a rash of companies began banning or heavily restricting the use of LLM services such as ChatGPT over concerns about possible leaks of confidential data. A partial list includes Samsung, JPMorgan Chase, Amazon, Bank of America, Citigroup, Deutsche Bank, Wells Fargo, and Goldman Sachs. When giant finance and tech corporations take such steps, they are signaling substantial concern about LLMs disclosing confidential and sensitive information, but how critical is the risk? As the developer of an LLM application, do you need to care?

In the Tay story from Chapter 1, Microsoft’s chatbot was attacked by hackers. As bad as the damage was, it was limited because Tay didn’t have access to much sensitive data that she could have disclosed. When LLMs intersect with real-world data, however, the potential for unintended information disclosure grows, as seen in cases where employees inadvertently fed sensitive business data to ChatGPT, where it could be absorbed into the system’s training data and later surfaced to other users.

This chapter will dig into the various ways that LLMs acquire access to data. We will examine the three predominant knowledge acquisition methods and the risks associated with your LLM having this access. Along the way, we’ll try to answer the question “Can your LLM know too much?” and discuss how to mitigate the risks associated with your application disclosing sensitive, private, or confidential data.
