Chapter 4. Data Engineering for LLMs
In this chapter, you will learn about data engineering, data management practices, and the database tools and systems available. The discussion will be geared toward data, DevOps, and MLOps engineers who want to become LLMOps engineers and/or lead their company’s data engineering efforts. By the end of this chapter, you will have a strong grasp of the foundations of data engineering, as well as best practices for LLMs.
Data Engineering and the Rise of LLMs
In the late 1960s, British computer scientist Edgar F. Codd, fresh from finishing his doctorate on self-replicating computers, was working at IBM. Codd became fascinated by the theory of data arrangement and in 1970 published a paper in Communications of the ACM called “A Relational Model of Data for Large Shared Data Banks” that introduced what we know today as relational databases. For example, instead of a single sales table in which each record contains all the information about the products and the customers to whom they’ve been sold, a relational database stores this data in multiple related tables: one for customers, one for products, and one for sales. Before relational databases, something as simple as a change in a customer’s address would require changing every sales record for that customer, which was an expensive operation on mainframes. In a relational database, you change just the customer record. The sales records refer to the customer by key rather than copying the address, so every related record reflects the new address without being rewritten.
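The customers, products, and sales layout described above can be sketched with Python’s built-in `sqlite3` module. This is a minimal illustration, not a schema from Codd’s paper: the table names, column names, and sample rows are all assumptions made up for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with three related tables (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, address TEXT);
        CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT);
        -- Each sale stores only keys that point at a customer and a product.
        CREATE TABLE sales (
            id          INTEGER PRIMARY KEY,
            customer_id INTEGER REFERENCES customers(id),
            product_id  INTEGER REFERENCES products(id)
        );
        INSERT INTO customers VALUES (1, 'Ada', '1 Old Street');
        INSERT INTO products  VALUES (1, 'Keyboard');
        INSERT INTO products  VALUES (2, 'Monitor');
        INSERT INTO sales     VALUES (1, 1, 1);
        INSERT INTO sales     VALUES (2, 1, 2);
        """
    )
    return conn


def sales_with_addresses(conn: sqlite3.Connection) -> list:
    """Join sales to customers and products to see the address attached to each sale."""
    query = """
        SELECT s.id, c.address, p.name
        FROM sales AS s
        JOIN customers AS c ON c.id = s.customer_id
        JOIN products  AS p ON p.id = s.product_id
        ORDER BY s.id
    """
    return conn.execute(query).fetchall()


conn = build_demo_db()

# Change the address once, in the customers table only.
cursor = conn.execute(
    "UPDATE customers SET address = ? WHERE id = ?", ("2 New Avenue", 1)
)
rows_changed = cursor.rowcount

# Both sales now show the new address, although no sales row was modified.
rows = sales_with_addresses(conn)
print(rows_changed, rows)
```

The update touches a single row, yet the join returns the new address for every sale, because the address lives in exactly one place.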
While it didn’t fascinate anyone at IBM right away, the ...