12 Text Preprocessing in the Era of LLMs

In the era of Large Language Models (LLMs), text preprocessing remains essential. However complex and capable LLMs become, the success of a Natural Language Processing (NLP) task still depends on how well the text data is prepared. In this chapter, we will discuss text preprocessing as the foundation for any NLP task and explore essential preprocessing techniques, focusing on how to adapt them to make the most of LLMs.

In this chapter, we’ll cover the following topics:

Relearning text preprocessing in the era of LLMs
Text cleaning techniques
Handling rare words and spelling variations
Chunking
Tokenization strategies
Turning tokens into embeddings
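
To make these topics concrete, here is a minimal sketch of three of them (cleaning, chunking, and tokenization) using only the Python standard library. The function names `clean_text`, `chunk_words`, and `tokenize`, and the default chunk size and overlap, are illustrative assumptions rather than the chapter's own code. Real LLM pipelines typically use a model-specific subword tokenizer instead of the simple regex split shown here, and the embedding step is not covered.

```python
from __future__ import annotations

import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize Unicode, strip HTML-like tags, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)  # replace tags with a space
    return re.sub(r"\s+", " ", text).strip()


def chunk_words(text: str, size: int = 8, overlap: int = 2) -> list[str]:
    """Split text into overlapping chunks of at most `size` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


if __name__ == "__main__":
    raw = "<p>Text   preprocessing still matters, even with LLMs!</p>"
    cleaned = clean_text(raw)
    print(cleaned)
    print(chunk_words(cleaned, size=4, overlap=1))
    print(tokenize(cleaned))
```

The overlap between chunks is a common way to keep some context across chunk boundaries. The right chunk size depends on the model's context window and on the task.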

Technical ...

Python Data Cleaning and Preparation Best Practices by Maria Zervou
