12

Text Preprocessing in the Era of LLMs

In the era of Large Language Models (LLMs), text preprocessing remains essential. Even as LLMs grow in complexity and capability, the success of a Natural Language Processing (NLP) task still depends on how well the text data is prepared. In this chapter, we will discuss text preprocessing as the foundation of any NLP task and explore essential preprocessing techniques, focusing on how to adapt them to get the most out of LLMs.

In this chapter, we’ll cover the following topics:

  • Relearning text preprocessing in the era of LLMs
  • Text cleaning techniques
  • Handling rare words and spelling variations
  • Chunking
  • Tokenization strategies
  • Turning tokens into embeddings

Technical ...

Get Python Data Cleaning and Preparation Best Practices now with the O’Reilly learning platform.