Applied Natural Language Processing in the Enterprise
by Ankur A. Patel, Ajay Uppili Arasanipalai
Chapter 4. Tokenization
This is the first chapter in our section on NLP from the ground up. In the first three chapters, we walked you through the high-level components of an NLP pipeline. From here through Chapter 9, we'll cover the underlying details needed to really understand how modern NLP systems work. The main components of this are:
- Tokenization
- Embeddings
- Architectures
Previously, all of these steps were abstracted away in the libraries we used (spaCy, transformers, and fastai). But now, we'll try to understand how these libraries actually work and how you can modify your code at a low level to build amazing NLP applications beyond the simple examples we presented in this book.
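To make the idea concrete before we dig into the details, here is a toy sketch of the kind of work a tokenizer does under the hood. This is our own illustrative example, not the actual algorithm spaCy or transformers uses; real tokenizers also handle subwords, special cases, and Unicode normalization.

```python
import re

def simple_tokenize(text):
    """Split text into word and punctuation tokens.

    A deliberately minimal illustration: \\w+ matches runs of word
    characters, and [^\\w\\s] matches individual punctuation marks,
    so contractions like "isn't" split into ["isn", "'", "t"].
    """
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("NLP isn't magic, it's engineering!"))
# → ['NLP', 'isn', "'", 't', 'magic', ',', 'it', "'", 's', 'engineering', '!']
```

Even this few lines of regex makes a design decision (how to treat apostrophes) that production tokenizers resolve with far more care, which is exactly why the topic deserves its own chapter.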
One thing to note: “low level” is a subjective term. While some may call PyTorch a low-level deep learning library, others may scoff at using that term for anything other than building a custom memory allocator in x86 assembly. It’s a matter of perspective. What we mean by low level here is that after learning about these things, you’ll have enough of an understanding to build useful applications with NLP in the real world and that you’ll also be able to understand and follow the latest research in the field. We won’t be discussing anything that’s too far beyond the scope of NLP. For example, learning about how CUDA works is certainly both interesting and useful, and we’ll do a bit of that in Appendix B. But CUDA itself as a tool is useful for many things outside NLP, so we’d consider ...