Natural Language Processing with Transformers, Revised Edition

by Lewis Tunstall, Leandro von Werra, Thomas Wolf

May 2022

Intermediate to advanced

408 pages

11h 25m

English

O'Reilly Media, Inc.

Foreword
Preface
Who Is This Book For?; What You Will Learn; Software and Hardware Requirements; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments; Lewis; Leandro; Thomas
1. Hello Transformers
The Encoder-Decoder Framework; Attention Mechanisms; Transfer Learning in NLP; Hugging Face Transformers: Bridging the Gap; A Tour of Transformer Applications; Text Classification; Named Entity Recognition; Question Answering; Summarization; Translation; Text Generation; The Hugging Face Ecosystem; The Hugging Face Hub; Hugging Face Tokenizers; Hugging Face Datasets; Hugging Face Accelerate; Main Challenges with Transformers; Conclusion
2. Text Classification
The Dataset; A First Look at Hugging Face Datasets; From Datasets to DataFrames; Looking at the Class Distribution; How Long Are Our Tweets?; From Text to Tokens; Character Tokenization; Word Tokenization; Subword Tokenization; Tokenizing the Whole Dataset; Training a Text Classifier; Transformers as Feature Extractors; Fine-Tuning Transformers; Conclusion
3. Transformer Anatomy
The Transformer Architecture; The Encoder; Self-Attention; The Feed-Forward Layer; Adding Layer Normalization; Positional Embeddings; Adding a Classification Head; The Decoder; Meet the Transformers; The Transformer Tree of Life; The Encoder Branch; The Decoder Branch; The Encoder-Decoder Branch; Conclusion
4. Multilingual Named Entity Recognition
The Dataset; Multilingual Transformers; A Closer Look at Tokenization; The Tokenizer Pipeline; The SentencePiece Tokenizer; Transformers for Named Entity Recognition; The Anatomy of the Transformers Model Class; Bodies and Heads; Creating a Custom Model for Token Classification; Loading a Custom Model; Tokenizing Texts for NER; Performance Measures; Fine-Tuning XLM-RoBERTa; Error Analysis; Cross-Lingual Transfer; When Does Zero-Shot Transfer Make Sense?; Fine-Tuning on Multiple Languages at Once; Interacting with Model Widgets; Conclusion
5. Text Generation
The Challenge with Generating Coherent Text; Greedy Search Decoding; Beam Search Decoding; Sampling Methods; Top-k and Nucleus Sampling; Which Decoding Method Is Best?; Conclusion
6. Summarization
The CNN/DailyMail Dataset; Text Summarization Pipelines; Summarization Baseline; GPT-2; T5; BART; PEGASUS; Comparing Different Summaries; Measuring the Quality of Generated Text; BLEU; ROUGE; Evaluating PEGASUS on the CNN/DailyMail Dataset; Training a Summarization Model; Evaluating PEGASUS on SAMSum; Fine-Tuning PEGASUS; Generating Dialogue Summaries; Conclusion
7. Question Answering
Building a Review-Based QA System; The Dataset; Extracting Answers from Text; Using Haystack to Build a QA Pipeline; Improving Our QA Pipeline; Evaluating the Retriever; Evaluating the Reader; Domain Adaptation; Evaluating the Whole QA Pipeline; Going Beyond Extractive QA; Conclusion
8. Making Transformers Efficient in Production
Intent Detection as a Case Study; Creating a Performance Benchmark; Making Models Smaller via Knowledge Distillation; Knowledge Distillation for Fine-Tuning; Knowledge Distillation for Pretraining; Creating a Knowledge Distillation Trainer; Choosing a Good Student Initialization; Finding Good Hyperparameters with Optuna; Benchmarking Our Distilled Model; Making Models Faster with Quantization; Benchmarking Our Quantized Model; Optimizing Inference with ONNX and the ONNX Runtime; Making Models Sparser with Weight Pruning; Sparsity in Deep Neural Networks; Weight Pruning Methods; Conclusion

9. Dealing with Few to No Labels
Building a GitHub Issues Tagger; Getting the Data; Preparing the Data; Creating Training Sets; Creating Training Slices; Implementing a Naive Bayesline; Working with No Labeled Data; Working with a Few Labels; Data Augmentation; Using Embeddings as a Lookup Table; Fine-Tuning a Vanilla Transformer; In-Context and Few-Shot Learning with Prompts; Leveraging Unlabeled Data; Fine-Tuning a Language Model; Fine-Tuning a Classifier; Advanced Methods; Conclusion
10. Training Transformers from Scratch
Large Datasets and Where to Find Them; Challenges of Building a Large-Scale Corpus; Building a Custom Code Dataset; Working with Large Datasets; Adding Datasets to the Hugging Face Hub; Building a Tokenizer; The Tokenizer Model; Measuring Tokenizer Performance; A Tokenizer for Python; Training a Tokenizer; Saving a Custom Tokenizer on the Hub; Training a Model from Scratch; A Tale of Pretraining Objectives; Initializing the Model; Implementing the Dataloader; Defining the Training Loop; The Training Run; Results and Analysis; Conclusion
11. Future Directions
Scaling Transformers; Scaling Laws; Challenges with Scaling; Attention Please!; Sparse Attention; Linearized Attention; Going Beyond Text; Vision; Tables; Multimodal Transformers; Speech-to-Text; Vision and Text; Where to from Here?
Index
About the Authors

Content preview from Natural Language Processing with Transformers, Revised Edition

Chapter 1. Hello Transformers

In 2017, researchers at Google published a paper that proposed a novel neural network architecture for sequence modeling.¹ Dubbed the Transformer, this architecture outperformed recurrent neural networks (RNNs) on machine translation tasks, both in terms of translation quality and training cost.

In parallel, an effective transfer learning method called ULMFiT showed that training long short-term memory (LSTM) networks on a very large and diverse corpus could produce state-of-the-art text classifiers with little labeled data.²

These advances were the catalysts for two of today’s best-known transformers: the Generative Pretrained Transformer (GPT)³ and Bidirectional Encoder Representations from Transformers (BERT).⁴ By combining the Transformer architecture with unsupervised learning, these models removed the need to train task-specific architectures from scratch and broke almost every benchmark in NLP by a significant margin. Since the release of GPT and BERT, a zoo of transformer models has emerged; a timeline of the most prominent entries is shown in Figure 1-1.

But we’re getting ahead of ourselves. To understand what is novel about transformers, we first need to explain:

The encoder-decoder framework
Attention mechanisms
Transfer learning

In this chapter we’ll introduce the core concepts that underlie ...

Publisher Resources

ISBN: 9781098136789

Figure 1-1. The transformers timeline
