Transformers for Natural Language Processing and Computer Vision - Third Edition

The definitive guide to LLMs, from architectures, pretraining, and fine-tuning to Retrieval Augmented Generation (RAG), multimodal Generative AI, risks, and implementations with ChatGPT Plus (GPT-4), Hugging Face, and Vertex AI

Key Features

  • Compare and contrast 20+ models (including GPT-4, BERT, and Llama 2) and multiple platforms and libraries to find the right solution for your project
  • Apply RAG with LLMs using customized texts and embeddings
  • Mitigate LLM risks, such as hallucinations, using moderation models and knowledge bases
  • Purchase of the print or Kindle book includes a free eBook in PDF format

Book Description

Transformers for Natural Language Processing and Computer Vision, Third Edition, explores Large Language Model (LLM) architectures, applications, and various platforms (Hugging Face, OpenAI, and Google Vertex AI) used for Natural Language Processing (NLP) and Computer Vision (CV).

The book guides you through different transformer architectures to the latest Foundation Models and Generative AI. You’ll pretrain and fine-tune LLMs and work through different use cases, from summarization to implementing question-answering systems with embedding-based search techniques. You will also learn about the risks of LLMs, from hallucinations and memorization to privacy, and how to mitigate those risks using moderation models combined with rule bases and knowledge bases. You’ll implement Retrieval Augmented Generation (RAG) with LLMs to improve the accuracy of your models and gain greater control over LLM outputs.

Dive into generative vision transformers and multimodal model architectures and build applications, such as image and video-to-text classifiers. Go further by combining different models and platforms and learning about AI agent replication.

This book provides you with an understanding of transformer architectures, pretraining, fine-tuning, LLM use cases, and best practices.

What you will learn

  • Break down and understand the architectures of the Original Transformer, BERT, GPT models, T5, PaLM, ViT, CLIP, and DALL-E
  • Fine-tune BERT, GPT, and PaLM 2 models
  • Learn about different tokenizers and the best practices for preprocessing language data
  • Pretrain a RoBERTa model from scratch
  • Implement Retrieval Augmented Generation (RAG) and rule bases to mitigate hallucinations
  • Visualize transformer model activity for deeper insights using BertViz, LIME, and SHAP
  • Go in-depth into vision transformers with CLIP, DALL-E 2, DALL-E 3, and GPT-4V

Who this book is for

This book is ideal for NLP and CV engineers, software developers, data scientists, machine learning engineers, and technical leaders looking to advance their LLM and generative AI skills or explore the latest trends in the field. Knowledge of Python and machine learning concepts is required to fully understand the use cases and code examples. However, with examples using LLM user interfaces, prompt engineering, and no-code model building, this book is great for anyone curious about the AI revolution.

Table of contents

  1. Preface
    1. Who this book is for
    2. What this book covers
    3. To get the most out of this book
    4. Get in touch
  2. What Are Transformers?
    1. How constant time complexity O(1) changed our lives forever
      1. O(1) attention conquers O(n) recurrent methods
        1. Attention layer
        2. Recurrent layer
      2. The magic of the computational time complexity of an attention layer
        1. Computational time complexity with a CPU
        2. Computational time complexity with a GPU
        3. Computational time complexity with a TPU
        4. TPU-LLM
      3. A brief journey from recurrent to attention
        1. A brief history
    2. From one token to an AI revolution
      1. From one token to everything
    3. Foundation Models
      1. From general purpose to specific tasks
    4. The role of AI professionals
      1. The future of AI professionals
      2. What resources should we use?
      3. Decision-making guidelines
    5. The rise of transformer seamless APIs and assistants
      1. Choosing ready-to-use API-driven libraries
      2. Choosing a cloud platform and transformer model
    6. Summary
    7. Questions
    8. References
    9. Further reading
  3. Getting Started with the Architecture of the Transformer Model
    1. The rise of the Transformer: Attention Is All You Need
      1. The encoder stack
        1. Input embedding
        2. Positional encoding
        3. Sublayer 1: Multi-head attention
        4. Sublayer 2: Feedforward network
      2. The decoder stack
        1. Output embedding and position encoding
        2. The attention layers
        3. The FFN sublayer, the post-LN, and the linear layer
    2. Training and performance
    3. Hugging Face transformer models
    4. Summary
    5. Questions
    6. References
    7. Further reading
  4. Emergent vs Downstream Tasks: The Unseen Depths of Transformers
    1. The paradigm shift: What is an NLP task?
      1. Inside the head of the attention sublayer of a transformer
      2. Exploring emergence with ChatGPT
    2. Investigating the potential of downstream tasks
      1. Evaluating models with metrics
        1. Accuracy score
        2. F1-score
        3. MCC
      2. Human evaluation
        1. Benchmark tasks and datasets
        2. Defining the SuperGLUE benchmark tasks
    3. Running downstream tasks
      1. The Corpus of Linguistic Acceptability (CoLA)
      2. Stanford Sentiment TreeBank (SST-2)
      3. Microsoft Research Paraphrase Corpus (MRPC)
      4. Winograd schemas
    4. Summary
    5. Questions
    6. References
    7. Further reading
  5. Advancements in Translations with Google Trax, Google Translate, and Gemini
    1. Defining machine translation
      1. Human transductions and translations
      2. Machine transductions and translations
    2. Evaluating machine translations
      1. Preprocessing a WMT dataset
        1. Preprocessing the raw data
        2. Finalizing the preprocessing of the datasets
      2. Evaluating machine translations with BLEU
        1. Geometric evaluations
        2. Applying a smoothing technique
    3. Translations with Google Trax
      1. Installing Trax
      2. Creating the Original Transformer model
      3. Initializing the model using pretrained weights
      4. Tokenizing a sentence
      5. Decoding from the Transformer
      6. De-tokenizing and displaying the translation
    4. Translation with Google Translate
      1. Translation with a Google Translate AJAX API Wrapper
        1. Implementing googletrans
    5. Translation with Gemini
      1. Gemini’s potential
    6. Summary
    7. Questions
    8. References
    9. Further reading
  6. Diving into Fine-Tuning through BERT
    1. The architecture of BERT
      1. The encoder stack
        1. Preparing the pretraining input environment
        2. Pretraining and fine-tuning a BERT model
    2. Fine-tuning BERT
      1. Defining a goal
      2. Hardware constraints
      3. Installing Hugging Face Transformers
      4. Importing the modules
      5. Specifying CUDA as the device for torch
      6. Loading the CoLA dataset
      7. Creating sentences, label lists, and adding BERT tokens
      8. Activating the BERT tokenizer
      9. Processing the data
      10. Creating attention masks
      11. Splitting the data into training and validation sets
      12. Converting all the data into torch tensors
      13. Selecting a batch size and creating an iterator
      14. BERT model configuration
      15. Loading the Hugging Face BERT uncased base model
      16. Optimizer grouped parameters
      17. The hyperparameters for the training loop
      18. The training loop
      19. Training evaluation
      20. Predicting and evaluating using the holdout dataset
        1. Exploring the prediction process
      21. Evaluating using the Matthews correlation coefficient
      22. Matthews correlation coefficient evaluation for the whole dataset
    3. Building a Python interface to interact with the model
      1. Saving the model
      2. Creating an interface for the trained model
        1. Interacting with the model
    4. Summary
    5. Questions
    6. References
    7. Further reading
  7. Pretraining a Transformer from Scratch through RoBERTa
    1. Training a tokenizer and pretraining a transformer
    2. Building KantaiBERT from scratch
      1. Step 1: Loading the dataset
      2. Step 2: Installing Hugging Face transformers
      3. Step 3: Training a tokenizer
      4. Step 4: Saving the files to disk
      5. Step 5: Loading the trained tokenizer files
      6. Step 6: Checking resource constraints: GPU and CUDA
      7. Step 7: Defining the configuration of the model
      8. Step 8: Reloading the tokenizer in transformers
      9. Step 9: Initializing a model from scratch
        1. Exploring the parameters
      10. Step 10: Building the dataset
      11. Step 11: Defining a data collator
      12. Step 12: Initializing the trainer
      13. Step 13: Pretraining the model
      14. Step 14: Saving the final model (+tokenizer + config) to disk
      15. Step 15: Language modeling with FillMaskPipeline
    3. Pretraining a Generative AI customer support model on X data
      1. Step 1: Downloading the dataset
      2. Step 2: Installing Hugging Face transformers
      3. Step 3: Loading and filtering the data
      4. Step 4: Checking resource constraints: GPU and CUDA
      5. Step 5: Defining the configuration of the model
      6. Step 6: Creating and processing the dataset
      7. Step 7: Initializing the trainer
      8. Step 8: Pretraining the model
      9. Step 9: Saving the model
      10. Step 10: User interface to chat with the Generative AI agent
      11. Further pretraining
      12. Limitations
    4. Next steps
    5. Summary
    6. Questions
    7. References
    8. Further reading
  8. The Generative AI Revolution with ChatGPT
    1. GPTs as GPTs
      1. Improvement
      2. Diffusion
        1. New application sectors
        2. Self-service assistants
        3. Development assistants
      3. Pervasiveness
    2. The architecture of OpenAI GPT transformer models
      1. The rise of billion-parameter transformer models
      2. The increasing size of transformer models
      3. Context size and maximum path length
      4. From fine-tuning to zero-shot models
      5. Stacking decoder layers
      6. GPT models
    3. OpenAI models as assistants
      1. ChatGPT provides source code
      2. GitHub Copilot code assistant
      3. General-purpose prompt examples
      4. Getting started with ChatGPT – GPT-4 as an assistant
        1. 1. GPT-4 helps to explain how to write source code
        2. 2. GPT-4 creates a function to show the YouTube presentation of GPT-4 by Greg Brockman on March 14, 2023
        3. 3. GPT-4 creates an application for WikiArt to display images
        4. 4. GPT-4 creates an application to display IMDb reviews
        5. 5. GPT-4 creates an application to display a newsfeed
        6. 6. GPT-4 creates a k-means clustering (KMC) algorithm
    4. Getting started with the GPT-4 API
      1. Running our first NLP task with GPT-4
        1. Step 1: Installing OpenAI and Step 2: Entering the API key
        2. Step 3: Running an NLP task with GPT-4
        3. Key hyperparameters
      2. Running multiple NLP tasks
    5. Retrieval Augmented Generation (RAG) with GPT-4
      1. Installation
      2. Document retrieval
      3. Augmented retrieval generation
    6. Summary
    7. Questions
    8. References
    9. Further reading
  9. Fine-Tuning OpenAI GPT Models
    1. Risk management
    2. Fine-tuning a GPT model for completion (generative)
    3. 1. Preparing the dataset
      1. 1.1. Preparing the data in JSON
      2. 1.2. Converting the data to JSONL
    4. 2. Fine-tuning an original model
    5. 3. Running the fine-tuned GPT model
    6. 4. Managing fine-tuned jobs and models
    7. Before leaving
    8. Summary
    9. Questions
    10. References
    11. Further reading
  10. Shattering the Black Box with Interpretable Tools
    1. Transformer visualization with BertViz
      1. Running BertViz
        1. Step 1: Installing BertViz and importing the modules
        2. Step 2: Load the models and retrieve attention
        3. Step 3: Head view
        4. Step 4: Processing and displaying attention heads
        5. Step 5: Model view
        6. Step 6: Displaying the output probabilities of attention heads
        7. Streaming the output of the attention heads
        8. Visualizing word relationships using attention scores with pandas
        9. exBERT
    2. Interpreting Hugging Face transformers with SHAP
      1. Introducing SHAP
        1. Explaining Hugging Face outputs with SHAP
    3. Transformer visualization via dictionary learning
      1. Transformer factors
      2. Introducing LIME
      3. The visualization interface
    4. Other interpretable AI tools
      1. LIT
        1. PCA
        2. Running LIT
      2. OpenAI LLMs explain neurons in transformers
      3. Limitations and human control
    5. Summary
    6. Questions
    7. References
    8. Further reading
  11. Investigating the Role of Tokenizers in Shaping Transformer Models
    1. Matching datasets and tokenizers
      1. Best practices
        1. Step 1: Preprocessing
        2. Step 2: Quality control
        3. Step 3: Continuous human quality control
      2. Word2Vec tokenization
        1. Case 0: Words in the dataset and the dictionary
        2. Case 1: Words not in the dataset or the dictionary
        3. Case 2: Noisy relationships
        4. Case 3: Words in a text but not in the dictionary
        5. Case 4: Rare words
        6. Case 5: Replacing rare words
    2. Exploring sentence and WordPiece tokenizers to understand the efficiency of subword tokenizers for transformers
      1. Word and sentence tokenizers
        1. Sentence tokenization
        2. Word tokenization
        3. Regular expression tokenization
        4. Treebank tokenization
        5. White space tokenization
        6. Punkt tokenization
        7. Word punctuation tokenization
        8. Multi-word tokenization
      2. Subword tokenizers
        1. Unigram language model tokenization
        2. SentencePiece
        3. Byte-Pair Encoding (BPE)
        4. WordPiece
      3. Exploring in code
        1. Detecting the type of tokenizer
        2. Displaying token-ID mappings
        3. Analyzing and controlling the quality of token-ID mappings
    3. Summary
    4. Questions
    5. References
    6. Further reading
  12. Leveraging LLM Embeddings as an Alternative to Fine-Tuning
    1. LLM embeddings as an alternative to fine-tuning
      1. From prompt design to prompt engineering
    2. Fundamentals of text embedding with NLTK and Gensim
      1. Installing libraries
      2. 1. Reading the text file
      3. 2. Tokenizing the text with Punkt
        1. Preprocessing the tokens
      4. 3. Embedding with Gensim and Word2Vec
      5. 4. Model description
      6. 5. Accessing a word and vector
      7. 6. Exploring Gensim’s vector space
      8. 7. TensorFlow Projector
    3. Implementing question-answering systems with embedding-based search techniques
      1. 1. Installing the libraries and selecting the models
      2. 2. Implementing the embedding model and the GPT model
        1. 2.1 Evaluating the model with a knowledge base: GPT can answer questions
        2. 2.2 Add a knowledge base
        3. 2.3 Evaluating the model without a knowledge base: GPT cannot answer questions
      3. 3. Prepare search data
      4. 4. Search
      5. 5. Ask
        1. 5.1. Example question
        2. 5.2. Troubleshooting wrong answers
    4. Transfer learning with Ada embeddings
      1. 1. The Amazon Fine Food Reviews dataset
        1. 1.2. Data preparation
      2. 2. Running Ada embeddings and saving them for future reuse
      3. 3. Clustering
        1. 3.1. Find the clusters using k-means clustering
        2. 3.2. Display clusters with t-SNE
      4. 4. Text samples in the clusters and naming the clusters
    5. Summary
    6. Questions
    7. References
    8. Further reading
  13. Toward Syntax-Free Semantic Role Labeling with ChatGPT and GPT-4
    1. Getting started with cutting-edge SRL
    2. Entering the syntax-free world of AI
    3. Defining SRL
      1. Visualizing SRL
    4. SRL experiments with ChatGPT with GPT-4
      1. Basic sample
      2. Difficult sample
    5. Questioning the scope of SRL
      1. The challenges of predicate analysis
    6. Redefining SRL
    7. From task-specific SRL to emergence with ChatGPT
      1. 1. Installing OpenAI
      2. 2. GPT-4 dialog function
      3. 3. SRL
        1. Sample 1 (basic)
        2. Sample 2 (basic)
        3. Sample 3 (basic)
        4. Sample 4 (difficult)
        5. Sample 5 (difficult)
        6. Sample 6 (difficult)
    8. Summary
    9. Questions
    10. References
    11. Further reading
  14. Summarization with T5 and ChatGPT
    1. Designing a universal text-to-text model
    2. The rise of text-to-text transformer models
    3. A prefix instead of task-specific formats
    4. The T5 model
    5. Text summarization with T5
      1. Hugging Face
        1. Selecting a Hugging Face transformer model
      2. Initializing the T5-large transformer model
        1. Getting started with T5
        2. Exploring the architecture of the T5 model
      3. Summarizing documents with T5-large
        1. Creating a summarization function
        2. A general topic sample
        3. The Bill of Rights sample
        4. A corporate law sample
    6. From text-to-text to new word predictions with OpenAI ChatGPT
      1. Comparing T5 and ChatGPT’s summarization methods
        1. Pretraining
        2. Specific versus non-specific tasks
      2. Summarization with ChatGPT
    7. Summary
    8. Questions
    9. References
    10. Further reading
  15. Exploring Cutting-Edge LLMs with Vertex AI and PaLM 2
    1. Architecture
      1. Pathways
        1. Client
        2. Resource manager
        3. Intermediate representation
        4. Compiler
        5. Scheduler
        6. Executor
      2. PaLM
        1. Parallel layer processing that increases training speed
        2. Shared input-output embeddings, which saves memory
        3. No biases, which improves training stability
        4. Rotary Positional Embedding (RoPE) improves model quality
        5. SwiGLU activations improve model quality
      3. PaLM 2
        1. Improved performance, faster, and more efficient
        2. Scaling laws, optimal model size, and the number of parameters
        3. State-of-the-art (SOA) performance and a new training methodology
    2. Assistants
      1. Gemini
      2. Google Workspace
      3. Google Colab Copilot
      4. Vertex AI PaLM 2 interface
        1. Vertex AI PaLM 2 assistant
    3. Vertex AI PaLM 2 API
      1. Question answering
      2. Question-answer task
      3. Summarization of a conversation
      4. Sentiment analysis
      5. Multi-choice problems
      6. Code
    4. Fine-tuning
      1. Creating a bucket
      2. Fine-tuning the model
    5. Summary
    6. Questions
    7. References
    8. Further reading
  16. Guarding the Giants: Mitigating Risks in Large Language Models
    1. The emergence of functional AGI
    2. Cutting-edge platform installation limitations
    3. Auto-BIG-bench
    4. WandB
    5. When will AI agents replicate?
      1. Function: `create_vocab`
        1. Process:
      2. Function: `scrape_wikipedia`
        1. Process:
      3. Function: `create_dataset`
        1. Process:
      4. Classes: `TextDataset`, `Encoder`, and `Decoder`
      5. Function: `count_parameters`
      6. Function: `main`
        1. Process:
      7. Saving and Executing the Model
    6. Risk management
      1. Hallucinations and memorization
        1. Memorization
      2. Risky emergent behaviors
      3. Disinformation
      4. Influence operations
      5. Harmful content
      6. Privacy
      7. Cybersecurity
    7. Risk mitigation tools with RLHF and RAG
      1. 1. Input and output moderation with transformers and a rule base
      2. 2. Building a knowledge base for ChatGPT and GPT-4
        1. Adding keywords
      3. 3. Parsing the user requests and accessing the KB
      4. 4. Generating ChatGPT content with a dialog function
        1. Token control
        2. Moderation
    8. Summary
    9. Questions
    10. References
    11. Further reading
  17. Beyond Text: Vision Transformers in the Dawn of Revolutionary AI
    1. From task-agnostic models to multimodal vision transformers
    2. ViT – Vision Transformer
      1. The basic architecture of ViT
        1. Step 1: Splitting the image into patches
        2. Step 2: Building a vocabulary of image patches
        3. Step 3: The transformer
      2. Vision transformers in code
        1. A feature extractor simulator
        2. The transformer
        3. Configuration and shapes
    3. CLIP
      1. The basic architecture of CLIP
      2. CLIP in code
    4. DALL-E 2 and DALL-E 3
      1. The basic architecture of DALL-E
      2. Getting started with the DALL-E 2 and DALL-E 3 API
        1. Creating a new image
        2. Creating a variation of an image
        3. From research to mainstream AI with DALL-E
    5. GPT-4V, DALL-E 3, and divergent semantic association
      1. Defining divergent semantic association
      2. Creating an image with ChatGPT Plus with DALL-E
      3. Implementing the GPT-4V API and experimenting with DAT
        1. Example 1: A standard image and text
        2. Example 2: Divergent semantic association, moderate divergence
        3. Example 3: Divergent semantic association, high divergence
    6. Summary
    7. Questions
    8. References
    9. Further reading
  18. Transcending the Image-Text Boundary with Stable Diffusion
    1. Transcending image generation boundaries
    2. Part I: Defining text-to-image with Stable Diffusion
      1. 1. Text embedding using a transformer encoder
      2. 2. Random image creation with noise
      3. 3. Stable Diffusion model downsampling
      4. 4. Decoder upsampling
      5. 5. Output image
      6. Running the Keras Stable Diffusion implementation
    3. Part II: Running text-to-image with Stable Diffusion
      1. Generative AI Stable Diffusion for a Divergent Association Task (DAT)
    4. Part III: Video
      1. Text-to-video with Stability AI animation
      2. Text-to-video, with a variation of OpenAI CLIP
      3. A video-to-text model with TimeSformer
      4. Preparing the video frames
      5. Putting the TimeSformer to work to make predictions on the video frames
    5. Summary
    6. Questions
    7. References
    8. Further reading
  19. Hugging Face AutoTrain: Training Vision Models without Coding
    1. Goal and scope of this chapter
    2. Getting started
    3. Uploading the dataset
      1. No coding?
    4. Training models with AutoTrain
    5. Deploying a model
    6. Running our models for inference
      1. Retrieving validation images
        1. The program will now attempt to classify the validation images. We will see how a vision transformer reacts to this image.
      2. Inference: image classification
      3. Validation experimentation on the trained models
        1. ViTForImageClassification
        2. SwinForImageClassification 1
        3. BeitForImageClassification
        4. SwinForImageClassification 2
        5. ConvNextForImageClassification
        6. ResNetForImageClassification
      4. Trying the top ViT model with a corpus
    7. Summary
    8. Questions
    9. References
    10. Further reading
  20. On the Road to Functional AGI with HuggingGPT and its Peers
    1. Defining F-AGI
    2. Installing and importing
    3. Validation set
      1. Level 1 image: easy
      2. Level 2 image: difficult
      3. Level 3 image: very difficult
    4. HuggingGPT
      1. Level 1: Easy
      2. Level 2: Difficult
      3. Level 3: Very difficult
    5. CustomGPT
      1. Google Cloud Vision
        1. Level 1: Easy
        2. Level 2: Difficult
        3. Level 3: Very difficult
      2. Model chaining: Chaining Google Cloud Vision to ChatGPT
    6. Model Chaining with Runway Gen-2
      1. Midjourney: Imagine a ship in the galaxy
      2. Gen-2: Make this ship sail the sea
    7. Summary
    8. Questions
    9. References
    10. Further reading
  21. Beyond Human-Designed Prompts with Generative Ideation
    1. Part I: Defining generative ideation
      1. Automated ideation architecture
      2. Scope and limitations
    2. Part II: Automating prompt design for generative image design
      1. ChatGPT/GPT-4 HTML presentation
        1. ChatGPT with GPT-4 provides the text for the presentation
        2. ChatGPT with GPT-4 provides a graph in HTML to illustrate the presentation
      2. Llama 2
        1. A brief introduction to Llama 2
      3. Implementing Llama 2 with Hugging Face
      4. Midjourney
        1. Discord API for Midjourney
      5. Microsoft Designer
    3. Part III: Automated generative ideation with Stable Diffusion
      1. 1. No prompt: Automated instruction for GPT-4
      2. 2. Generative AI (prompt generation) using ChatGPT with GPT-4
      3. 3. and 4. Generative AI with Stable Diffusion and displaying images
    4. The future is yours!
      1. The future of development through VR-AI
        1. The groundbreaking shift: Parallelization of development through the fusion of VR and AI
        2. Opportunities and risks
    5. Summary
    6. Questions
    7. References
    8. Further reading
  22. Appendix: Answers to the Questions
  23. Other Books You May Enjoy
  24. Index

Product information

  • Title: Transformers for Natural Language Processing and Computer Vision - Third Edition
  • Author(s): Denis Rothman
  • Release date: February 2024
  • Publisher(s): Packt Publishing
  • ISBN: 9781805128724