Chapter 16. Vision and Multimodal Transformers
In the previous chapter, we implemented a transformer from scratch and turned it into a translation system. We then explored encoder-only models for NLU and decoder-only models for NLG, and we even built a little chatbot. That was quite a journey! Yet there's still a lot more to say about transformers. In particular, we have only dealt with text so far, but transformers turned out to be exceptionally good at processing all sorts of inputs. In this chapter we will cover vision transformers (ViTs), which can process images, followed by multimodal transformers, which can handle multiple modalities: text, images, audio, videos, robot sensors and actuators, and really any kind of data.
In the first part of this chapter, we will discuss some of the most influential pure-vision transformers:
- DETR (Detection Transformer): An early encoder-decoder transformer for object detection.
- The original ViT (Vision Transformer): This landmark encoder-only transformer treats image patches like word tokens and reaches the state of the art when trained on a large enough dataset (a minimal sketch of its patch-embedding step follows this list).
- DeiT (Data-Efficient Image Transformer): A more data-efficient ViT trained at scale using distillation.
- PVT (Pyramid Vision Transformer): A hierarchical model that produces multiscale feature maps for semantic segmentation and other dense prediction tasks.
- Swin Transformer (Shifted Windows Transformer): A much faster hierarchical model that computes self-attention within shifted local windows.
- DINO (self-distillation with no labels): A ViT trained entirely without labels, using self-supervised distillation.
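To make the "patches as tokens" idea concrete, here is a minimal NumPy sketch of ViT's patch-embedding step. The sizes are assumptions for illustration (a 224×224 RGB image, 16×16 patches, and a 768-dimensional embedding, matching ViT-Base), and the weights are random placeholders standing in for learned parameters:

```python
import numpy as np

# Assumed ViT-Base sizes: a 224×224 RGB image, 16×16 patches, 768-dim tokens
image = np.random.rand(224, 224, 3)           # H × W × C
P, D = 16, 768                                # patch size, embedding dimension

H, W, C = image.shape
n_patches = (H // P) * (W // P)               # 14 × 14 = 196 patches

# Cut the image into non-overlapping P×P patches and flatten each one
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, P * P * C)

# Linearly project each flattened patch to a D-dim "token", just like a
# word embedding (in a real ViT this projection is learned; here it's random)
W_embed = np.random.randn(P * P * C, D) * 0.02
tokens = patches @ W_embed                    # shape: (196, 768)

# Prepend a [CLS] token and add positional embeddings (both learned in a
# real ViT); the resulting sequence goes through a plain transformer encoder
cls_token = np.zeros((1, D))
pos_embed = np.random.randn(n_patches + 1, D) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0) + pos_embed
print(sequence.shape)                         # (197, 768)
```

From the encoder's point of view, the 196 patch tokens are indistinguishable from a 196-word sentence: everything after this embedding step is a standard transformer encoder.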