Chapter 6. Post-Training Vision Language Models
Up to the previous chapter we have been living the full “from scratch” fantasy. We took a tiny VLM, wired images into text, fought with padding and packing, watched the loss wobble its way down, and even made it answer a few questions about bears and mountains. That is roughly what pre-training and basic supervised training look like in miniature.
In practice, though, most people do not spin up a VLM from nothing. You usually start from a strong base model that already knows a lot about language and images, and then you nudge it into the behavior you actually want. That second phase is what people call post-training. It has two big ingredients:
- Supervised fine-tuning, where you show the model lots of “here is a prompt, here is a good answer” pairs so it can follow instructions and handle the tasks you care about.
- Alignment, where you teach the model what humans actually prefer, using techniques like RLHF, DPO, MPO, and GRPO so that the ...
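To make the alignment ingredient a bit more concrete, here is a minimal sketch of the DPO loss for a single preference pair. This is our own toy illustration, not code from the chapter's training stack: the function name and the example log-probabilities are invented, and in practice these log-probabilities would come from summing per-token log-probs of the policy and a frozen reference model over each response.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the summed log-probability of a full response
    under the policy (logp_*) or the frozen reference (ref_logp_*).
    """
    # Implicit rewards: how far the policy has moved from the
    # reference on each response, scaled by beta.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    # Logistic (Bradley-Terry) loss on the reward margin.
    margin = chosen_reward - rejected_reward
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# If the policy favors the chosen answer more than the reference
# does, the margin is positive and the loss drops below log 2.
loss = dpo_loss(-10.0, -14.0, -12.0, -13.0)
```

The appeal of DPO over classic RLHF is visible even in this sketch: there is no reward model and no RL loop, just a supervised loss computed from four log-probabilities per pair.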