Fine-Tuning AI

Name: Fine-Tuning AI
Author: Laurence Moroney
ISBN: 9798341673373

by Laurence Moroney

July 2027

Intermediate to advanced

400 pages

2h 49m

English

O'Reilly Media, Inc.

Brief Table of Contents (Not Yet Final)
1. From Generalist to Specialist
   The Cost of Renting Intelligence
   The Token Tax
   The Agentic Multiplier
   The Latency Killer
   The Data Firewall
   The Metadata Leak
   Compliance Issues
   Model Drift
   The Specialist Revolution
   Hardware is no longer the blocker
   The Spectrum of Customization
2. Model Architectures: The DNA of AI
   The Flow of Information: How transformers work
   Attention Mechanisms
   Multi Head Attention
   The Feed Forward Network
   Layer Normalization and Residual Connections
   Decoder-Only Dominance
   Grouped Query Attention (GQA)
   Continued Innovation: Mistral and Mixtral
   Sliding Window Attention
   Sparse Mixture of Experts
   Gemma: Google’s Open Approach
   Tokenizers and Embeddings
   Tokenization
   Byte Pair Encoding (BPE)
   Embeddings
   Other Considerations
   Instruction Following
   Reasoning Depth
   Context Windows and the ‘Lost in the Middle’ problem
   Summary
3. The Fine-Tuner’s Workshop
   Why Setup Matters
   The GPU: Your First Decision
   VRAM: One Number to Rule Them All
   Understanding GPUs
   The Software Stack
   Choosing Your Operating System
   Your Programming Environment
   The Core Library Stack
   Experiment Tracking
   Cloud and Pay-as-You-Go
   Local or Cloud—How to Decide
   Summary
4. Excavating Data
   Sourcing Raw Data
   Public Datasets with Hugging Face: Data Sourcing and Provenance
   Kaggle Datasets and Competitions
   Scraping Domain-Specific Data
   Cleaning Your Data
   The Risk of Leaking PII
   Detecting PII
   PII Redaction Strategies
   Understanding Data Quality
   Statistical Profiling
   Spot Checking
   Contamination Checks
   Building Your Data Pipeline
   Your Output Format
   Dataset Cards
   Summary
5. Synthetic Data
   The Teacher-Student Paradigm
   Generating Instruction-Response Pairs
   The Naive Approach
   Seed-Based Generation
   Prompting the Teacher
   Batching for Cost Efficiency
   Understanding Evol-Instruct
   Generating Chain-of-Thought Reasoning
   Quality Control for Synthetic Data
   Hallucination Detection
   Data Diversity Metrics
   Filtering the Final Dataset
   Mixing Real and Synthetic Data
   Summary
About the Author

Content preview from Fine-Tuning AI

Chapter 5. Synthetic Data

Chapter 4 covered the strategies for collecting and cleaning real-world data. But what happens when you’ve exhausted those strategies and still don’t have enough coverage to train a model effectively? Maybe your domain is so specialized that public datasets don’t exist for it. Or perhaps the data you need (such as real patient questions or financial transactions) is locked behind privacy regulations, so you can’t use it for training even if you do have it!

You’ve tried training with the data you have, and you’ve found that it just doesn’t work. Exploring a different architecture might help, but if you’re honest with yourself, the real problem is that you don’t have enough data. So what happens next?

This is where synthetic data enters the picture.

The core idea is simple: use a large, capable model (the “teacher”) to generate training examples that a smaller, cheaper model (the “student”) will learn from. The teacher already knows how to answer medical questions, write legal analyses, or debug code. ...
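To make the teacher-student idea concrete, here is a minimal sketch of the loop it implies: hand each seed instruction to a teacher, keep the answer, and write the pairs out for the student to train on. Everything named here is an assumption for illustration only. `generate_pairs`, `stub_teacher`, and `to_jsonl` are hypothetical helpers, the `instruction`/`response` field names are just one common convention, and the stub stands in for a real call to a large model so that the example runs without any API key.

```python
import json
from typing import Callable, Iterable


def generate_pairs(
    seed_instructions: Iterable[str],
    teacher: Callable[[str], str],
) -> list[dict[str, str]]:
    """Ask the teacher to answer each seed instruction and collect the pairs."""
    pairs = []
    for instruction in seed_instructions:
        response = teacher(instruction).strip()
        if not response:
            # Skip empty answers rather than training the student on them.
            continue
        pairs.append({"instruction": instruction, "response": response})
    return pairs


def stub_teacher(prompt: str) -> str:
    """Hypothetical stand-in for a call to a large, capable model."""
    return f"[teacher answer to: {prompt}]"


def to_jsonl(pairs: list[dict[str, str]]) -> str:
    """Serialize the pairs as one JSON object per line."""
    return "\n".join(json.dumps(pair) for pair in pairs)


# Illustrative seeds only; a real project would draw these from its own domain.
seeds = [
    "Explain why this loop raises IndexError: for i in range(len(xs) + 1): xs[i]",
    "Summarize the difference between a warranty and an indemnity.",
]
dataset = generate_pairs(seeds, stub_teacher)
print(to_jsonl(dataset))
```

Because the teacher is passed in as a plain callable, you can swap the stub for a function that calls whichever large model you have access to, without changing the rest of the loop.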

Publisher Resources

ISBN: 0642572310455
