Fine-Tuning AI

Name: Fine-Tuning AI
Author: Laurence Moroney
ISBN: 9798341673373

by Laurence Moroney

July 2027

Intermediate to advanced

400 pages

2h 49m

English

O'Reilly Media, Inc.

Brief Table of Contents (Not Yet Final)
1. From Generalist to Specialist
  The Cost of Renting Intelligence
  The Token Tax
  The Agentic Multiplier
  The Latency Killer
  The Data Firewall
  The Metadata Leak
  Compliance Issues
  Model Drift
  The Specialist Revolution
  Hardware is no longer the blocker
  The Spectrum of Customization
2. Model Architectures: The DNA of AI
  The Flow of Information: How transformers work
  Attention Mechanisms
  Multi Head Attention
  The Feed Forward Network
  Layer Normalization and Residual Connections
  Decoder-Only Dominance
  Grouped Query Attention (GQA)
  Continued Innovation: Mistral and Mixtral
  Sliding Window Attention
  Sparse Mixture of Experts
  Gemma: Google’s Open Approach
  Tokenizers and Embeddings
  Tokenization
  Byte Pair Encoding (BPE)
  Embeddings
  Other Considerations
  Instruction Following
  Reasoning Depth
  Context Windows and the ‘Lost in the Middle’ problem
  Summary
3. The Fine-Tuner’s Workshop
  Why Setup Matters
  The GPU: Your First Decision
  VRAM: One Number to Rule Them All
  Understanding GPUs
  The Software Stack
  Choosing Your Operating System
  Your Programming Environment
  The Core Library Stack
  Experiment Tracking
  Cloud and Pay-as-You-Go
  Local or Cloud—How to Decide
  Summary
4. Excavating Data
  Sourcing Raw Data
  Public Datasets with Hugging Face: Data Sourcing and Provenance
  Kaggle Datasets and Competitions
  Scraping Domain-Specific Data
  Cleaning Your Data
  The Risk of Leaking PII
  Detecting PII
  PII Redaction Strategies
  Understanding Data Quality
  Statistical Profiling
  Spot Checking
  Contamination Checks
  Building Your Data Pipeline
  Your Output Format
  Dataset Cards
  Summary
5. Synthetic Data
  The Teacher-Student Paradigm
  Generating Instruction-Response Pairs
  The Naive Approach
  Seed-Based Generation
  Prompting the Teacher
  Batching for Cost Efficiency
  Understanding Evol-Instruct
  Generating Chain-of-Thought Reasoning
  Quality Control for Synthetic Data
  Hallucination Detection
  Data Diversity Metrics
  Filtering the Final Dataset
  Mixing Real and Synthetic Data
  Summary
About the Author

Content preview from Fine-Tuning AI

Chapter 4. Excavating Data

Model architectures are usually very open. There are open source reference implementations, papers that you can read and replicate, and tutorials showing you how to code them from scratch. Similarly, training techniques, hyperparameters, and best practices are easy to find. But data, particularly the data that you or your customers have gleaned from years of practicing in your business domain, is proprietary, domain-specific, and hard won. It’s the thing that differentiates you from everyone around you.

It’s the data moat that surrounds your opportunity and distinguishes you from others.

One thing to note, and we’ll focus on this during this chapter, is the data within your data. Often, we have datasets that have already been structured for us so that we can build and use computer-based systems. But in the age of AI, and with the artificial understanding that modern models can give you, you also have the opportunity to find value within the massive ...

Publisher Resources

ISBN: 0642572310455
