Chapter 16. Vision and Multimodal Transformers
In the previous chapter, we implemented a transformer from scratch and turned it into a translation system. We then explored encoder-only models for NLU and decoder-only models for NLG, and we even built a little chatbot. That was quite a journey! Yet there's still a lot more to say about transformers. In particular, we have only dealt with text so far, but transformers turned out to be exceptionally good at processing all sorts of inputs. In this chapter we will cover vision transformers (ViTs), which can process images, followed by multimodal transformers, which can handle multiple modalities: text, images, audio, videos, robot sensors and actuators, and really any kind of data.
In the first part of this chapter, we will discuss some of the most influential pure-vision transformers:
- DETR (Detection Transformer): An early encoder-decoder transformer for object detection.
- The original ViT (Vision Transformer): This landmark encoder-only transformer treats image patches like word tokens and reaches the state of the art when trained on a large enough dataset (a minimal sketch of its patch-embedding step follows this list).
- DeiT (Data-Efficient Image Transformer): A more data-efficient ViT trained at scale using distillation.
- PVT (Pyramid Vision Transformer): A hierarchical model that produces multiscale feature maps for semantic segmentation and other dense prediction tasks.
- Swin Transformer (Shifted Windows Transformer): A much faster hierarchical model that computes self-attention within shifted local windows.
- DINO (self-distillation with no labels): A ViT trained entirely without labels, using self-supervised distillation.
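To make the "patches as tokens" idea concrete, here is a minimal NumPy sketch of ViT's patch-embedding step. The sizes are assumptions for illustration (a 224×224 RGB image, 16×16 patches, and a 768-dimensional embedding, matching ViT-Base), and the weights are random placeholders standing in for learned parameters:

```python
import numpy as np

# Assumed ViT-Base sizes: a 224×224 RGB image, 16×16 patches, 768-dim tokens
image = np.random.rand(224, 224, 3)           # H × W × C
P, D = 16, 768                                # patch size, embedding dimension

H, W, C = image.shape
n_patches = (H // P) * (W // P)               # 14 × 14 = 196 patches

# Cut the image into non-overlapping P×P patches and flatten each one
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(n_patches, P * P * C)

# Linearly project each flattened patch to a D-dim "token", just like a
# word embedding (in a real ViT this projection is learned; here it's random)
W_embed = np.random.randn(P * P * C, D) * 0.02
tokens = patches @ W_embed                    # shape: (196, 768)

# Prepend a [CLS] token and add positional embeddings (both learned in a
# real ViT); the resulting sequence goes through a plain transformer encoder
cls_token = np.zeros((1, D))
pos_embed = np.random.randn(n_patches + 1, D) * 0.02
sequence = np.concatenate([cls_token, tokens], axis=0) + pos_embed
print(sequence.shape)                         # (197, 768)
```

From the encoder's point of view, the 196 patch tokens are indistinguishable from a 196-word sentence: everything after this embedding step is a standard transformer encoder.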