Hands-On Machine Learning with Scikit-Learn and PyTorch

Name: Hands-On Machine Learning with Scikit-Learn and PyTorch
Author: Aurélien Géron
ISBN: 9798341607989

October 2025

Intermediate to advanced

878 pages

26h 37m

English

O'Reilly Media, Inc.

Includes

Quizzes

Preface
Machine Learning in Your Projects; Objective and Approach; Code Examples; Prerequisites; Roadmap; Changes Between the TensorFlow and PyTorch Versions; Other Resources; Conventions Used in This Book; O’Reilly Online Learning; How to Contact Us; Acknowledgments
I. The Fundamentals of Machine Learning
1. The Machine Learning Landscape
What Is Machine Learning?; Why Use Machine Learning?; Examples of Applications; Types of Machine Learning Systems; Training Supervision; Batch Versus Online Learning; Instance-Based Versus Model-Based Learning; Main Challenges of Machine Learning; Insufficient Quantity of Training Data; Nonrepresentative Training Data; Poor-Quality Data; Irrelevant Features; Overfitting the Training Data; Underfitting the Training Data; Deployment Issues; Stepping Back; Testing and Validating; Hyperparameter Tuning and Model Selection; Data Mismatch; Exercises
2. End-to-End Machine Learning Project
Working with Real Data; Look at the Big Picture; Frame the Problem; Select a Performance Measure; Check the Assumptions; Get the Data; Running the Code Examples Using Google Colab; Saving Your Code Changes and Your Data; The Power and Danger of Interactivity; Book Code Versus Notebook Code; Download the Data; Take a Quick Look at the Data Structure; Create a Test Set; Explore and Visualize the Data to Gain Insights; Visualizing Geographical Data; Look for Correlations; Experiment with Attribute Combinations; Prepare the Data for Machine Learning Algorithms; Clean the Data; Handling Text and Categorical Attributes; Feature Scaling and Transformation; Custom Transformers; Transformation Pipelines; Select and Train a Model; Train and Evaluate on the Training Set; Better Evaluation Using Cross-Validation; Fine-Tune Your Model; Grid Search; Randomized Search; Ensemble Methods; Analyzing the Best Models and Their Errors; Evaluate Your System on the Test Set; Launch, Monitor, and Maintain Your System; Try It Out!; Exercises
3. Classification
MNIST; Training a Binary Classifier; Performance Measures; Measuring Accuracy Using Cross-Validation; Confusion Matrices; Precision and Recall; The Precision/Recall Trade-Off; The ROC Curve; Multiclass Classification; Error Analysis; Multilabel Classification; Multioutput Classification; Exercises
4. Training Models
Linear Regression; The Normal Equation; Computational Complexity; Gradient Descent; Batch Gradient Descent; Stochastic Gradient Descent; Mini-Batch Gradient Descent; Polynomial Regression; Learning Curves; Regularized Linear Models; Ridge Regression; Lasso Regression; Elastic Net Regression; Early Stopping; Logistic Regression; Estimating Probabilities; Training and Cost Function; Decision Boundaries; Softmax Regression; Exercises
5. Decision Trees
Training and Visualizing a Decision Tree; Making Predictions; Estimating Class Probabilities; The CART Training Algorithm; Computational Complexity; Gini Impurity or Entropy?; Regularization Hyperparameters; Regression; Sensitivity to Axis Orientation; Decision Trees Have a High Variance; Exercises
6. Ensemble Learning and Random Forests
Voting Classifiers; Bagging and Pasting; Bagging and Pasting in Scikit-Learn; Out-of-Bag Evaluation; Random Patches and Random Subspaces; Random Forests; Extra-Trees; Feature Importance; Boosting; AdaBoost; Gradient Boosting; Histogram-Based Gradient Boosting; Stacking; Exercises
7. Dimensionality Reduction
The Curse of Dimensionality; Main Approaches for Dimensionality Reduction; Projection; Manifold Learning; PCA; Preserving the Variance; Principal Components; Projecting Down to d Dimensions; Using Scikit-Learn; Explained Variance Ratio; Choosing the Right Number of Dimensions; PCA for Compression; Randomized PCA; Incremental PCA; Random Projection; LLE; Other Dimensionality Reduction Techniques; Exercises
8. Unsupervised Learning Techniques
Clustering Algorithms: k-means and DBSCAN; k-Means Clustering; Limits of k-Means; Using Clustering for Image Segmentation; Using Clustering for Semi-Supervised Learning; DBSCAN; Other Clustering Algorithms; Gaussian Mixtures; Using Gaussian Mixtures for Anomaly Detection; Selecting the Number of Clusters; Bayesian Gaussian Mixture Models; Other Algorithms for Anomaly and Novelty Detection; Exercises

II. Neural Networks and Deep Learning
9. Introduction to Artificial Neural Networks
From Biological to Artificial Neurons; Biological Neurons; Logical Computations with Neurons; The Perceptron; The Multilayer Perceptron and Backpropagation; Building and Training MLPs with Scikit-Learn; Regression MLPs; Classification MLPs; Hyperparameter Tuning Guidelines; Number of Hidden Layers; Number of Neurons per Hidden Layer; Learning Rate; Batch Size; Other Hyperparameters; Exercises
10. Building Neural Networks with PyTorch
PyTorch Fundamentals; PyTorch Tensors; Hardware Acceleration; Autograd; Implementing Linear Regression; Linear Regression Using Tensors and Autograd; Linear Regression Using PyTorch’s High-Level API; Implementing a Regression MLP; Implementing Mini-Batch Gradient Descent Using DataLoaders; Model Evaluation; Building Nonsequential Models Using Custom Modules; Building Models with Multiple Inputs; Building Models with Multiple Outputs; Building an Image Classifier with PyTorch; Using TorchVision to Load the Dataset; Building the Classifier; Fine-Tuning Neural Network Hyperparameters with Optuna; Saving and Loading PyTorch Models; Compiling and Optimizing a PyTorch Model; Exercises
11. Training Deep Neural Networks
The Vanishing/Exploding Gradients Problems; Glorot Initialization and He Initialization; Better Activation Functions; Batch Normalization; Layer Normalization; Gradient Clipping; Reusing Pretrained Layers; Transfer Learning with PyTorch; Unsupervised Pretraining; Pretraining on an Auxiliary Task; Faster Optimizers; Momentum; Nesterov Accelerated Gradient; AdaGrad; RMSProp; Adam; AdaMax; NAdam; AdamW; Learning Rate Scheduling; Exponential Scheduling; Cosine Annealing; Performance Scheduling; Warming Up the Learning Rate; Cosine Annealing with Warm Restarts; 1cycle Scheduling; Avoiding Overfitting Through Regularization; ℓ1 and ℓ2 Regularization; Dropout; Monte Carlo Dropout; Max-Norm Regularization; Practical Guidelines; Exercises
12. Deep Computer Vision Using Convolutional Neural Networks
The Architecture of the Visual Cortex; Convolutional Layers; Filters; Stacking Multiple Feature Maps; Implementing Convolutional Layers with PyTorch; Pooling Layers; Implementing Pooling Layers with PyTorch; CNN Architectures; LeNet-5; AlexNet; GoogLeNet; ResNet; Xception; SENet; Other Noteworthy Architectures; Choosing the Right CNN Architecture; GPU RAM Requirements: Inference Versus Training; Reversible Residual Networks (RevNets); Implementing a ResNet-34 CNN Using PyTorch; Using TorchVision’s Pretrained Models; Pretrained Models for Transfer Learning; Classification and Localization; Object Detection; Fully Convolutional Networks; You Only Look Once; Object Tracking; Semantic Segmentation; Exercises
13. Processing Sequences Using RNNs and CNNs
Recurrent Neurons and Layers; Memory Cells; Input and Output Sequences; Training RNNs; Forecasting a Time Series; The ARMA Model Family; Preparing the Data for Machine Learning Models; Forecasting Using a Linear Model; Forecasting Using a Simple RNN; Forecasting Using a Deep RNN; Forecasting Multivariate Time Series; Forecasting Several Time Steps Ahead; Forecasting Using a Sequence-to-Sequence Model; Handling Long Sequences; Fighting the Unstable Gradients Problem; Tackling the Short-Term Memory Problem; Exercises
14. Natural Language Processing with RNNs and Attention
Generating Shakespearean Text Using a Character RNN; Creating the Training Dataset; Embeddings; Building and Training the Char-RNN Model; Generating Fake Shakespearean Text; Sentiment Analysis Using Hugging Face Libraries; Tokenization Using the Hugging Face Tokenizers Library; Reusing Pretrained Tokenizers; Building and Training a Sentiment Analysis Model; Bidirectional RNNs; Reusing Pretrained Embeddings and Language Models; Task-Specific Classes; The Trainer API; Hugging Face Pipelines; An Encoder-Decoder Network for Neural Machine Translation; Beam Search; Attention Mechanisms; Exercises
15. Transformers for Natural Language Processing and Chatbots
Attention Is All You Need: The Original Transformer Architecture; Positional Encodings; Multi-Head Attention; Building the Rest of the Transformer; Building an English-to-Spanish Transformer; Encoder-Only Transformers for Natural Language Understanding; BERT’s Architecture; BERT Pretraining; BERT Fine-Tuning; Other Encoder-Only Models; Decoder-Only Transformers; GPT-1 Architecture and Generative Pretraining; GPT-2 and Zero-Shot Learning; GPT-3, In-Context Learning, One-Shot Learning, and Few-Shot Learning; Using GPT-2 to Generate Text; Using GPT-2 for Question Answering; Downloading and Running an Even Larger Model: Mistral-7B; Turning a Large Language Model into a Chatbot; Fine-Tuning a Model for Chatting and Following Instructions Using SFT and RLHF; Direct Preference Optimization (DPO); Fine-Tuning a Model Using the TRL Library; From a Chatbot Model to a Full Chatbot System; Model Context Protocol; Libraries and Tools; Encoder-Decoder Models; Exercises
16. Vision and Multimodal Transformers
Vision Transformers; RNNs with Visual Attention; DETR: A CNN-Transformer Hybrid for Object Detection; The Original ViT; Data-Efficient Image Transformer; Pyramid Vision Transformer for Dense Prediction Tasks; The Swin Transformer: A Fast and Versatile ViT; DINO: Self-Supervised Visual Representation Learning; Other Major Vision Models and Techniques; Multimodal Transformers; VideoBERT: A BERT Variant for Text plus Video; ViLBERT: A Dual-Stream Transformer for Text plus Image; CLIP: A Dual-Encoder Text plus Image Model Trained with Contrastive Pretraining; DALL·E: Generating Images from Text Prompts; Perceiver: Bridging High-Resolution Modalities with Latent Spaces; Perceiver IO: A Flexible Output Mechanism for the Perceiver; Flamingo: Open-Ended Visual Dialogue; BLIP and BLIP-2; Other Multimodal Models; Exercises
17. Speeding Up Transformers
18. Autoencoders, GANs, and Diffusion Models
Efficient Data Representations; Performing PCA with an Undercomplete Linear Autoencoder; Stacked Autoencoders; Implementing a Stacked Autoencoder Using PyTorch; Visualizing the Reconstructions; Anomaly Detection Using Autoencoders; Visualizing the Fashion MNIST Dataset; Unsupervised Pretraining Using Stacked Autoencoders; Tying Weights; Training One Autoencoder at a Time; Convolutional Autoencoders; Denoising Autoencoders; Sparse Autoencoders; Variational Autoencoders; Generating Fashion MNIST Images; Discrete Variational Autoencoders; Generative Adversarial Networks; The Difficulties of Training GANs; Diffusion Models; Exercises
19. Reinforcement Learning
What Is Reinforcement Learning?; Policy Gradients; Introduction to the Gymnasium Library; Neural Network Policies; Evaluating Actions: The Credit Assignment Problem; Solving the CartPole Using Policy Gradients; Value-Based Methods; Markov Decision Processes; Temporal Difference Learning; Q-Learning; Exploration Policies; Approximate Q-Learning and Deep Q-Learning; Implementing Deep Q-Learning; DQN Improvements; Actor-Critic Algorithms; Mastering Atari Breakout Using the Stable-Baselines3 PPO Implementation; Overview of Some Popular RL Algorithms; Exercises; Thank You!
A. Autodiff
Manual Differentiation; Finite Difference Approximation; Forward-Mode Autodiff; Reverse-Mode Autodiff
B. Mixed Precision and Quantization
Common Number Representations; Reduced Precision Models; Mixed-Precision Training; Quantization; Linear Quantization; Post-Training Quantization Using torch.ao.quantization; Quantization-Aware Training (QAT); Quantizing LLMs Using the bitsandbytes Library; Using Pre-Quantized Models
Index
About the Author

Content preview from Hands-On Machine Learning with Scikit-Learn and PyTorch

Chapter 17. Speeding Up Transformers

In Chapters 15 and 16, we built all kinds of transformers, from classifiers, translators, and chatbots to vision and multimodal transformers. While transformers are incredibly versatile and powerful, they are far from perfect. In particular, they can be very slow, especially when processing long input sequences.

Luckily, many techniques have been developed to speed up transformers of any size:

To speed up decoding in generative transformers, we will use key/value caching and speculative decoding, then we will take a quick look at several approaches to parallelize text generation.
To accelerate multi-head attention (MHA), which is one of the most computationally expensive components of transformers, we will look at sparse attention, approximate attention, sharing projections, and FlashAttention.
To speed up gigantic transformers of up to trillions of parameters, we will discuss mixture of experts (MoE).
To train large transformers efficiently, we will discuss parameter-efficient fine-tuning (PEFT) using adapters such as Low-Rank Adaptation (LoRA), activation checkpointing, sequence packing, gradient accumulation, and parallelism.
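
To make the first technique in the list above more concrete, here is a minimal sketch of key/value caching for a single attention head in PyTorch. It is not the book's implementation: the toy dimension `d`, the random projection matrices, and the helper names `causal_attention` and `cached_step` are all assumptions made for illustration. The idea is that, during autoregressive decoding, the keys and values of past tokens never change, so each step only needs to project the newest token and attend over the cached tensors.

```python
import torch

torch.manual_seed(0)
d = 8  # toy embedding size (hypothetical value, chosen for illustration)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))  # random stand-in weights

def causal_attention(x):
    """Reference: causal self-attention recomputed over the whole sequence (T, d)."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / d ** 0.5
    # Mask out future positions so token t only attends to tokens 0..t
    mask = torch.triu(torch.ones(len(x), len(x), dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def cached_step(x_t, cache):
    """Process only the newest token, reusing the cached keys and values."""
    k_t, v_t = x_t @ W_k, x_t @ W_v
    cache["k"] = torch.cat([cache["k"], k_t[None]])  # append the new key
    cache["v"] = torch.cat([cache["v"], v_t[None]])  # append the new value
    q_t = x_t @ W_q
    weights = torch.softmax(q_t @ cache["k"].T / d ** 0.5, dim=-1)
    return weights @ cache["v"]

x = torch.randn(5, d)  # a toy sequence of 5 token embeddings
cache = {"k": torch.empty(0, d), "v": torch.empty(0, d)}
cached_out = torch.stack([cached_step(x_t, cache) for x_t in x])
full_out = causal_attention(x)
```

Both paths should produce the same outputs up to floating-point error, but the cached path computes one query per step instead of reprocessing the whole prefix, at the cost of keeping the keys and values in memory.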

Tip

Another way to speed up a transformer is to make it smaller. This can be done using reduced precision and quantization, which are discussed in Appendix B.
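
As a rough illustration of why reduced precision makes a model smaller, the sketch below compares the parameter memory of a single linear layer stored in 32-bit floats and in `bfloat16`. The layer size and the helper `param_bytes` are assumptions chosen for this example, not code from the book, and quantization proper involves more than a dtype cast.

```python
import torch
from torch import nn

def param_bytes(module):
    """Total bytes used by a module's parameters (hypothetical helper)."""
    return sum(p.numel() * p.element_size() for p in module.parameters())

layer = nn.Linear(1024, 1024)  # arbitrary layer size for illustration
fp32_bytes = param_bytes(layer)                      # 4 bytes per parameter
bf16_bytes = param_bytes(layer.to(torch.bfloat16))   # 2 bytes per parameter
```

Halving the bytes per parameter halves the memory the weights occupy, which also reduces the memory bandwidth needed to read them.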

That’s quite a lot of techniques to cover, and they are fairly advanced, so you can safely skip this chapter for now if you are new to transformers, ...

Publisher Resources

ISBN: 9798341607972
