Chapter 15. Transformers for Natural Language Processing and Chatbots
In a landmark 2017 paper titled “Attention Is All You Need”,1 a team of Google researchers proposed a novel neural net architecture named the Transformer, which significantly improved the state of the art in neural machine translation (NMT). In short, the Transformer architecture is simply an encoder-decoder model, very much like the one we built in Chapter 14 for English-to-Spanish translation, and it can be used in exactly the same way (see Figure 15-1):
1. The source text goes into the encoder, which outputs contextualized embeddings (one per token).
2. The encoder’s output is then fed to the decoder, along with the translated text so far (starting with a start-of-sequence token).
3. The decoder predicts the next token for each input token.
4. The last token output by the decoder is appended to the translation.
5. Steps 2 to 4 are repeated again and again to produce the full translation, one extra token at a time, until an end-of-sequence token is generated (see the code sketch after Figure 15-1).

During training, we already have the full translation (it is the target), so it is fed to the decoder in step 2, starting with a start-of-sequence token, and steps 4 and 5 are not needed.
Figure 15-1. Using the Transformer model for English-to-Spanish translation
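To make this inference loop concrete, here is a minimal sketch of greedy decoding in Keras. It assumes the same kind of setup as the Chapter 14 model: a trained `model` that takes a pair of string arrays (the English sentences and the Spanish translations so far) and outputs one probability distribution per decoder position, a target-language TextVectorization layer named `text_vec_layer_es`, and "startofseq"/"endofseq" as the start- and end-of-sequence tokens. All of these names are placeholders for whatever your own model uses.

```python
import numpy as np

def translate(model, text_vec_layer_es, sentence_en, max_length=50):
    """Translate one English sentence with greedy decoding, one token at a time."""
    vocab_es = text_vec_layer_es.get_vocabulary()        # id-to-word lookup table
    translation = ""
    for word_idx in range(max_length):
        X = np.array([sentence_en])                      # step 1: encoder input
        X_dec = np.array(["startofseq " + translation])  # step 2: decoder input so far
        # steps 2-3: run the model and keep the prediction for the last position
        y_proba = model.predict((X, X_dec), verbose=0)[0, word_idx]
        predicted_word_id = np.argmax(y_proba)           # greedy choice of next token
        predicted_word = vocab_es[predicted_word_id]
        if predicted_word == "endofseq":                 # stop at end-of-sequence
            break
        translation += " " + predicted_word              # step 4: append the token
    return translation.strip()
```

Each pass through the loop runs steps 2 to 4: the partial translation is fed back into the decoder, the most likely next token is picked, and it is appended to the translation until the end-of-sequence token appears.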
So what’s new? Well, inside the black box, there are some important differences from our previous ...