Sentiment analysis
Open domain Q&A
Toxic comments classification
Text classification with TF-IDF and logistic regression
    Importing the necessary libraries
    Defining the target labels and loading the dataset
    Word-level TF-IDF vectorization
    Character-level TF-IDF vectorization
    Combining word and character features
    Training the logistic regression model and cross-validation
    Calculating the total cross-validation score
    Saving the predictions to a CSV file
Text preprocessing and cleanup
    Importing libraries, loading files, and setting up global variables
    Loading pretrained dictionaries
    Setting preprocessing parameters
    Defining contraction patterns
    Splitting toxic words
    Tokenizing with TweetTokenizer
    URL replacement
    Normalizing by dictionary
    Loading a spaCy model
    The main normalization function
    Reading and normalizing data
    Saving the processed data
Text classification with RNNs
    Imports and environment setup
    Loading preprocessed data
    Loading embeddings
    Splitting the datasets
    Building the Keras model
    Training and averaging multiple seeds
    Creating a submission file
Text classification with DistilBERT
    Setting up the environment and dependencies
    Loading and preparing the training data
    Creating a custom Dataset class for multi-label classification
    Splitting the data into training and validation sets
    Initializing the tokenizer and creating data loaders
    Defining the model architecture
    Preparing the model and optimizer for training
    Training loop
    Preparing and processing the test data
    Inference on the test data
    Formatting and saving the predictions
Text classification with AutoTrain
    Setting up the environment and dependencies
    Setting up AutoTrain parameters
    Initializing and creating the AutoTrain project
    Loading a pretrained model and tokenizer
    Preparing the test data for inference
    Creating a custom Dataset class
    Running predictions with the trainer
Text classification with LLM embeddings and logistic regression
    OpenAI embeddings
        Initializing the OpenAI client
        Defining a helper function for embeddings
        Loading and cleaning the data
        Handling specific data anomalies
        Generating embeddings for the data
        Converting embeddings to NumPy arrays
        Saving the embeddings for later use
    NVIDIA embeddings
        Defining a function to get embeddings
    Setting the stage: Data and dependencies
    Cross-validation and training
        Iterating over each target
        Making predictions and recording performance
    Preparing the submission
Text augmentation strategies
    Basic techniques
    Text augmentation with back-and-forth translation
    nlpaug
Summary
Join our book’s Discord space