Deep Learning for Biology

by Charles Ravarani, Natasha Latysheva

July 2025

Intermediate to advanced

436 pages

11h 17m

English

O'Reilly Media, Inc.

Preface
Who This Book Is For; Why We Wrote This Book; Prerequisites; Why We Focus on Molecular Biology; Why All the Hands-on Programming Projects; Notebook Availability; O’Reilly Online Learning Platform; Quick Tour of the Book; Chapter Summaries; Conventions Used in This Book; Using Code Examples; O’Reilly Online Learning; How to Contact Us; Acknowledgments
1. Introduction
Getting Started; Deciding What Your Model Will Replace; Determining Your Criteria for Success; Invest Heavily in Evaluations; Designing Baselines; Time-Boxing Your Project; Deciding Whether You Really Need Deep Learning; Ensuring That You Have Enough Good Data; Assembling a Team; You Don’t Need a Supercomputer or a PhD; Technical Introduction; Why JAX and Flax?; Python Tips; Anatomy of a Training Loop with JAX/Flax; Machine Learning Tips; Types of Tasks; Types of Architectures; Dataset Splits; Hyperparameters; Activation Functions; Optimizers; Initialization Strategy; Model Checkpointing; Early Stopping; Selecting a Working Environment; Selecting an Interactive Notebook; Structuring Your Code for Reuse and Debugging; Setting Up a GPU Development Environment; Version Conflicts; A Living Document
2. Learning the Language of Proteins
Biology Primer; Protein Structure; Protein Function; Predicting Protein Function; Machine Learning Primer; Large Language Models; Embeddings; Pretraining and Fine-tuning; Representations of Proteins and Protein LMs; Numerical Representation of a Protein; One-Hot Encoding of a Protein Sequence; Learned Embeddings of Amino Acids; The ESM2 Protein Language Model; Strategies for Extracting an Embedding for an Entire Protein; Extracellular Versus Membrane Protein Embeddings; Preparing the Data; Loading the CAFA3 Data; Splitting the Dataset into Subsets; Converting Protein Sequences into Their Mean Embeddings; Training the Model; Defining the Training Loop; Examining the Model Predictions; Evaluating Model Usefulness; Conducting a Final Check on the Test Set; Improvements and Extensions; Biological and Analytical Exploration; Machine Learning Improvements; Summary
3. Learning the Logic of DNA
Biology Primer; What Exactly Is DNA?; Coding and Noncoding Regions; How Transcription Factors Orchestrate Gene Activity; Measuring Where Transcription Factors Bind; Machine Learning Primer; Convolutional Neural Networks; Convolutions for DNA Sequences; Transformers; Attention; Query, Key, and Value Intuition; Multiheaded Attention; Representing Positional Information; Model Interpretation; In Silico Saturation Mutagenesis; Input Gradients; Building a Simple Prototype; Building a Dataset; Defining a Simple Convolutional Model; Increasing Complexity; Conducting In Silico Mutagenesis; Modeling Multiple Transcription Factors; Advanced Techniques; Adding Self-attention and Transformer Blocks; Defining Various Model Architectures; Sweeping Over the Different Models; Evaluating on the Test Split; Extensions and Improvements; Summary
4. Understanding Drug–Drug Interactions Using Graphs
Biology Primer; Beneficial Drug–Drug Interactions; Harmful Drug–Drug Interactions; DrugBank; Machine Learning Primer; Representing Graph Structures; Graph Neural Networks; Graph Embeddings and Message Passing; Cold-Start Problem; GraphSAGE; Selecting a Dataset; Describing the Dataset; Exploring the Dataset; Examining Drug Names; Visualizing Graphs; Building a Dataset; Creating a Dataset Builder; Download the Raw Dataset; Prepare the Annotation; Prepare the Graph; Prepare the Pairs; Subsetting the Graph; The Dataset Class; Building a Prototype; Node Encoder; Graph Convolution; Link Prediction; Drug–Drug Interaction Model; Training the Model; Create a Manageable Dataset; Create the Training Loop; Create the Pairs Class; Create the Train Step Function; Create the Evaluation Metric; Train the Simplest Model; Improving the Model; Change to AUC Loss; Set Model Sweeping and Training Parameters; Train on a Larger Dataset; Extensions; Summary
5. Detecting Skin Cancer in Medical Images
Biology Primer; Skin Cancer; Causes and Risk Factors; How Skin Cancer Is Diagnosed; Image-Based Skin Cancer Detection; Machine Learning Primer; Convolutional Neural Networks; Understanding a Convolution; Understanding Dimensions; Pooling; Other Components of a CNN; ResNets; Exploring the Data; A First Glimpse; Previewing the Images; Addressing Dataset Issues; Building a DatasetBuilder; Readying the Dataset; Building Skin Cancer Classification Models; Loading the Flax ResNet50 Model; Extracting the ResNet Backbone; Building the SkinLesionClassifierHead; Building Our Models; Training the Models; The Training Loop; Creating the Multiclass Dataset; Training the Baseline Model; Training the ResNetFromScratch Model; Training the FinetunedHeadResNet Model; Training the FinetunedResNet Model; Optimizing the FinetunedResNet Model; Training the Optimized FinetunedResNet Model; Further Improving the Model; Summary
6. Learning Spatial Organization Patterns Within Cells
Biology Primer; Spatial Organization Within the Cell; Protein Localization; Understanding Protein Localization; Machine Learning Primer; Autoencoders (AEs); Variational Autoencoders (VAEs); Vector-Quantized Variational Autoencoders (VQ-VAEs); Dissecting a VQ-VAE Diagram; Training a VQ-VAE; Constructing the Dataset; Data Requirements; Sourcing the Data; Getting a Glimpse of the Dataset; Implementing a DatasetBuilder Class; Building a Prototype Model; Defining the LocalizationModel; The Encoder: Processing Input Images; The VectorQuantizer: Discretizing the Embeddings; Decoder: Decoding the Discretized Embeddings Back to Images; ClassificationHead: A Simple but Crucial Module; Setting Up Model Training; Training with a Small Image Set; Inspecting Image Reconstruction; Examining Evaluation Metrics Over Epochs; Training a Model Without a Classification Task; Understanding the Model; Understanding Localization Clustering; Inspecting Feature Spectrums; Improving the Model; Scaling Up the Data; Going Further; Summary
7. Tips and Tricks for Deep Learning in Biology
Simplify; Simplify Your Model; Simplify and Control Your Environment; Simplify the Data and Problem; Overfit to a Single Batch of Data; Go Back to Basics; Log Everything; Ask for Help; Common Data Issues; Data Leakage; Incorrect Data Labels; Imbalanced Classes; Distribution Shifts; Biology-Specific Gotchas; Common Model Issues; Overfitting and Poor Generalization; Vanishing or Exploding Gradients; Training Instability; Poor Model Performance; How Well Should You Do?; Addressing Poor Model Performance; Final Thoughts
Index
About the Authors

Content preview from Deep Learning for Biology

Chapter 6. Learning Spatial Organization Patterns Within Cells

In this chapter, we shift focus from classifying high-level cell states—such as distinguishing cancerous from healthy tissue—to something more low level and foundational: understanding the spatial organization inside individual cells. Specifically, we’ll train a deep learning model to analyze microscopy images and learn where exactly in the cell different proteins are located, a task known as protein localization.

Protein localization plays a crucial role in cell biology. A protein’s position within the cell—for example, whether it’s in the nucleus or the mitochondria—often determines its function. Mislocalization of proteins is implicated in many diseases, even when the protein’s structure is normal (i.e., not mutated or altered). Thanks to modern fluorescence microscopy, we can observe a protein’s location in a cell directly, but the resulting images are often high dimensional, noisy, and hard to interpret at scale.

Unlike earlier chapters, the goal here isn’t to strictly optimize a metric like accuracy, recall, or precision on a specific classification or regression task. Instead, we’ll train a model to learn a latent representation of protein localization directly from raw microscopy images. You can think of a latent space as the model’s internal map—a compressed representation where proteins with similar localization patterns are grouped together, even without explicit labels. This approach falls under representation ...
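To make the "internal map" idea concrete, here is a minimal sketch in plain Python of how proximity in a latent space can be measured. It is not the chapter's model: the three-dimensional vectors, the protein names, and the helper functions `cosine_similarity` and `nearest_neighbor` are all hypothetical, chosen only to illustrate that proteins with similar localization patterns would be expected to end up close together.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical latent vectors (made-up values, not model output).
# A trained encoder would produce vectors like these from microscopy images.
latents = {
    "nuclear_protein_1": [0.9, 0.1, 0.0],
    "nuclear_protein_2": [0.8, 0.2, 0.1],
    "mitochondrial_protein_1": [0.1, 0.9, 0.2],
}


def nearest_neighbor(name, latents):
    """Return the name of the other protein whose latent vector is most similar."""
    query = latents[name]
    others = {k: v for k, v in latents.items() if k != name}
    return max(others, key=lambda k: cosine_similarity(query, others[k]))


print(nearest_neighbor("nuclear_protein_1", latents))
```

In this toy setup the two nuclear proteins are each other's nearest neighbors, even though no localization label was used in the comparison. Cosine similarity is only one possible choice of distance; nothing in the preview states which measure the chapter itself uses.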

Publisher Resources

ISBN: 9781098168025
