Chapter 3. Learning the Logic of DNA
In this chapter, we’ll build a deep learning model to predict whether a DNA sequence is bound by a class of proteins called transcription factors (TFs). Transcription factors play a central role in gene regulation: they bind to specific DNA sequences and influence whether nearby genes are turned on or off. By recognizing these sequence patterns, we can begin to decode the regulatory logic embedded in the genome.
Unlike the previous chapter, where we used an off-the-shelf protein model from Hugging Face, here we’ll define and train our own models from scratch. This gives us more control and helps us better understand how deep learning works on biological data. We’ll explore both convolutional and transformer-based architectures and introduce interpretation techniques to help us understand how our models make predictions.
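To make the idea of a model recognizing sequence patterns a little more concrete, here is a minimal sketch of the core operation inside a 1D convolutional layer: one-hot encode a DNA string, then slide a small filter along it and score each position. The names `one_hot`, `scan`, and `toy_filter`, the pattern `TGCA`, and the example sequence are all illustrative assumptions, not the chapter's code or a real CTCF motif. In a trained network the filter weights are learned from data rather than set by hand.

```python
# Sketch only: hand-built filter, pure Python, no deep learning library.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 4 values in (A, C, G, T) order.
    Any other character (for example N) becomes an all-zero row."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(seq, motif_filter):
    """Slide the filter along the sequence and return one score per position.
    Each score is the elementwise product of filter and window, summed."""
    x = one_hot(seq)
    width = len(motif_filter)
    scores = []
    for start in range(len(x) - width + 1):
        window = x[start:start + width]
        score = sum(
            w * v
            for filt_row, x_row in zip(motif_filter, window)
            for w, v in zip(filt_row, x_row)
        )
        scores.append(score)
    return scores

# Hypothetical filter that responds most strongly to the toy pattern "TGCA".
toy_filter = one_hot("TGCA")

scores = scan("ACTGCACC", toy_filter)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores)  # the highest score marks where the pattern occurs
print(best)    # index of the best-matching window
```

The score peaks at the window that matches the pattern exactly, which is the intuition behind using convolutional filters as motif detectors.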
We will tackle this problem in stages, gradually increasing the complexity:
1. Start simple
   First, we’ll train a basic convolutional network to predict whether a DNA sequence binds a single transcription factor called CTCF. Its binding behavior is relatively easy to predict, making it a great first target. We’ll build the full pipeline: loading data, training the model, and checking whether it captures meaningful biological signals.
2. Increase complexity
   Next, we’ll scale up to predicting whether a sequence binds any of 10 different TFs. We’ll introduce regularization and normalization, improve our evaluation ...