Chapter 4. Machine Learning for Molecules

This chapter covers the basics of performing machine learning on molecular data. Before we dive in, it helps to briefly discuss why molecular machine learning is a fruitful subject of study. Much of modern materials science and chemistry is driven by the need to design new molecules that have desired properties. While significant scientific work has gone into new design strategies, a good deal of trial and error is often still needed to construct interesting molecules. The dream of molecular machine learning is to replace such random experimentation with guided search, where machine-learned predictors propose which new molecules might have desired properties. Accurate predictors of this kind could enable the creation of radically new materials and chemicals with useful properties.

This dream is compelling, but how can we get started on this path? The first step is to construct technical methods for transforming molecules into vectors of numbers that can then be passed to learning algorithms. Such methods are called molecular featurizations. We will cover a number of them in this chapter, and more in the next chapter. Molecules are complex entities, and researchers have developed a host of different techniques for featurizing them. These representations include chemical descriptor vectors, 2D graph representations, 3D electrostatic grid representations, orbital basis function representations, and more.

Featurizing a molecule is only the first step; a model must still learn from the resulting features. We will review some algorithms for learning functions on molecules, including simple fully connected networks as well as more sophisticated techniques like graph convolutions. We’ll also describe some of the limitations of graph convolutional techniques, and what we should and should not expect from them. We’ll end the chapter with a molecular machine learning case study on an interesting dataset.

What Is a Molecule?

Before we dive into molecular machine learning in depth, it will be useful to review what exactly a molecule is. This question sounds a little silly, since molecules like H2O and CO2 are introduced to even young children. Isn’t the answer obvious? The fact is, though, that for the vast majority of human existence, we had no idea that molecules existed at all. Consider a thought experiment: how would you convince a skeptical alien that entities called molecules exist? The answer turns out to be quite sophisticated. You might, for example, need to break out a mass spectrometer!

Mass Spectrometry

Identifying the molecules that are present in a given sample can be quite challenging. The most popular technique at present relies on mass spectrometry (often loosely called mass spectroscopy). The basic idea is to bombard a sample with electrons. This bombardment shatters the molecules into fragments. These fragments typically ionize; that is, they pick up or lose electrons to become charged. The charged fragments are propelled by an electric field, which separates them based on their mass-to-charge ratio. The spread of detected charged fragments is called the spectrum. Figure 4-1 illustrates this process. From the collection of detected fragments, it is often possible to identify the precise molecules that were in the original sample. However, this process is still lossy and difficult. A number of researchers are actively developing deep learning techniques that ease the identification of the original molecules from the detected spectrum.

Note the complexity of performing this detection! Molecules are complicated entities that are tricky to pin down precisely.

For the sake of getting started, let’s presume a definition of a molecule as a group of atoms joined together by physical forces. Molecules are the smallest fundamental unit of a chemical compound that can take part in a chemical reaction. Atoms in a molecule are connected with one another by chemical bonds, which hold them together and restrict their motion relative to each other. Molecules come in a huge range of sizes, from just a few atoms up to many thousands of atoms. Figure 4-2 provides a simple depiction of a molecule in this model.

Figure 4-1. A simple schematic of a mass spectrometer. (Source: Wikimedia.)
Figure 4-2. A simple representation of a caffeine molecule as a “ball-and-stick” diagram. Atoms are represented as colored balls (black is carbon, red is oxygen, blue is nitrogen, white is hydrogen) joined by sticks which represent chemical bonds.

With this basic description in hand, we’ll spend the next couple of sections diving into more detail about various aspects of molecular chemistry. It’s not critical that you get all of these concepts on your first reading of this chapter, but it can be useful to have some basic knowledge of the chemical landscape at hand.

Molecules Are Dynamic, Quantum Entities

We’ve just provided a simplistic description of molecules in terms of atoms and bonds. It’s very important to keep in the back of your mind that there’s a lot more going on within any molecule. For one, molecules are dynamic entities, so all the atoms within a given molecule are in rapid motion with respect to one another. The bonds themselves stretch and compress, oscillating rapidly in length. It’s quite common for atoms to rapidly break off from and rejoin molecules. We’ll see a bit more about the dynamic nature of molecules shortly, when we discuss molecular conformations.

Even more strangely, molecules are quantum. There are a lot of layers to saying that an entity is quantum, but as a simple description, it’s important to note that “atoms” and “bonds” are much less well defined than a simple ball-and-stick diagram might imply. There’s a lot of fuzziness in the definitions here. It’s not important that you grasp these complexities at this stage, but remember that our depictions of molecules are very approximate. This can have practical relevance, since some learning tasks may require describing molecules with different depictions than others.

What Are Molecular Bonds?

It may have been a while since you studied basic chemistry, so we will spend time reviewing basic chemical concepts here and there. The most basic question is, what is a chemical bond?

The molecules that make up everyday life are made of atoms, often very large numbers of them. These atoms are joined together by chemical bonds. These bonds essentially “glue” together atoms by their shared electrons. There are many different types of molecular bonds, including covalent bonds and several types of noncovalent bonds.

Covalent bonds

Covalent bonds involve sharing electrons between two atoms, such that the same electrons spend time around both atoms (Figure 4-3). In general, covalent bonds are the strongest type of chemical bond. They are formed and broken in chemical reactions. Covalent bonds tend to be very stable: once they form, it takes a lot of energy to break them, so the atoms can remain bonded for a very long time. This is why molecules behave as distinct objects rather than loose collections of unrelated atoms. In fact, covalent bonds are what define molecules: a molecule is a set of atoms joined by covalent bonds.

Figure 4-3. Left: two atomic nuclei, each surrounded by a cloud of electrons. Right: as the atoms come close together, the electrons start spending more time in the space between the nuclei. This attracts the nuclei together, forming a covalent bond between the atoms.

Noncovalent bonds

Noncovalent bonds don’t involve the direct sharing of electrons between atoms, but they do involve weaker electromagnetic interactions. Since they are not as strong as covalent bonds, they are more ephemeral, constantly breaking and reforming. Noncovalent bonds do not “define” molecules in the same sense that covalent bonds do, but they have a huge effect on determining the shapes molecules take on and the ways different molecules associate with each other.

“Noncovalent bonds” is a generic term covering several different types of interactions. Some examples of noncovalent bonds include hydrogen bonds (Figure 4-4), salt bridges, pi-stacking, and more. These types of interactions often play crucial roles in drug design, since most drugs interact with biological molecules in the human body through noncovalent interactions.

Figure 4-4. Water molecules have strong hydrogen bonding interactions between hydrogen and oxygen on adjacent molecules. A strong network of hydrogen bonds contributes in part to water’s power as a solvent. (Source: Wikimedia.)

We’ll run into each of these types of bonds at various points in the book. In this chapter, we will mostly deal with covalent bonds, but noncovalent interactions will become much more crucial when we start studying some biophysical deep models. 

Molecular Graphs

A graph is a mathematical data structure made up of nodes connected together by edges (Figure 4-5). Graphs are incredibly useful abstractions in computer science. In fact, there is a whole branch of mathematics called graph theory dedicated to understanding the properties of graphs and finding ways to manipulate and analyze them. Graphs are used to describe everything from the computers that make up a network, to the pixels that make up an image, to actors who have appeared in movies with Kevin Bacon.

Figure 4-5. An example of a mathematical graph with six nodes connected by edges. (Source: Wikimedia.)

Importantly, molecules can be viewed as graphs as well (Figure 4-6). In this description, the atoms are the nodes in the graph, and the chemical bonds are the edges. Any molecule can be converted into a corresponding molecular graph.

Figure 4-6. An example of converting a benzene molecule into a molecular graph. Note that atoms are converted into nodes and chemical bonds into edges.
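
As a minimal sketch of this idea, the benzene ring from Figure 4-6 can be written down as a list of atoms (nodes) plus a list of bonds (edges). The variable names are illustrative, hydrogens are left out, and bond orders are ignored; this is not the representation DeepChem or RDKit uses internally.

```python
from collections import defaultdict

# Nodes: the six carbon atoms of benzene (hydrogens omitted).
atoms = ["C", "C", "C", "C", "C", "C"]

# Edges: each carbon is bonded to the next one around the ring.
bonds = [(i, (i + 1) % len(atoms)) for i in range(len(atoms))]

# Adjacency list: for every atom, the set of atoms it is bonded to.
adjacency = defaultdict(set)
for a, b in bonds:
    adjacency[a].add(b)
    adjacency[b].add(a)

# Degree of each node: how many ring bonds each carbon takes part in.
degrees = [len(adjacency[i]) for i in range(len(atoms))]

print(bonds)
print(degrees)
```

Every carbon ends up with exactly two ring neighbors, which is the graph-level signature of a six-membered ring.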

In the remainder of this chapter, we will repeatedly convert molecules into graphs in order to analyze them and learn to make predictions about them.

Molecular Conformations

A molecular graph describes the set of atoms in a molecule and how they are bonded together. But there is another very important thing we need to know: how the atoms are positioned relative to each other in 3D space. This is called the molecule’s conformation.

Atoms, bonds, and conformation are related to each other. If two atoms are covalently bonded, that tends to fix the distance between them, strongly restricting the possible conformations. The angles formed by sets of three or four bonded atoms are also often restricted. Sometimes there will be whole clusters of atoms that are completely rigid, all moving together as a single unit. But other pieces of molecules are flexible, allowing atoms to move relative to each other. For example, many (but not all) covalent bonds allow the groups of atoms they connect to freely rotate around the axis of the bond. This lets the molecule take on many different conformations.

Figure 4-7 shows a very popular molecule: sucrose, also known as table sugar. It is shown both as a 2D chemical structure and as a 3D conformation. Sucrose consists of two rings linked together. Each of the rings is fairly rigid, so its shape changes very little over time. But the linker connecting them is much more flexible, allowing the rings to move relative to each other.

Figure 4-7. Sucrose, represented as a 3D conformation and a 2D chemical structure. (Adapted from Wikimedia and Wikipedia images.)

As molecules get larger, the number of feasible conformations they can take grows enormously. For large macromolecules such as proteins (Figure 4-8), computationally exploring the set of possible conformations currently requires very expensive simulations.

Figure 4-8. A conformation of bacteriorhodopsin (used to capture light energy) rendered in 3D. Protein conformations are particularly complex, with multiple 3D geometric motifs, and serve as a good reminder that molecules have geometry in addition to their chemical formulas. (Source: Wikimedia.)

Chirality of Molecules

Some molecules (including many drugs) come in two forms that are mirror images of each other. This is called chirality. A chiral molecule has both a “right-handed” form (also known as the “R” form) and a “left-handed” form (also known as the “S” form), as illustrated in Figure 4-9.

Figure 4-9. Axial chirality of a spiro compound (a compound made up of two or more rings joined together). Note that the two chiral variants are respectively denoted as “R” and “S.” This convention is widespread in the chemistry literature.

Chirality is very important, and also a source of much frustration both for laboratory chemists and computational chemists. To begin with, the chemical reactions that produce chiral molecules often don’t distinguish between the forms, producing both chiralities in equal amounts. (These products are called racemic mixtures.) So if you want to end up with just one form, your manufacturing process immediately becomes more complicated. In addition, many physical properties are identical for both chiralities, so many experiments can’t distinguish between chiral versions of a given molecule. The same is true of computational models. For example, both chiralities have identical molecular graphs, so any machine learning model that depends only on the molecular graph will be unable to distinguish between them.
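
The point about identical molecular graphs can be made concrete with a small sketch. It uses a hypothetical molecule (a carbon with four different substituents at idealized tetrahedral positions, with made-up coordinates) and builds its mirror image by flipping one axis. The atoms and bonds are the same for both forms, but the signed volume spanned by the four substituents changes sign. The sketch does not attempt to assign the "R" or "S" label, which requires the chemical priority rules.

```python
def signed_volume(p0, p1, p2, p3):
    """Scalar triple product (p1 - p0) . ((p2 - p0) x (p3 - p0))."""
    a = [p1[i] - p0[i] for i in range(3)]
    b = [p2[i] - p0[i] for i in range(3)]
    c = [p3[i] - p0[i] for i in range(3)]
    cross = (b[1] * c[2] - b[2] * c[1],
             b[2] * c[0] - b[0] * c[2],
             b[0] * c[1] - b[1] * c[0])
    return sum(a[i] * cross[i] for i in range(3))

# Hypothetical molecule: central carbon (atom 0) bonded to four different atoms.
atoms = ["C", "F", "Cl", "Br", "H"]
bonds = {(0, 1), (0, 2), (0, 3), (0, 4)}

# Idealized (assumed) positions of the four substituents around the carbon.
coords = {1: (1, 1, 1), 2: (1, -1, -1), 3: (-1, 1, -1), 4: (-1, -1, 1)}
# The mirror image: reflect every position through the plane x = 0.
mirror = {k: (-x, y, z) for k, (x, y, z) in coords.items()}

# Both forms have exactly the same atoms and bonds, so the same graph.
graph_a = (atoms, bonds)
graph_b = (atoms, bonds)

vol_a = signed_volume(*(coords[i] for i in (1, 2, 3, 4)))
vol_b = signed_volume(*(mirror[i] for i in (1, 2, 3, 4)))
print(graph_a == graph_b, vol_a, vol_b)
```

A model that sees only `graph_a` and `graph_b` receives identical inputs, while a model with access to 3D coordinates could in principle tell the two forms apart.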

This wouldn’t matter so much if the two forms behaved identically in practice, but that often is not the case. It is possible for the two chiral forms of a drug to bind to totally different proteins, and to have very different effects in your body. In many cases, only one form of a drug has the desired therapeutic effect. The other form just produces extra side effects without having any benefit.

One specific example of the differing effects of chiral compounds is the drug thalidomide, which was prescribed as a sedative in the 1950s and 1960s. This drug was subsequently available over the counter as a treatment for nausea and morning sickness associated with pregnancy. The R form of thalidomide is an effective sedative, while the S form is teratogenic and has been shown to cause severe birth defects. These difficulties are further compounded by the fact that thalidomide interconverts, or racemizes, between the two different forms in the body.

Featurizing a Molecule

With these descriptions of basic chemistry in hand, how do we get started with featurizing molecules? In order to perform machine learning on molecules, we need to transform them into feature vectors that can be used as inputs to models. In this section, we will discuss the DeepChem featurization submodule dc.feat, and explain how to use it to featurize molecules in a variety of fashions.

SMILES Strings and RDKit

SMILES is a popular method for specifying molecules with text strings. The acronym stands for “Simplified Molecular-Input Line-Entry System”, which is sufficiently awkward-sounding that someone must have worked hard to come up with it. A SMILES string describes the atoms and bonds of a molecule in a way that is both concise and reasonably intuitive to chemists. To nonchemists, these strings tend to look like meaningless patterns of random characters. For example, “OCCc1c(C)[n+](cs1)Cc2cnc(C)nc2N” describes the important nutrient thiamine, also known as vitamin B1.

DeepChem uses SMILES strings as its format for representing molecules inside datasets. There are some deep learning models that directly accept SMILES strings as their inputs, attempting to learn to identify meaningful features in the text representation. But much more often, we first convert the string into a different representation (or featurize it) better suited to the problem at hand.

DeepChem depends on another open source cheminformatics package, RDKit, to facilitate its handling of molecules. RDKit provides lots of features for working with SMILES strings. It plays a central role in converting the strings in datasets to molecular graphs and the other representations described below.

Extended-Connectivity Fingerprints

Chemical fingerprints are vectors of 1s and 0s that represent the presence or absence of specific features in a molecule. Extended-connectivity fingerprints (ECFPs) are a class of featurizations that combine several useful features. They take molecules of arbitrary size and convert them into fixed-length vectors. This is important because lots of models require their inputs to all have exactly the same size. ECFPs let you take molecules of many different sizes and use them all with the same model. ECFPs are also very easy to compare. You can simply take the fingerprints for two molecules and compare corresponding elements. The more elements that match, the more similar the molecules are. Finally, ECFPs are fast to compute.
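
The element-by-element comparison described above can be sketched in a few lines. The two short bit vectors below are made up for illustration (real fingerprints are much longer), and the function names are hypothetical. The second function computes the Tanimoto coefficient, a commonly used alternative that is not discussed in the text: it counts only the bits that are set, rather than every matching position.

```python
def fraction_matching(fp_a, fp_b):
    """Fraction of positions where the two fingerprints agree."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)

def tanimoto(fp_a, fp_b):
    """Bits set in both fingerprints divided by bits set in either one."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 1.0

# Made-up 8-bit fingerprints, for illustration only.
fp_a = [1, 0, 1, 1, 0, 0, 1, 0]
fp_b = [1, 0, 0, 1, 0, 1, 1, 0]

print(fraction_matching(fp_a, fp_b))
print(tanimoto(fp_a, fp_b))
```

Because most bits in a long fingerprint are 0 for most molecules, counting matching zeros tends to make everything look similar, which is one reason measures based on set bits are popular.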

Each element of the fingerprint vector indicates the presence or absence of a particular molecular feature, defined by some local arrangement of atoms. The algorithm begins by considering every atom independently and looking at a few properties of the atom: its element, the number of covalent bonds it forms, etc. Each unique combination of these properties is a feature, and the corresponding elements of the vector are set to 1 to indicate their presence. The algorithm then works outward, combining each atom with all the ones it is bonded to. This defines a new set of larger features, and the corresponding elements of the vector are set. The most common variant of this technique is the ECFP4 algorithm, which allows for sub-fragments to have a radius of two bonds around a central atom.

The RDKit library provides utilities for computing ECFP4 fingerprints for molecules. DeepChem provides convenient wrappers around these functions. The dc.feat.CircularFingerprint class inherits from Featurizer and provides a standard interface to featurize molecules:

import deepchem as dc
from rdkit import Chem

smiles = ['C1CCCCC1', 'O1CCOCC1']  # cyclohexane and dioxane
mols = [Chem.MolFromSmiles(smile) for smile in smiles]
# Hash each molecule's circular substructures into a 1024-element vector
feat = dc.feat.CircularFingerprint(size=1024)
arr = feat.featurize(mols)
# arr is a 2-by-1024 array containing the fingerprints for
# the two molecules

ECFPs do have one important disadvantage: the fingerprint encodes a large amount of information about the molecule, but some information does get lost. It is possible for two different molecules to have identical fingerprints, and given a fingerprint, it is impossible to uniquely determine what molecule it came from.

Molecular Descriptors

An alternative line of thought holds that it’s useful to describe molecules with a set of physicochemical descriptors. These usually correspond to various computed quantities that describe the molecule’s structure. These quantities, such as the log partition coefficient or the polar surface area, are often derived from classical physics or chemistry. The RDKit package computes many such physical descriptors on molecules. The DeepChem featurizer dc.feat.RDKitDescriptors() provides a simple way to perform the same computations:

feat = dc.feat.RDKitDescriptors()
arr = feat.featurize(mols)
# arr has one row per molecule and one column per descriptor
# (roughly 200; the exact count depends on the RDKit version)

This featurization is obviously more useful for some problems than others. It will tend to work best for predicting things that depend on relatively generic properties of the molecules. It is unlikely to work for predicting properties that depend on the detailed arrangement of atoms.

Graph Convolutions

The featurizations described in the preceding sections were designed by humans. An expert thought carefully about how to represent molecules in a way that could be used as input to machine learning models, then coded the representation by hand. Can we instead let the model figure out for itself the best way to represent molecules? That is what machine learning is all about, after all: instead of designing a featurization ourselves, we can try to learn one automatically from the data.

As an analogy, consider a convolutional neural network for image recognition. The input to the network is the raw image. It consists of a vector of numbers for each pixel, for example the three color components. This is a very simple, totally generic representation of the image. The first convolutional layer learns to recognize simple patterns such as vertical or horizontal lines. Its output is again a vector of numbers for each pixel, but now it is represented in a more abstract way. Each number represents the presence of some local geometric feature.

The network continues through a series of layers. Each one outputs a new representation of the image that is more abstract than the previous layer’s representation, and less closely connected to the raw color components. And these representations are automatically learned from the data, not designed by a human. No one tells the model what patterns to look for to identify whether the image contains a cat. The model figures that out by itself through training.

Graph convolutional networks take this same idea and apply it to graphs. Just as a regular CNN begins with a vector of numbers for each pixel, a graph convolutional network begins with a vector of numbers for each node and/or edge. When the graph represents a molecule, those numbers could be high-level chemical properties of each atom, such as its element, charge, and hybridization state. Just as a regular convolutional layer computes a new vector for each pixel based on a local region of its input, a graph convolutional layer computes a new vector for each node and/or edge. The output is computed by applying a learned convolutional kernel to each local region of the graph, where “local” is now defined in terms of edges between nodes. For example, it might compute an output vector for each atom based on the input vector for that same atom and any other atoms it is directly bonded to.
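
The following is a minimal sketch of a single graph convolutional step, under simplifying assumptions: each atom's local region is itself plus its directly bonded neighbors, the region is pooled by summation, and one shared weight matrix followed by a ReLU produces the new vector. The function name and the numbers are illustrative, and the weights are fixed by hand here, whereas a real network learns them. This is not how any particular DeepChem model is implemented.

```python
def graph_conv_layer(features, bonds, weights):
    """Compute a new feature vector for every atom from its local region."""
    n_atoms = len(features)
    neighbors = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    n_in, n_out = len(weights), len(weights[0])
    outputs = []
    for i in range(n_atoms):
        # Local region: the atom itself plus the atoms it is bonded to.
        region = [i] + neighbors[i]
        pooled = [sum(features[j][k] for j in region) for k in range(n_in)]
        # Apply the shared weight matrix, then a ReLU nonlinearity.
        out = [max(0.0, sum(pooled[k] * weights[k][m] for k in range(n_in)))
               for m in range(n_out)]
        outputs.append(out)
    return outputs

# A three-atom chain 0-1-2 with a two-number input vector per atom.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
bonds = [(0, 1), (1, 2)]
weights = [[1.0, -1.0], [0.5, 0.5]]  # stand-in for learned parameters

new_features = graph_conv_layer(features, bonds, weights)
print(new_features)
```

Stacking several such layers lets information travel farther along the bonds, in the same way that stacking image convolutions widens the region of pixels each output depends on.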

That is the general idea. When it comes to the details, many different variations have been proposed. Fortunately, DeepChem includes implementations of lots of those architectures, so you can try them out even without understanding all the details. Examples include graph convolutions (GraphConvModel), Weave models (WeaveModel), message passing neural networks (MPNNModel), deep tensor neural networks (DTNNModel), and more.

Graph convolutional networks are a powerful tool for analyzing molecules, but they have one important limitation: the calculation is based solely on the molecular graph. They receive no information about the molecule’s conformation, so they cannot hope to predict anything that is conformation-dependent. This makes them most suitable for small, mostly rigid molecules. In the next chapter we will discuss methods that are more appropriate for large, flexible molecules that can take on many conformations.

Training a Model to Predict Solubility

Let’s put all the pieces together and train a model on a real chemical dataset to predict an important molecular property. First we’ll load the data:

import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_delaney(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets

This dataset contains information about solubility, which is a measure of how easily a molecule dissolves in water. This property is vitally important for any chemical you hope to use as a drug. If it does not dissolve easily, getting enough of it into a patient’s bloodstream to have a therapeutic effect may be impossible. Medicinal chemists spend a lot of time modifying molecules to try to increase their solubility.

Notice that we specify the option featurizer='GraphConv'. We are going to use a graph convolutional model, and this tells MoleculeNet to transform the SMILES string for each molecule into the format required by the model.

Now let’s construct and train the model:

from deepchem.models import GraphConvModel
model = GraphConvModel(n_tasks=1, mode='regression', dropout=0.2)
model.fit(train_dataset, nb_epoch=100)

We specify that there is only one task—that is to say, one output value (the solubility)—for each sample. We also specify that this is a regression model, meaning that the labels are continuous numbers and the model should try to reproduce them as accurately as possible. That is in contrast to a classification model, which tries to predict which of a fixed set of classes each sample belongs to. To reduce overfitting, we specify a dropout rate of 0.2, meaning that 20% of the outputs from each convolutional layer will randomly be set to 0.
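
What a dropout rate of 0.2 means can be sketched in plain Python. The function below is illustrative rather than DeepChem's implementation. It also assumes the common "inverted dropout" convention, in which the surviving values are scaled up by 1 / (1 - rate) so that the expected total stays the same; the text does not say which convention is used.

```python
import random

def dropout(values, rate, rng):
    """Randomly zero a fraction `rate` of the values during training."""
    keep_scale = 1.0 / (1.0 - rate)  # assumed inverted-dropout rescaling
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]

rng = random.Random(0)          # fixed seed so the sketch is repeatable
layer_output = [1.0] * 10000    # stand-in for one layer's outputs
dropped = dropout(layer_output, rate=0.2, rng=rng)

zero_fraction = dropped.count(0.0) / len(dropped)
print(zero_fraction)
```

Dropout is only applied while training; when the model makes predictions, all outputs are kept.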

That’s all there is to it! Now we can evaluate the model and see how well it works. We will use the square of the Pearson correlation coefficient (DeepChem’s pearson_r2_score) as our evaluation metric:

metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(model.evaluate(train_dataset, [metric], transformers))
print(model.evaluate(test_dataset, [metric], transformers))

This reports a squared correlation coefficient of 0.91 for the training set, and 0.70 for the test set. The gap shows that the model is overfitting a little bit, but not too badly. And a test-set value of 0.70 is quite respectable. Our model is successfully predicting the solubilities of molecules based on their molecular structures!
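
The metric function named in the code, pearson_r2_score, reports the square of the Pearson correlation between measured and predicted values. As a sketch of what that number measures, here is the same quantity computed by hand on small made-up lists. The helper name is hypothetical and this is not DeepChem's code.

```python
def pearson_r2(y_true, y_pred):
    """Square of the Pearson correlation between two equal-length lists."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)

# Made-up values: predictions that get the ordering partly wrong.
print(pearson_r2([1, 2, 3], [1, 3, 2]))

# Made-up values: predictions that are all twice too large.
print(pearson_r2([1, 2, 3, 4], [2, 4, 6, 8]))
```

The second case scores a perfect 1.0 even though every prediction is off by a factor of two. The metric rewards predictions that rise and fall with the true values, not predictions that match them exactly, so it is worth checking an error measure as well when absolute accuracy matters.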

Now that we have the model, we can use it to predict the solubilities of new molecules. Suppose we are interested in the following five molecules, specified as SMILES strings:

smiles = ['COC(C)(C)CCCC(C)CC=CC(C)=CC(=O)OC(C)C',
          'CCOC(=O)CC',
          'CSc1nc(NC(C)C)nc(NC(C)C)n1',
          'CC(C#C)N(C)C(=O)Nc1ccc(Cl)cc1',
          'Cc1cc2ccccc2cc1C']

To use these as inputs to the model, we must first use RDKit to parse the SMILES strings, then use a DeepChem featurizer to convert them to the format expected by the graph convolution:

from rdkit import Chem
mols = [Chem.MolFromSmiles(s) for s in smiles]
featurizer = dc.feat.ConvMolFeaturizer()
x = featurizer.featurize(mols)

Now we can pass them to the model and ask it to predict their solubilities:

# If the loader normalized the labels (see transformers), these outputs are in normalized units
predicted_solubility = model.predict_on_batch(x)

MoleculeNet

We have now seen two datasets loaded from the molnet module: the Tox21 toxicity dataset in the previous chapter, and the Delaney solubility dataset in this chapter. MoleculeNet is a large collection of datasets useful for molecular machine learning. As shown in Figure 4-10, it contains data on many sorts of molecular properties. They range from low-level physical properties that can be calculated with quantum mechanics up to very high-level information about interactions with a human body, such as toxicity and side effects.

Figure 4-10. MoleculeNet hosts many different datasets from different molecular sciences. Scientists find it useful to predict quantum, physical chemistry, biophysical, and physiological characteristics of molecules.

When developing new machine learning methods, you can use MoleculeNet as a collection of standard benchmarks to test your method on. At http://moleculenet.ai you can view data on how well a collection of standard models performs on each of the datasets, giving insight into how your own method compares to established techniques.

SMARTS Strings

In many commonly used applications, such as word processing, we need to search for a particular text string. In cheminformatics, we encounter similar situations where we want to determine whether atoms in a molecule match a particular pattern. There are a number of use cases where this may arise:

  • Searching a database of molecules to identify molecules containing a particular substructure

  • Aligning a set of molecules on a common substructure to improve visualization

  • Highlighting a substructure in a plot

  • Constraining a substructure during a calculation

SMARTS is an extension of the SMILES language described previously that can be used to create queries. One can think of SMARTS patterns as similar to regular expressions used for searching text. For instance, when searching a filesystem, one can specify a query like “foo*.bar”, which will match foo.bar, foo3.bar, and foolish.bar. At the simplest level, any SMILES string can also be a SMARTS string. The SMILES string “CCC” is also a valid SMARTS string and will match sequences of three adjacent aliphatic carbon atoms. Let’s take a look at a code example showing how we can define molecules from SMILES strings, display those molecules, and highlight the atoms matching a SMARTS pattern.
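
The filesystem analogy can be tried directly with Python's standard fnmatch module, which implements this kind of wildcard matching. The name "bar.foo" is added here as an example that should not match.

```python
from fnmatch import fnmatchcase

pattern = "foo*.bar"
names = ["foo.bar", "foo3.bar", "foolish.bar", "bar.foo"]

# Keep only the names that fit the wildcard pattern.
matches = [name for name in names if fnmatchcase(name, pattern)]
print(matches)
```

A SMARTS query plays the role of `pattern`, and each molecule plays the role of a name, except that the matching happens on atoms and bonds rather than on characters.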

First, we will import the necessary libraries and create a list of molecules from a list of SMILES strings. Figure 4-11 shows the result:

from rdkit import Chem
from rdkit.Chem.Draw import MolsToGridImage

smiles_list = ["CCCCC", "CCOCC", "CCNCC", "CCSCC"]
mol_list = [Chem.MolFromSmiles(x) for x in smiles_list]
# Draw the four molecules in a single row
MolsToGridImage(mols=mol_list, molsPerRow=4)

Figure 4-11. Chemical structures generated from SMILES

Now we can see which SMILES strings match the SMARTS pattern “CCC” (Figure 4-12):

query = Chem.MolFromSmarts("CCC")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
                highlightAtomLists=match_list)

Figure 4-12. Molecules matching the SMARTS expression “CCC.”

There are a couple of things to note in this figure. The first is that the SMARTS expression only matches the first structure. The other structures do not contain three adjacent carbons. Note also that there are multiple ways that the SMARTS pattern could match the first molecule in this figure—it could match three adjacent carbon atoms by starting at the first, second, or third carbon atom. There are additional functions in RDKit that will return all possible SMARTS matches, but we won’t cover those now.

Additional wildcard characters can be used to match specific sets of atoms. As with text, the “*” character can be used to match any atom. The SMARTS pattern “C*C” will match an aliphatic carbon attached to any atom attached to another aliphatic carbon (see Figure 4-13).

query = Chem.MolFromSmarts("C*C")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
                highlightAtomLists=match_list)

Figure 4-13. Molecules matching the SMARTS expression “C*C”.

The SMARTS syntax can be extended to only allow specific sets of atoms. For instance, the string “C[C,N,O]C” will match carbon attached to carbon, nitrogen, or oxygen, attached to another carbon (Figure 4-14):

query = Chem.MolFromSmarts("C[C,N,O]C")
match_list = [mol.GetSubstructMatch(query) for mol in mol_list]
MolsToGridImage(mols=mol_list, molsPerRow=4,
                highlightAtomLists=match_list)

Figure 4-14. Molecules matching the SMARTS expression “C[C,N,O]C”.
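
To see why these three patterns match the molecules they do, it can help to look at a toy matcher written in plain Python. This is a sketch under strong assumptions, not how RDKit works: it handles only straight-chain patterns, treats each molecule as a simple chain of single-letter atoms, ignores bond types and aromaticity, and uses hypothetical names throughout. Each pattern position is a set of allowed elements, with None standing in for the "*" wildcard.

```python
def chain(symbols):
    """Build a straight-chain molecule: atoms plus bonds between neighbors."""
    return list(symbols), [(i, i + 1) for i in range(len(symbols) - 1)]

def find_linear_matches(elements, bonds, pattern):
    """Return every path of bonded atoms whose elements fit the pattern."""
    neighbors = {i: set() for i in range(len(elements))}
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)

    def fits(atom, slot):
        return slot is None or elements[atom] in slot

    matches = []

    def extend(path):
        if len(path) == len(pattern):
            matches.append(tuple(path))
            return
        for nxt in sorted(neighbors[path[-1]]):
            if nxt not in path and fits(nxt, pattern[len(path)]):
                extend(path + [nxt])

    for start in range(len(elements)):
        if fits(start, pattern[0]):
            extend([start])

    # The same atoms can be matched in either direction; keep one copy.
    unique = {}
    for m in matches:
        unique.setdefault(frozenset(m), m)
    return list(unique.values())

pentane, ether = chain("CCCCC"), chain("CCOCC")
amine, sulfide = chain("CCNCC"), chain("CCSCC")

ccc = [{"C"}, {"C"}, {"C"}]            # like "CCC"
c_any_c = [{"C"}, None, {"C"}]         # like "C*C"
c_cno_c = [{"C"}, {"C", "N", "O"}, {"C"}]  # like "C[C,N,O]C"

print(find_linear_matches(*pentane, ccc))
print(find_linear_matches(*ether, c_any_c))
print(find_linear_matches(*sulfide, c_cno_c))
```

The first call returns three matches for pentane, one starting at each of the first three carbons, which mirrors the observation that a pattern can match a molecule in more than one place. The sulfur-containing molecule matches the wildcard pattern but not the pattern restricted to carbon, nitrogen, or oxygen.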

There is a lot more to SMARTS that is beyond the scope of this brief introduction. Interested readers are urged to read the “Daylight Theory Manual” to get deeper insight into SMILES and SMARTS.[1] As we will see in Chapter 11, SMARTS can be used to build up sophisticated queries that can identify molecules that may be problematic in biological assays.

Conclusion

In this chapter, you’ve learned the basics of molecular machine learning. After a brief review of basic chemistry, we explored how molecules have traditionally been represented for computing systems. You also learned about graph convolutions, which are a newer approach to modeling molecules in deep learning, and saw a complete working example of how to use machine learning on molecules to predict an important physical property. These techniques will serve as the foundations upon which later chapters will build.

[1] Daylight Chemical Information Systems, Inc. “Daylight Theory Manual.” 2011. http://www.daylight.com/dayhtml/doc/theory/
