T033 · Molecular representations

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

Talktorial T033: This talktorial is part of the TeachOpenCADD pipeline described in the TeachOpenCADD publication, consisting of Talktorials T033 to T038.

Aim of this talktorial

In this talktorial, we conduct the groundwork for the deep learning talktorials. Specifically, we learn about molecular representations and find that representing a molecule in a computer is not a trivial task. Different representations come with their specific implications and (dis-)advantages.

Contents in Theory

  • What is a molecule?

  • Molecular representations

  • Molecular Representations for Humans

  • Computer-age molecular representations

Contents in Practical

  • Conformers

  • Molecular graphs

  • Fingerprints

References

Theory

What is a molecule?

A molecule is a group of atoms that are connected through chemical bonds. The number of atoms can range from only a few (e.g., in low molecular weight drugs) to several thousand (e.g., in proteins) or even millions (e.g., in DNA molecules).

The physical force that is most relevant for the formation of molecules from atoms is the electrostatic force, which is the force that governs the behavior of charged particles. When atoms interact, their electrons can be repelled by the electrons of other atoms or attracted by their nucleus, leading to the formation of a chemical bond (here, we are mostly concerned with covalent bonds where electrons are shared between two atoms).

The formation of molecules gives a group of atoms a characteristic structure and behavior. The properties of a molecule emerge in a non-trivial way from (the interplay of) its constituent atoms. Molecules are the building blocks of living cells and play a key role in the function of living organisms.

In addition to the subdivision into small molecules and macromolecules, the distinction between organic and inorganic molecules is important. Organic molecules are the ones containing carbon-hydrogen or carbon-carbon bonds. This talktorial focuses on small and organic molecules.

Caffeine Molecule

Figure 1: The image is a ball-and-stick representation of a caffeine molecule. Atoms are shown in gray (carbon), white (hydrogen), red (oxygen), and blue (nitrogen). An edge between two atoms indicates a covalent chemical bond with one (single line) or two (double line) shared atoms (triple bonds are also possible, but not present in caffeine). The dashed 5-membered ring represents an aromatic ring system. Two electrons can move freely around in this ring and are not associated with a specific chemical bond within this ring.

Molecular representations

From the viewpoint of physics, molecules are complex three-dimensional objects that are dynamic rather than static and that exhibit quantum mechanical properties. Depending on the task you want to solve, a simplified view of molecules might be helpful.

To apply machine learning to molecules, we need to find a proper representation. As it turns out, there are several possibilities, all with their advantages and disadvantages. Choosing a representation of a molecule gives you control over the simplifications you want to make. They provide the recipe to translate a quantum-mechanical object into a mathematical description. But before we discuss computational representations of molecules, we first want to give a recap on pre-computer-age techniques to represent molecules.

Molecular representations for humans

Small molecules

Without being able to represent molecules, it is also impossible to discuss them and communicate about them. We start with representations and visualizations that are intuitive for humans. Later, we will also look at “computer-age” encodings, suitable as input for deep-learning systems. However, note that this classification is somewhat arbitrary.

Text-based:

The easiest way to identify a molecule is its trivial name. For example, caffeine or aspirin. This is only possible for well-known molecules.

Alternatively, the IUPAC nomenclature of organic chemistry can be used (there is also one for inorganic chemistry, which we will skip here). The goal is that each possible organic molecule is associated with a name that is easy to understand and that unambiguously identifies the molecule. The clarity and consistency make this naming scheme reasonably easy to understand and use. However, not all molecules can be named using this scheme; for instance, some complex organic molecules, such as those that contain rings with multiple substructures.

Example: The caffeine molecule has the IUPAC name: 1,3,7-trimethylpurine-2,6-dione.

Molecular formula:

The molecular formula (sometimes referred to as Hill notation) is the easiest (and most simplified) way of representing a molecule: You simply count the number of atoms of each type. The ordering (from left to right) is arguably a little bit arbitrary: Carbon atoms are listed first, hydrogen atoms next, and all other atoms follow in alphabetical order. However, in the unlikely case that the formula does not contain carbon, all atoms (including hydrogen) are listed alphabetically.

Example: For caffeine, the molecular formula is C8H10N4O2, indicating that there are 8 carbon (C) atoms, 10 hydrogen (H) atoms, 4 nitrogen (N) atoms, and 2 oxygen (O) atoms present.

Naturally, this does not uniquely identify a molecule.

Example: Molecules with the same molecular formula but with different arrangements in space are called isomers. One examples are n-butane and isobutane. Both have the molecular formula C4H10. The atoms in butane are arranged in a straight chain, while the atoms in isobutane are arranged in a branch structure. To be more precise, here, we are dealing with a structural or constitutional isomer because the chemical bonds are different. Molecules that have the same chemical bonds (or refer to the same molecular graph) but still differ in their spatial arrangement are called spatial (or stereo-)isomers.

N-butaneAndIsobutane

Figure 2: N-butane and isobutane.

Visualizations of small molecules:

Visualizations are the easiest way to get a first impression of a molecule. In the first example, we have already seen a visualization of the caffeine molecule. These types of visualizations are referred to as Ball and Stick models. The color scheme follows the CPK coloring. They are easy to understand but difficult to draw.

CPK coloring

Figure 3: CPK coloring from Wikipedia.

2D visualizations are easy to draw and come in many different flavors. For instance, the Lewis structure contains no 3D information (excess electrons that form lone pairs are sometimes shown as dots, we skip this part here).

Similarly, the Skeletal formula (or line-angle formula) removes hydrogen atoms that are next to carbon atoms and draws carbon atoms only implicitly as bends and ends.

For comparison, here are the three visualizations of the ethanol molecule (from WP).

Ethanol visualization

Figure 4: Different visualizations of ethanol.

A special feature of this is the Natta Projection which provides basic (but not in every case complete) information about the relative positions of the atoms in 3D. For instance, consider the kinase inhibitor from the RDKit Cookbook:

Kinase Inhibitor

Figure 5: Natta projection of a kinase inhibitor.

Here,

  • Solid wedges indicate a bond that points out of the plane;

  • Dashed wedges indicate a bond that points into the plane (away from the observer)

You can find the corresponding ball-and-stick plot here.

Large molecules - proteins

Representing proteins:

Proteins are large molecules that are composed of building blocks called amino acids. There exist 20 different natural amino acids. A protein is made up of one or more chains of amino acids. Each chain can be described as a string of characters. Therefore, we identify each of the amino acids with a single letter.

The Pro-thyrotropin-releasing hormone protein is one of the smallest proteins present in the human body. It consists of only 242 amino acids:

MPGPWLLLALALTLNLTGVPGGRAQPEAAQQEAVTAAEHPGLDDFLRQVERLLFLRENIQ
RLQGDQGEHSASQIFQSDWLSKRQHPGKREEEEEEGVEEEEEEEGGAVGPHKRQHPGRRE
DEASWSVDVTQHKRQHPGRRSPWLAYAVPKRQHPGRRLADPKAQRSWEEEEEEEEREEDL
MPEKRQHPGKRALGGPCGPQGAYGQAGLLLGLLDDLSRSQGAEEKRQHPGRRAAWVREPL
EE

However, the amino acid sequence does not contain 3D information, which is important for the function. You can predict it using tools like Alphafold or simply look it up in the Alphafold database.

Visualizations of proteins:

We focus here on small molecules but still want to point out that large molecules are typically visualized differently, most commonly by a Ribbon diagram. The idea is to group common motifs and represent them, for instance, with curls (alpha-Helices) and arrows (beta-strands) connected by loops.

TubbyProtein

Figure 6: Ribbon diagram of the tubby protein.

Talktorial T008 explains how to find proteins in the Protein Data Bank and process them.

Computer-age molecular representations

When we want to communicate or store a molecular description, we better make sure there is little ambiguity left. In machine learning, it is still an open research question when to use which representation. Text-based (or linear) and graph-based representations are becoming both very widely used. Moreover, fingerprint-based methods are sometimes a viable alternative, especially for traditional machine learning methods.

Text-based representations:

Text-based representations use a sequence of characters to specify a molecule. This is possible for practically all (small) molecules relevant in practice. Here, we discuss SMILES, InChI, and SELFIES. For a deeper dive, we refer the reader to the SMILES Talktorial T034.

SMILES (Simplified Molecular Input Line Entry Specification) is the most widely used text-based representation and can be handled by all common frameworks. When we specify a molecule in RDKit, we often use SMILES notation (more on this in the practical part):

mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")

SMILES uses a combination of letters, numbers, and symbols to represent the atoms and bonds in a molecule. The letters identify the elements, the symbols specify the connectivity/branching structure.

SMILES can also represent the chirality of a molecule. This is possible by specifying the relative angle of a bond for an atom. We will see an example in the practical part.

The main problem with SMILES for molecule representation is that two (or more) different SMILES strings might refer to the same molecule. Researchers try to circumvent this by resorting to a canonical SMILES notation. However, the canonicalization depends on the canonicalization algorithms and is therefore not standardized.

In the other direction, a single SMILES string typically identifies no more than one molecule. However, when stereochemistry information is not given in the SMILES string, it leaves room for ambiguity (in some cases, it might not even be possible to remove all ambiguity for different molecular configurations).

InChI (International Chemical Identifier) is a more modern and also widely-used alternative to SMILES. The key advantage is that it exhibits less chemical ambiguity and that a standard canonical exists. The downside is that it is difficult for humans to read.

SELFIES (SELF-referencing embedded string) was introduced in 2020, primarily for machine learning purposes. The advantage is that “every combination of symbols in the Selfies alphabet maps to a chemically valid graph” (Source). This is great for generative tasks where you want to generate molecules because it is impossible to generate invalid molecules.

To summarize, we find that different methods to encode molecules as strings exist. Currently, SMILES is the most widely adopted method and tools support it well, so it will probably be a good starting point.


Molecular graphs:

Representing molecules as graphs allows for a very intuitive and comprehensive representation of a molecule’s structure. In a graph-based representation, atoms are represented as (labeled) nodes, and bonds are represented as (labeled) edges.

One possibility to specify a graph is with an adjacency matrix \(A\). For undirected graphs (which is the case for molecules), the adjacency matrix is symmetric. An entry \(a_{ij}\) indicates the presence (\(a_{ij} = 1\)) or absence (\(a_{ij} = 0\)) of an edge.

One key property of graphs is that they rely on node ordering. The following two matrices identify identical graphs, except for the node ordering. We call these graphs isomorphic.

The fact that we need to pick a node ordering to represent a molecule as a graph means that many graphs represent the same molecule. This problem can be circumvented by restricting ourselves to so-called permutation invariant functions, as we will learn in the next section.

TwoIsomorphicGraphs

Figure 7: Two isomorphic graphs with different node ordering.

Permutation invariance: Assume you build a machine learning model that takes as input molecular graphs and outputs some prediction. It would be desirable that your model guarantees that isomorphic graphs (like Graph 1 and Graph 2) generate the same output. We call neural networks (or functions in general) that have these guarantees (node-)permutation invariant (or equivariant for node-level outputs). However, there is a trade-off. Functions that are permutation invariant are typically not universal. That is, they are not able to tell all graphs that are non-isomorphic apart. If both were given, permutation invariance and the ability to produce a different output for all non-isomorphic graphs, our neural network would solve the graph isomorphism problem (which is computationally extremely difficult). This is also discussed in the Talktorial T035 about graph neural networks.

Representational power: Another problem is that graphs do not contain 3D information. Specifically, different isomers can correspond to the same molecular graph but differ in the relative 3D positions of the atoms. These are called spatial isomers. One can circumvent this (to some degree) by adding 3D information to the node features. This is somewhat ad-hoc and not well-principled. It is also unclear in which cases this is necessary to improve the performance of a predictive model. In the practical part, we will visualize 2-butanol that admits two mirror images (called enantiomer) with the same graph but with different geometry.

Butanol_Enantiomer

Figure 8: The two enantiomers of 2-butanol (source: WP).


Point clouds:

We have already established that molecules are three-dimensional objects. So why not directly use a computational representation that captures the 3D information? The most straightforward way to do this is to annotate the nodes/atoms in a molecular graph with Euclidean 3D coordinates \((x,y,z)\).

The spatial arrangement of atoms is called molecular configuration.

Instead of equipping a graph with 3D coordinates, one can also disregard the information on chemical bonds as this paper claims:

The covalent bonds between atoms do not need to be encoded explicitly because they are attributed by the overlap between the atomic orbitals, and can be inferred from the types and 3D coordinates of respective atoms. In principle, the point-based representation captures the complete structural information about the molecule, and thus serves as the adequate input for molecular representation learning.

SE(3)-invariance: When equipping the atoms of a molecule with their (absolute) 3D position, we have a similar problem as with the node ordering in the previous section. Instead of “Which node ordering do we choose?” We now have to ask “How do we position the molecule in 3D to measure the atom positions?”. As in the node-ordering case, it is difficult (if not impossible) to give a principled answer to this question. The solution is again to restrict ourselves to neural networks that are invariant to translations (move the whole molecule in \(x\), \(y\), or \(z\) direction) or rotations of the molecule. Neural networks (or functions in general) that can guarantee that the output is not affected by translations and rotations are called SE(3)-invariant. One example of architecture can be found here. One possibility to design SE(3)-invariant neural networks is to consider only the distance between all pairs of atoms instead of their absolute position. This leads not only to SE(3)-invariance but to a broader class called E(3)-invariance. This is because also reflections (mirror images) are guaranteed to produce the same output. For molecules, this might be suboptimal because mirror molecules can exhibit different properties (see chirality).

Determining configurations: When you want to add 3D coordinates to atoms, you first need to find them somehow. This is not straightforward, and one can even train a machine learning model to predict this. One possibility is to determine the configuration experimentally. For small molecules, techniques like diffraction methods can be used. For proteins, other techniques, like X-ray crystallography, are available.
Alternatively, one can resort to computational methods. For instance, the popular dataset QM9 provides 3D coordinates of each atom that were computed using DFT (an approximation to the Schrödinger equation).

Conformers: However, another problem of the point cloud perspective on molecules is that there is not a single, fixed configuration for a molecule. Instead, molecules are dynamic objects. Molecular conformers are different configurations that a molecule can take due to rotations around single bonds (double and triple bonds are non-rotatable). These conformations are a result of the rotational energy barriers that exist between different groups of atoms within a molecule. Conformations can change quickly and easily, e.g., in response to changes in temperature. Note that conformers are a special type of isomers. Not all isomers can be reached by rotating single bonds.

Conformer

Figure 9: Some conformers of a simple molecule (Source paper via Datamol tutorial on top of each other.

Each specific conformer is associated with a conformation energy. The likelihood of a specific configuration depends on this energy (low-energy configurations are more likely than high-energy configurations). Specifically, we typically observe local minima. When representing a conformer, one might also store the torsion angles at rotatable bonds instead of the 3D position of each atom (as done here).

Energy landscape

Figure 10: Energy landscape of the molecule 3-Methoxyphenol (Source). The four local minima correspond to four stable conformations.

Note that SE(3)-invariant machine learning models are not invariant to bond rotations. Specifically, they can confuse different conformations of the same molecule. For a detailed discussion on this issue, we refer to the paper Learning 3D Representations of Molecular Chirality with Invariance to Bond Rotations.


Fingerprints:

Molecular fingerprints are representations based on the molecular graph. Typically, they are binary vectors that are relatively sparse (many 0s, very few 1s).

The most commonly used fingerprint is ECFP4. MAP4 is a newer alternative.

Fingerprints do not uniquely identify a molecule. Different molecules can have the same fingerprint. However, the same molecule has a unique fingerprint (there is no canonicalization problem). It is difficult (if not practically impossible) to recover the molecular graph, given the fingerprint.

They can be easily used for classical machine learning tasks because the architecture does not need to be invariant/equivariant to the node-ordering or geometric operations.

Talktorial T004 explains several molecular fingerprints.


Learned representations:

Another interesting line of research is the construction of learned molecular representation. Using a machine learning model to learn representations (from other representations)  can be useful in several ways. For instance, the paper Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules uses a variational auto-encoder to learn a latent and continuous representation based on SMILES.

The paper Learning continuous and data-driven molecular descriptors by translating equivalent chemical representations builds a latent representation based on the translation from one representation to another (e.g., SMILES into the IUPAC name).

One can also use contrastive learning to optimize the latent space of such learned representations.


Practical

In this section, we study how RDKit handles different molecular representations and visualizations. We start with the caffeine molecule from Figure 1. We define the molecule based on a SMILES string and find the visualization in a line-angle formula. But first, we import all libraries:

[1]:
import matplotlib.pyplot as plt
import networkx as nx  # for graphs
import numpy as np  # for matrices
import time

import nglview as nv  # for 3D visualizations

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from rdkit.Chem import AllChem, rdDistGeom
from rdkit.Chem import rdFingerprintGenerator

IPythonConsole.ipython_useSVG = True
[2]:
# Here we define a caffeine molecule using Smiles.
mol_caffeine = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
mol_caffeine
[2]:

Let us now take consider chirality. For instance, consider 2-Butanol. The molecule has two configurations that are mirror images of each other (also called enantiomer). For more information and a 3D visualization, we refer to this tutorial.

[3]:
mol1 = Chem.MolFromSmiles("CC[CH](C)O")  # without chiral information
mol2 = Chem.MolFromSmiles("CC[C@H](C)O")  # enantiomer 1
mol3 = Chem.MolFromSmiles("CC[C@@H](C)O")  # enantiomer 2

Draw.MolsToGridImage([mol1, mol2, mol3], molsPerRow=3, subImgSize=(200, 200))
[3]:
../_images/talktorials_T033_molecular_representations_35_0.svg

NGLViewer allows us to see a ball-and-stick visualization (example taken from this tutorial, we also refer the reader to Talktorial T009).

[4]:
m = Chem.AddHs(
    Chem.MolFromSmiles("CC[C@@H](C)O")
)  # replace with 'CC[C@H](C)O' for other enantiomer
AllChem.EmbedMultipleConfs(m, useExpTorsionAnglePrefs=True, useBasicKnowledge=True)
view = nv.show_rdkit(m)
view

Conformers

We can generate a set of n conformers using rdkit’s improved conformer generator ETKDGv3.

We use aspirin as example molecule.

[5]:
# generate molecule from Smiles
aspirin = Chem.MolFromSmiles("CC(=O)OC1=CC=CC=C1C(=O)O")

Generate 200 conformers

[6]:
# add hydrogens
aspirin_h = Chem.AddHs(aspirin)
# generate conformers
ps = rdDistGeom.ETKDGv3()
ps.randomSeed = 0xD06F00D
ps.numThreads = 10
conf_ids = rdDistGeom.EmbedMultipleConfs(aspirin_h, 200, ps)
len(conf_ids)
[6]:
200
[7]:
# remove hydrogens again
aspirin_3d = Chem.RemoveHs(aspirin_h)
[8]:
# print the x,y,z coordinates of the 4 atoms of the first conformer.
aspirin_3d.GetConformer(0).GetPositions()
[8]:
array([[ 3.17182120e+00,  1.13983512e+00,  6.12133098e-01],
       [ 1.98735544e+00,  3.90842627e-01,  9.18770273e-02],
       [ 1.95458986e+00, -8.18501481e-02, -1.07293595e+00],
       [ 8.66040249e-01,  1.97000500e-01,  9.10742864e-01],
       [-2.13996074e-01, -5.02097657e-01,  3.95340332e-01],
       [-3.07557851e-01, -1.87552241e+00,  5.04952311e-01],
       [-1.40169144e+00, -2.53920230e+00, -2.24731485e-02],
       [-2.39472408e+00, -1.80931156e+00, -6.57708799e-01],
       [-2.31798266e+00, -4.42869698e-01, -7.75777129e-01],
       [-1.20381738e+00,  2.11326811e-01, -2.36122170e-01],
       [-1.08676636e+00,  1.64504351e+00, -3.38093829e-01],
       [-1.97505636e+00,  2.30929967e+00, -9.08553274e-01],
       [ 1.07226488e-03,  2.30152343e+00,  1.89435012e-01]])
[9]:
def show_conformers(molecule, ids):
    """Generate a view of the ligand conformations.

    Parameters
    -----------
    molecule: rdkit.Chem.rdchem.Mol

    Returns
    ----------
    nglview.widget.NGLWidget
    """

    view = nv.NGLWidget()
    print(type(molecule))
    for i in range(0, ids):
        mb = Chem.MolToMolBlock(molecule, confId=i)
        component = view.add_component(mb, ext="sdf")
        time.sleep(0.1)
        component.clear()
        component.add_ball_and_stick(multipleBond=True)
    return view
[10]:
# print the x,y,z coordinates of the 4 atoms of the first conformer.
aspirin_3d.GetConformer(0).GetPositions()
[10]:
array([[ 3.17182120e+00,  1.13983512e+00,  6.12133098e-01],
       [ 1.98735544e+00,  3.90842627e-01,  9.18770273e-02],
       [ 1.95458986e+00, -8.18501481e-02, -1.07293595e+00],
       [ 8.66040249e-01,  1.97000500e-01,  9.10742864e-01],
       [-2.13996074e-01, -5.02097657e-01,  3.95340332e-01],
       [-3.07557851e-01, -1.87552241e+00,  5.04952311e-01],
       [-1.40169144e+00, -2.53920230e+00, -2.24731485e-02],
       [-2.39472408e+00, -1.80931156e+00, -6.57708799e-01],
       [-2.31798266e+00, -4.42869698e-01, -7.75777129e-01],
       [-1.20381738e+00,  2.11326811e-01, -2.36122170e-01],
       [-1.08676636e+00,  1.64504351e+00, -3.38093829e-01],
       [-1.97505636e+00,  2.30929967e+00, -9.08553274e-01],
       [ 1.07226488e-03,  2.30152343e+00,  1.89435012e-01]])
[11]:
def show_conformers(molecule, ids):
    """Generate a view of the ligand conformations.

    Parameters
    -----------
    molecule: rdkit.Chem.rdchem.Mol

    Returns
    ----------
    nglview.widget.NGLWidget
    """

    view = nv.NGLWidget()
    print(type(molecule))
    for i in range(0, ids):
        mb = Chem.MolToMolBlock(molecule, confId=i)
        component = view.add_component(mb, ext="sdf")
        time.sleep(0.1)
        component.clear()
        component.add_ball_and_stick(multipleBond=True)
    return view
[12]:
view = show_conformers(aspirin_3d, 5)
view
<class 'rdkit.Chem.rdchem.Mol'>

Molecular graphs

A common tool to work with graphs is called networkX. We use it to create two isomorphic graphs.

[15]:
# create two adjacency matrices
adj_matrix_1 = np.array([[0, 0, 0, 1], [0, 0, 0, 1], [0, 0, 0, 1], [1, 1, 1, 0]])
adj_matrix_2 = np.array([[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]])
print("Matrix 1:\n", adj_matrix_1)
print("\nMatrix 2:\n", adj_matrix_2)

# convert adjacency matrices to graphs
graph_1 = nx.from_numpy_array(adj_matrix_1)
graph_2 = nx.from_numpy_array(adj_matrix_2)
Matrix 1:
 [[0 0 0 1]
 [0 0 0 1]
 [0 0 0 1]
 [1 1 1 0]]

Matrix 2:
 [[0 1 1 1]
 [1 0 0 0]
 [1 0 0 0]
 [1 0 0 0]]
[16]:
# draw
print("\nAnd the corresponding graphs:\n")
plt.subplot(1, 2, 1)
plt.title("Graph 1")
nx.draw_networkx(graph_1, with_labels=True)
plt.subplot(1, 2, 2)
plt.title("Graph 2")
nx.draw_networkx(graph_2, with_labels=True)

And the corresponding graphs:

../_images/talktorials_T033_molecular_representations_52_1.png

Practical considerations:

Building a molecular graph requires the labeling of nodes and edges. Common frameworks support this. For machine learning, PyTorch Geometric is the most popular framework for building permutation-invariant neural networks. We provide a demonstration in Talktorials T035/T036.

Converting a SMILES string or RDKit object to a PyTorch Geometric graph is not complicated, but somewhat technical. One possible implementation is explained here. Similarly, one can create a networkX graph as explained here.

As you see, molecular graph representations are not standardized and leave a lot of design choices to the user. These include:

  • What are the node features? Only the atom type, or also information on chirality? One can also include the information if the atom is part of a ring.

  • For the edge features: Do you only include the bond type (single, double, triple) or also the bond length? Do you use one-hot encodings or scalars?

  • Do you want to include hydrogen atoms that are bound to carbon atoms, or leave them out?

  • Do you want to add a master node (a dummy atom connected to all other atoms)? It might help some ML algorithms.

  • Do you want to explicitly indicate some substructures in the molecular graph?

Fingerprints

Generate circular fingerprint (ecfp) and print information.

[17]:
fpg = rdFingerprintGenerator.GetMorganGenerator()
fpg.GetInfoString()
[17]:
'Common arguments : countSimulation=0 fpSize=2048 bitsPerFeature=1 includeChirality=0 --- MorganArguments onlyNonzeroInvariants=0 radius=3 --- MorganEnvironmentGenerator --- MorganInvariantGenerator includeRingMembership=1 --- MorganInvariantGenerator useBondTypes=1 useChirality=0'
[18]:
fp = fpg.GetFingerprintAsNumPy(mol_caffeine)
[19]:
def visualize_fingerprint_info(fp):
    print("Number of elements: ", fp.size)
    print("Types of entries: ", set(list(fp)))
    print("Number of Ones: ", np.count_nonzero(fp))
    plt.vlines(
        [i for i in range(fp.size) if fp[i] > 0.5], ymin=0, ymax=300
    )  # 300 gives a nice aspect ratio
    plt.vlines(fp.size, ymin=0, ymax=0.0)  # dummy to calibrate scale
    plt.gca().set_aspect("equal")
[20]:
visualize_fingerprint_info(fp)
Number of elements:  2048
Types of entries:  {0, 1}
Number of Ones:  34
../_images/talktorials_T033_molecular_representations_58_1.png

Try another fingerprint: Rdkit fingerprint

[21]:
fpg2 = rdFingerprintGenerator.GetRDKitFPGenerator()
fp2 = fpg2.GetFingerprintAsNumPy(mol_caffeine)
[22]:
# The "rdkit" fingerprint is less sparse than ECFP
visualize_fingerprint_info(fp2)
Number of elements:  2048
Types of entries:  {0, 1}
Number of Ones:  759
../_images/talktorials_T033_molecular_representations_61_1.png

Discussion

We have discussed different methods for molecular representations:

  • A molecule is a bag/multiset of atoms (molecular formula). This representation cannot differentiate isomers.

  • A molecule is a labeled graph (molecular graphs) or a derived representation (SMILES, fingerprints). This representation can differentiate structural isomers and in some cases some (but not all) spatial isomers.

  • A molecule is a graph, equipped with 3D coordinates (point clouds). This configuration can be predicted or determined experimentally.

  • A molecule is a manipulable 3D object, e.g., with rotational bonds. Isomers that can be reached by rotating single bonds are called conformers.

We have also learned that intermolecular forces are only explicitly represented when they lead to chemical bonds (and sometimes not even then). Other forces might be represented implicitly in the 3D configuration.

We have not discussed methods that only consider the surface area of a molecule.

Quiz

  • Can the same molecule correspond to different molecular graphs?

  • True or false: When you mirror/rotate a molecule, its properties remain the same.

  • Can an ML model predict different properties, depending on which SMILES canonicalization you use as input?

  • Can an ML model predict different properties, depending on which node ordering you use as input?

  • True or false: Only forces between atoms that lead to chemical bonds are relevant to a molecule’s properties.