T009 · Ligand-based pharmacophores

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

  • Pratik Dhakal, CADD seminar, 2017, Charité/FU Berlin

  • Florian Gusewski, CADD seminar, 2018, Charité/FU Berlin

  • Jaime Rodríguez-Guerra, Volkamer lab, Charité

  • Dominique Sydow, Volkamer lab, Charité

Talktorial T009: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising of talktorials T001-T010.

Note: Please run this notebook cell by cell. Running all cells in one is possible also, however, part of the nglview 3D representations might be missing.

Aim of this talktorial

In this talktorial, we use known EGFR ligands, which were selected and aligned in the previous talktorial, to identify donor, acceptor, and hydrophobic pharmacophoric features for each ligand. Those features are then clustered to define an ensemble pharmacophore, which represents the properties of the set of known EGFR ligands and can be used to search for novel EGFR ligands via virtual screening.

Learning goals

Contents in Theory

  • Pharmacophore modeling

    • Structure- and ligand-based pharmacophore modeling

  • Virtual screening with pharmacophores

  • Clustering: k-means

Contents in Practical

  • Get pre-aligned ligands from previous talktorial

  • Show ligands with NGLView

  • Extract pharmacophore features

  • Show the pharmacophore features of all ligands

    • Hydrogen bond donors

    • Hydrogen bond acceptors

    • Hydrophobic contacts

  • Collect coordinates of features per feature type

  • Generate ensemble pharmacophores

    • Set static parameters for k-means clustering

    • Set static parameters for cluster selection

    • Define k-means clustering and cluster selection functions

    • Cluster features

    • Select relevant clusters

    • Get selected cluster coordinates

  • Show clusters

    • Hydrogen bond donors

    • Hydrogen bond acceptors

    • Hydrophobic contacts

  • Show ensemble pharmacophore

References

Theory

Pharmacophores

In computer-aided drug design, the description of drug-target interactions with pharmacophores is a well-established method. The term pharmacophore was defined in 1998 by a IUPAC working party: “A pharmacophore is the ensemble of steric and electronic features that is necessary to ensure the optimal supramolecular interactions with a specific biological target structure and to trigger (or to block) its biological response.” (Pure & Appl. Chem. (1998), 70, 1129-43)

In other words, a pharmacophore consists of several pharmacophoric features, which describe important steric and physico-chemical properties of a ligand observed to bind a target under investigation. Such physico-chemical properties (also called feature types) can be hydrogen bond donors/acceptors, hydrophobic/aromatic interactions, or positively/negatively charged groups, and the steric properties are defined by the 3D arrangement of these features.

Structure- and ligand-based pharmacophore modeling

In pharmacophore modeling, two main approaches are used, depending on the biological question and available data sources, i.e. structure- and ligand-based pharmacophore modeling.

Structure-based pharmacophore models are derived from protein-ligand complexes. Features are defined by observed interactions between the protein and ligand, ensuring that only those ligand moieties are used for virtual screening that have already been shown to be involved in ligand binding. However, structures of protein-ligand complexes are not available for all targets. In this case, either complex structures can be generated by modeling the ligand into the target binding site, e.g. via molecular docking, or pharmacophore modeling methods can be invoked that only use the target binding site to detect potential protein-ligand interaction sites.

Ligand-based pharmacophore models are based on a set of ligands known to bind the target under investigation. The common chemical features of these ligands build the pharmacophore model. This method is used for targets with multiple known ligands and in case of missing protein-ligand complex structures. In this talktorial, we will use ligand-based pharmacophore modeling using a set of known EGFR ligands.

For more information on pharmacophore modeling, we recommend Pharmacophore Perception and Applications: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 259-82 and J. Chem. Inf. Model. (2005), 45, 160-9.

Structure-based pharmacophore representing protein-ligand interactions (figure by Dominique Sydow)

Figure 1: Structure-based pharmacophore representing protein-ligand interactions. Figure created by Dominique Sydow.

Virtual screening with pharmacophores

As described earlier in Talktorial 4, virtual screening (VS) describes the screening of a query, e.g. a pharmacophore model (T009) or a query compound (T004), against a large library of compounds, in order to identify the small molecules in the library that are most likely to bind a target under investigation (represented by the query). In pharmacophore-based virtual screening, the compound library is matched compound-by-compound into a pharmacophore model and ranked by the best matching results (Structure-Based Virtual Screening: Applied Chemoinformatics, Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim, (2018), 1, 313-31).

Clustering: k-means

In this talktorial we will generate an ensemble pharmacophore by clustering the feature points of several ligand-based pharmacophores. The clustering algorithm uses the k-means clustering, which aims to cluster a data set into k clusters:

  1. k different centroids are selected and each point of the data set is assigned to its closest centroids.

  2. New centroids are calculated based on the current clusters and each point of the data set is newly assigned to its closest centroids.

  3. This procedure is repeated until the centroids are stable.

More info about k-means in Wikipedia.

Practical

[1]:
import collections
from collections import Counter
from pathlib import Path
import operator
import time
import math

import pandas as pd
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn import datasets, cluster

from rdkit import RDConfig, Chem, Geometry, DistanceGeometry
from rdkit.Chem import (
    ChemicalFeatures,
    rdDistGeom,
    Draw,
    rdMolTransforms,
    AllChem,
)
from rdkit.Chem.Draw import DrawingOptions
from rdkit.Chem.Pharm3D import Pharmacophore, EmbedLib
from rdkit.Numerics import rdAlignment
import nglview as nv

Set the Path to the current notebook.

[2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

Get pre-aligned ligands from previous talktorial

We retrieve all ligands that were aligned in the previous talktorial.

First, we get the file paths to all ligand PDB files.

[3]:
mol_files = sorted(Path(HERE / "../T008_query_pdb/data/").glob("*_lig.pdb"))
print([f.name for f in mol_files])
['5HG8_lig.pdb', '5UG8_lig.pdb', '5UG9_lig.pdb', '5UGC_lig.pdb']
[4]:
# NBVAL_CHECK_OUTPUT
pdb_ids = [Path(file).stem.split("_")[0] for file in mol_files]
pdb_ids
[4]:
['5HG8', '5UG8', '5UG9', '5UGC']

Second, we read all ligands from these PDB files using RDKit.

[5]:
# NBVAL_CHECK_OUTPUT
molecules = []
for mol_file in mol_files:
    molecule = Chem.MolFromPDBFile(str(mol_file), removeHs=False)
    if molecule is None:
        print(mol_file, "Could not be read")
    else:
        Chem.SanitizeMol(molecule)
        print(Chem.MolToSmiles(molecule))
        molecules.append(molecule)
print(f"Number of molecules: {len(molecules)}")
CCC(O)NC1CCCC(OC2NC(NC3CNN(C)C3)NC3NCCC32)C1
CCC(O)NC1CN(C2NC(NC3CNN(C)C3)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C(C)C)C3N2)CC1F
CCC(O)NC1CN(C2NC(NC3CN(C)NC3OC)C3NCN(C)C3N2)CC1F
Number of molecules: 4
[6]:
Draw.MolsToGridImage(
    molecules,
    molsPerRow=4,
    legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
[6]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_22_0.png

Upon inspection of the structures we notice, that we encountered a problem here: When loading ligands from a PDB file, RDKit does not assign bonds in structures like aromatic rings correctly. We use the RDKit function AssignBondOrdersFromTemplate, which assigns bonds to a molecule based on a reference molecule, e.g. in our case based on the SMILES pattern of the molecule.

Check for further information: RDKit discussion on Aromaticity of non-protein molecules in PDB not detected and RDKit documentation on AssignBondOrdersFromTemplate.

[7]:
# Load SMILES for PDB ligand structures
ligands = pd.read_csv(HERE / "../T008_query_pdb/data/PDB_top_ligands.csv")

# Get SMILES in the same order as in pdb_ids
ligand_smiles = [
    ligands[ligands["@structureId"] == pdb_id]["smiles"].values[0] for pdb_id in pdb_ids
]

# Generate RDKit Mol object from SMILES
reference_molecules = [Chem.MolFromSmiles(smiles) for smiles in ligand_smiles]

# Assign bond orders to molecules based on SMILES patterns of reference molecules
molecules = [
    AllChem.AssignBondOrdersFromTemplate(reference_molecule, molecule)
    for reference_molecule, molecule in zip(reference_molecules, molecules)
]
[8]:
Draw.MolsToGridImage(
    molecules,
    molsPerRow=4,
    legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
[8]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_25_0.png

Now the correct bonds are included. We can also have a look at the molecules in 2D. In order to keep the original coordinates, we make a copy for the 2D representation.

[9]:
molecules_2d = []
for molecule in molecules:
    tmp = Chem.Mol(molecule)
    AllChem.Compute2DCoords(tmp)
    molecules_2d.append(tmp)
Draw.MolsToGridImage(
    molecules_2d,
    molsPerRow=4,
    legends=[f"From PDB ID: {pdb_id}" for pdb_id in pdb_ids],
)
[9]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_27_0.png

Visualize with nglview

In the next step we visualize the ligand molecules with nglview using the following function. The molecules will be drawn on top of each other in a single view. This way we are able to visually compare strucural differences between the ligands.

[10]:
def show_ligands(molecules):
    """Generate a view of the ligand molecules.

    Parameters
    -----------
    molecules: list of rdkit.Chem.rdchem.Mol

    Returns
    ----------
    nglview.widget.NGLWidget
    """
    view = nv.NGLWidget()
    for molecule in molecules:
        component = view.add_component(molecule)
        time.sleep(0.1)
        component.clear()
        component.add_ball_and_stick(multipleBond=True)
    return view
[11]:
view = show_ligands(molecules)
view
[12]:
view.render_image(trim=True, factor=2, transparent=True);
[13]:
view._display_image()
[13]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_33_0.png

Extract pharmacophore features

As described above, the aim of this talktorial is to generate a ligand-based ensemble pharmacophore from a set of ligands. First, we need to extract pharmacophore features per ligand. Therefore, we load a feature factory. We use the default feature definitions.

See also the RDKit documentation on chemical features and pharmacophores.

[14]:
feature_factory = AllChem.BuildFeatureFactory(str(Path(RDConfig.RDDataDir) / "BaseFeatures.fdef"))

We take a look at the pharmacophore features that are implemented in RDKit:

[15]:
list(feature_factory.GetFeatureDefs().keys())
[15]:
['Donor.SingleAtomDonor',
 'Acceptor.SingleAtomAcceptor',
 'NegIonizable.AcidicGroup',
 'PosIonizable.BasicGroup',
 'PosIonizable.PosN',
 'PosIonizable.Imidazole',
 'PosIonizable.Guanidine',
 'ZnBinder.ZnBinder1',
 'ZnBinder.ZnBinder2',
 'ZnBinder.ZnBinder3',
 'ZnBinder.ZnBinder4',
 'ZnBinder.ZnBinder5',
 'ZnBinder.ZnBinder6',
 'Aromatic.Arom4',
 'Aromatic.Arom5',
 'Aromatic.Arom6',
 'Aromatic.Arom7',
 'Aromatic.Arom8',
 'Hydrophobe.ThreeWayAttach',
 'Hydrophobe.ChainTwoWayAttach',
 'LumpedHydrophobe.Nitro2',
 'LumpedHydrophobe.RH6_6',
 'LumpedHydrophobe.RH5_5',
 'LumpedHydrophobe.RH4_4',
 'LumpedHydrophobe.RH3_3',
 'LumpedHydrophobe.tButyl',
 'LumpedHydrophobe.iPropyl']

As an example, we extract all features from the first molecule.

[16]:
example_molecule = molecules[0]
example_molecule
[16]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_39_0.png
[17]:
# NBVAL_CHECK_OUTPUT
features = feature_factory.GetFeaturesForMol(example_molecule)
print(f"Number of features found: {len(features)}")
Number of features found: 16

The type of a feature is called family in RDKit and can be retrieved with GetFamily().

[18]:
# NBVAL_CHECK_OUTPUT
features[0].GetFamily()
[18]:
'Donor'

First we get the frequency of feature types for our example molecule.

[19]:
# NBVAL_CHECK_OUTPUT
feature_frequency = collections.Counter([feature.GetFamily() for feature in features])
feature_frequency
[19]:
Counter({'Donor': 4,
         'Acceptor': 5,
         'Aromatic': 4,
         'Hydrophobe': 2,
         'LumpedHydrophobe': 1})

Now apply the functions shown above to all molecules in our ligand set. We calculate the feature type frequency per molecule, then we display it in a DataFrame.

[20]:
molecule_feature_frequencies = []
for molecule in molecules:
    features = [feature.GetFamily() for feature in feature_factory.GetFeaturesForMol(molecule)]
    feature_frequency = collections.Counter(features)
    molecule_feature_frequencies.append(feature_frequency)

feature_frequencies_df = (
    pd.DataFrame(
        molecule_feature_frequencies,
        index=[f"Mol{i}" for i, _ in enumerate(molecules, 1)],
    )
    .fillna(0)
    .astype(int)
)
feature_frequencies_df.transpose()
[20]:
Mol1 Mol2 Mol3 Mol4
Donor 4 2 2 2
Acceptor 5 6 7 7
Aromatic 4 3 3 3
Hydrophobe 2 1 1 1
LumpedHydrophobe 1 1 1 0
PosIonizable 0 1 1 1

Furtheron, we concentrate in this talktorial only on the following feature types: hydrogen bond acceptors (acceptors), hydrogen bond donors (donors), and hydrophobic contacts (hydrophobics).

We retrieve the feature RDKit objects per feature type and per molecule.

[21]:
acceptors = []
donors = []
hydrophobics = []

for molecule in molecules:
    acceptors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Acceptor"))
    donors.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Donor"))
    hydrophobics.append(feature_factory.GetFeaturesForMol(molecule, includeOnly="Hydrophobe"))

features = {
    "donors": donors,
    "acceptors": acceptors,
    "hydrophobics": hydrophobics,
}

Show the pharmacophore features of all ligands

Pharmacophore feature types usually are displayed in defined colors, e.g. usually hydrogen bond donors, hydrogen bond acceptors, and hydrophobic contacts are colored green, red, and yellow, respectively.

[22]:
feature_colors = {
    "donors": (0, 0.9, 0),  # Green
    "acceptors": (0.9, 0, 0),  # Red
    "hydrophobics": (1, 0.9, 0),  # Yellow
}
[23]:
def visualize_features(
    molecules,
    features,
    feature_type="features",
    color="yellow",
    sphere_radius=0.5,
):
    """Generate a view of the molecules highlighting the specified feature type.

    Parameters
    -----------
    molecules: list of rdkit.Chem.rdchem.Mol
        molecules to be visualized
    features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature
        extracted features from molecules chosen to be highlighted
    feature_type: string, optional
        name of the feature to be highlighted
    color: string, optional
        color used to display the highlighted features
    sphere_radius: float, optional
        display size of the highlighted features

    Returns
    ----------
    nglview.widget.NGLWidget
    """
    print(f"Number of {feature_type} in all ligands: {sum([len(i) for i in features])}")
    view = show_ligands(molecules)
    for i, feature_set in enumerate(features, 1):
        for feature in feature_set:
            loc = list(feature.GetPos())
            label = f"{feature_type}_{i}"
            view.shape.add_sphere(loc, color, sphere_radius, label)
    return view

We use this function to visualize the features for the feature types under consideration.

Hydrogen bond donors

[24]:
feature_type = "donors"
view = visualize_features(
    molecules,
    features[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of donors in all ligands: 10
[25]:
view.render_image(trim=True, factor=2, transparent=True);
[26]:
view._display_image()
[26]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_56_0.png

Hydrogen bond acceptors

[27]:
feature_type = "acceptors"
view = visualize_features(
    molecules,
    features[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of acceptors in all ligands: 25
[28]:
view.render_image(trim=True, factor=2, transparent=True);
[29]:
view._display_image()
[29]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_60_0.png

Hydrophobic contacts

[30]:
feature_type = "hydrophobics"
view = visualize_features(
    molecules,
    features[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of hydrophobics in all ligands: 5
[31]:
view.render_image(trim=True, factor=2, transparent=True);
[32]:
view._display_image()
[32]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_64_0.png

Collect coordinates of features per feature type

We want to cluster features per feature type. To do this we collect all coordinates of features and store them per feature type.

[33]:
features_coord = {
    "donors": [list(item.GetPos()) for sublist in features["donors"] for item in sublist],
    "acceptors": [list(item.GetPos()) for sublist in features["acceptors"] for item in sublist],
    "hydrophobics": [
        list(item.GetPos()) for sublist in features["hydrophobics"] for item in sublist
    ],
}

Now, we have the positions of e.g. all acceptor features:

[34]:
features_coord["acceptors"]
[34]:
[[-13.49, 14.732, -27.925],
 [-11.99, 13.164, -28.961],
 [-13.173, 17.112, -28.025],
 [-16.131, 17.359, -22.003],
 [-16.873, 11.631, -25.856],
 [-13.808, 14.802, -27.137],
 [-12.146, 16.255, -28.067],
 [-11.385, 12.957, -29.278],
 [-16.736, 10.854, -25.879],
 [-15.435, 19.394, -27.21],
 [-16.039, 17.251, -22.54],
 [-13.863, 14.66, -27.134],
 [-12.175, 16.118, -27.994],
 [-11.411, 12.856, -29.266],
 [-17.026, 10.809, -26.396],
 [-15.601, 9.901, -28.086],
 [-15.445, 19.229, -27.215],
 [-16.044, 17.165, -22.506],
 [-13.831, 14.581, -27.162],
 [-12.161, 16.037, -28.079],
 [-11.381, 12.733, -29.268],
 [-17.02, 10.739, -26.48],
 [-15.542, 9.814, -28.111],
 [-15.515, 19.1, -27.334],
 [-16.09, 17.203, -22.553]]

Generate ensemble pharmacophores

In order to generate ensemble pharmacophores, we use k-means clustering to cluster features per feature type.

Set static parameters for k-means clustering

  • kq: With this paramter, we determine the number of clusters k per feature type depending on the number of feature points, i.e. per feature type

  • k = number_of_features / kq

  • kq should be selected so that k (feature clusters) is for all clusters at least 1 and not larger than 4-5 clusters

We set our kq used to determine k in k-means as follows:

[35]:
kq = 7

Set static parameters for cluster selection

  • min_cluster_size: We only want to retain clusters that potentially contain features from most molecules in our ligand ensemble. Therefore, we set this variable to 75% of the number of molecules in our ligand ensemble.

  • top_cluster_number: With this parameter, we select only the largest clusters.

[36]:
min_cluster_size = int(len(molecules) * 0.75)
top_cluster_number = 4

Define k-means clustering and cluster selection functions

We define a function that calculates the centers of clusters, which are derived from k-means clustering.

[37]:
def clustering(feature_coord, kq):
    """
    Compute the k-means clustering of input feature coordinates.

    Parameters
    -----------
    feature_coord: list
        feature coordinates extracted from the features of a set of molecules
    kq: int
        Quotient used when calculating number of features k

    Returns
    ----------
    sklearn.cluster._kmeans.KMeans
    """
    # Define parameter k as feature number divided by "k quotient"
    k = math.ceil(len(feature_coord) / kq)
    # Tailor-made adaption of k for hydrophobics in for the example in this talktorial
    k = 2 if k == 1 else k
    print(f"Clustering: \nVariable k in k-means: {k} of {len(feature_coord)} points\n")

    # Initialize k-means and compute clustering
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(feature_coord)

    return k_means

We define a function that sorts the clusters by size and outputs a list of indices of the largest clusters.

[38]:
def get_clusters(k_means, min_cluster_size, top_cluster_number):
    """
    Select clusters from a k-means clustering and returns the indices of the selected clusters.
    The selection is based on cluster size. The largest clusters are chosen.

    Parameters
    -----------
    k_means: sklearn.cluster._kmeans.KMeans
        k-means clustering of feature coordinates
    min_cluster_size: int
        minimum size needed for a cluster to be viable for selection
    top_cluster_number: int
        number of clusters to be selected

    Returns
    ----------
    list of int
        indices of selected clusters
    """

    # Sort clusters by size and only show largest
    feature_labels = k_means.labels_

    feature_labels_count = Counter(feature_labels)

    feature_labels_count = sorted(
        feature_labels_count.items(), key=operator.itemgetter(1), reverse=True
    )
    print(f"Clusters sorted by size: \n{feature_labels_count}\n")

    # Get number of the largest clusters, which are larger then the threshold (selected clusters)
    cluster_indices_sel = []

    for cluster_index, cluster_size in feature_labels_count:
        if cluster_size >= min_cluster_size and top_cluster_number > 0:
            cluster_indices_sel.append(cluster_index)
            top_cluster_number -= 1

    print(f"Cluster indices of selected clusters: \n{cluster_indices_sel}\n")

    return cluster_indices_sel

Cluster features

For each feature type, we perform the k-means clustering with our defined clustering function.

[39]:
k_means = {
    "donors": clustering(features_coord["donors"], kq),
    "acceptors": clustering(features_coord["acceptors"], kq),
    "hydrophobics": clustering(features_coord["hydrophobics"], kq),
}
Clustering:
Variable k in k-means: 2 of 10 points

Clustering:
Variable k in k-means: 4 of 25 points

Clustering:
Variable k in k-means: 2 of 5 points

Select relevant clusters

For each feature type, we select relevant clusters with our defined get_clusters function.

[40]:
print("Hydrogen bond donors\n")
cluster_indices_sel_donors = get_clusters(k_means["donors"], min_cluster_size, top_cluster_number)
Hydrogen bond donors

Clusters sorted by size:
[(1, 6), (0, 4)]

Cluster indices of selected clusters:
[1, 0]

[41]:
print("Hydrogen bond acceptors\n")
cluster_indices_sel_acceptors = get_clusters(
    k_means["acceptors"], min_cluster_size, top_cluster_number
)
Hydrogen bond acceptors

Clusters sorted by size:
[(0, 12), (1, 6), (2, 4), (3, 3)]

Cluster indices of selected clusters:
[0, 1, 2, 3]

[42]:
print("Hydrophobic contacts\n")
cluster_indices_sel_hydrophobic = get_clusters(
    k_means["hydrophobics"], min_cluster_size, top_cluster_number
)
Hydrophobic contacts

Clusters sorted by size:
[(0, 4), (1, 1)]

Cluster indices of selected clusters:
[0]

[43]:
cluster_indices_sel = {
    "donors": cluster_indices_sel_donors,
    "acceptors": cluster_indices_sel_acceptors,
    "hydrophobics": cluster_indices_sel_hydrophobic,
}

Get selected cluster coordinates

[44]:
def get_selected_cluster_center_coords(k_means, cluster_indices_sel):
    """Retrieve cluster center coordinates for selected clusters.
    The selected clusters are defined by their indices.

    Parameters
    -----------
    k_means: sklearn.cluster._kmeans.KMeans
        k-means clustering of feature coordinates
    cluster_indices_sel: list of int
        indices of the chosen clusters

    Returns
    ----------
    two-dimensional list of floats
        cluster center coordinates
    """
    cluster_centers = k_means.cluster_centers_

    # Cast to list and then to pandas Series (for element selection by indices)
    cluster_centers = pd.Series(cluster_centers.tolist())

    # Select cluster centers by indices of selected clusters
    cluster_centers_sel = cluster_centers[cluster_indices_sel]

    return list(cluster_centers_sel)
[45]:
cluster_centers_sel = {
    "donors": get_selected_cluster_center_coords(k_means["donors"], cluster_indices_sel["donors"]),
    "acceptors": get_selected_cluster_center_coords(
        k_means["acceptors"], cluster_indices_sel["acceptors"]
    ),
    "hydrophobics": get_selected_cluster_center_coords(
        k_means["hydrophobics"], cluster_indices_sel["hydrophobics"]
    ),
}
[46]:
cluster_centers_sel["acceptors"]
[46]:
[[-12.567833333333333, 14.66725, -28.191333333333333],
 [-16.466333333333335, 10.624666666666666, -26.801333333333332],
 [-16.076, 17.244500000000002, -22.4005],
 [-15.465, 19.241, -27.253]]

Show clusters

Per feature type, we visualize cluster centers alongside with all molecules and all feature points.

[47]:
def visualize_clusters(
    molecules,
    features,
    clusters,
    feature_type="features",
    color="yellow",
    feature_radius=0.5,
    cluster_radius=1,
):
    """Visualize the cluster centers.

    Parameters
    -----------
     molecules: list of rdkit.Chem.rdchem.Mol
        molecules to be visualized
    features: list of tuples of rdkit.Chem.rdMolChemicalFeatures.MolChemicalFeature
        extracted features from molecules chosen to be highlighted
    clusters: Two-dimensional list of floats
        cluster center coordinates
    feature_type: string, optional
        feature type being visualized
    color: string, optional
        color used for rendering the features and cluster centers
    feature_radius: float, optional
        rendering size of the features
    cluster_radius: float, optional
        rendering size of the cluster centers

    Returns
    ----------
    nglview.widget.NGLWidget
    """
    view = visualize_features(
        molecules,
        features,
        feature_type,
        color=color,
        sphere_radius=feature_radius,
    )
    for i, center in enumerate(clusters, 1):
        view.shape.add_sphere(list(center), color, cluster_radius, f"cluster_{feature_type}_{i}")
    return view

Hydrogen bond donors

[48]:
feature_type = "donors"
view = visualize_clusters(
    molecules,
    features[feature_type],
    cluster_centers_sel[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of donors in all ligands: 10
[51]:
view.render_image(trim=True, factor=2, transparent=True);
[52]:
view._display_image()
[52]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_94_0.png

Hydrogen bond acceptor

[53]:
feature_type = "acceptors"
view = visualize_clusters(
    molecules,
    features[feature_type],
    cluster_centers_sel[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of acceptors in all ligands: 25
[54]:
view.render_image(trim=True, factor=2, transparent=True);
[55]:
view._display_image()
[55]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_98_0.png

Hydrophobic contacts

[56]:
feature_type = "hydrophobics"
view = visualize_clusters(
    molecules,
    features[feature_type],
    cluster_centers_sel[feature_type],
    feature_type,
    feature_colors[feature_type],
)
view
Number of hydrophobics in all ligands: 5
[58]:
view.render_image(trim=True, factor=2, transparent=True);
[59]:
view._display_image()
[59]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_102_0.png

Show ensemble pharmacophore

In this last step, we combine the clustered pharmacophoric features (i.e. hydrogen bond donors and acceptors as well as hydrophobic contacts), to one ensemble pharmacophore, representing the pharmacophoric properties of the four selected ligands.

[60]:
view = show_ligands(molecules)
# Load clusters
for feature_type in cluster_indices_sel.keys():
    centers = cluster_centers_sel[feature_type]
    for i, loc in enumerate(centers):
        sphere_radius = 1
        feature_color = feature_colors[feature_type]
        label = f"{feature_type}_c{i}"
        view.shape.add_sphere(loc, feature_color, sphere_radius, label)
view
[61]:
view.render_image(trim=True, factor=2, transparent=True);
[62]:
view._display_image()
[62]:
../_images/talktorials_T009_compound_ensemble_pharmacophores_106_0.png

Discussion

In this talktorial, we used a set of pre-aligned ligands, which are known to bind EGFR, to generate an ensemble pharmacophore model. This model could now be used for virtual screening against a large library of small molecules, in order to find novel small molecules that show the observed steric and physico-chemical properties and might therefore also bind to the EGFR binding site.

Before screening, the pharmacophore models are usually further optimized, e.g. features might be omitted in order to reduce the number of features for screening based on biological knowledge (some interaction might be reported as important whereas others are not) or based on chemical expertise.

We do not cover the virtual screening in this talktorial, however refer to an excellent tutorial by Nik Stiefl, demonstrating pharmacophore modeling and virtual screening with RDKit (RDKit UGM 2016 on GitHub).

We used k-means clustering to cluster pharmacophore feature. This clustering approach has the disadvantage that the user needs to define the number of clusters beforehand, which is usually based on a visual inspection of the point distribution before clustering (or during cluster refinement) and is therefore hindering for an automated pharmacophore generation. Density-based clustering methods, also in combination with k-means clustering, can be a solution for this. For further reading we recommend the introduction to Density-Based Clustering at Domino Data Lab.

Quiz

  1. Explain the terms pharmacophoric features and pharmacophore.

  2. Explain the difference between structure- and ligand-based pharmacophore modeling.

  3. Explain how we derived an ensemble pharmacophore.