T007 · Ligand-based screening: machine learning¶
Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.
Authors:
Jan Philipp Albrecht, CADD seminar 2018, Charité/FU Berlin
Jacob Gora, CADD seminar 2018, Charité/FU Berlin
Talia B. Kimber, 2019-2020, Volkamer lab
Andrea Volkamer, 2019-2020, Volkamer lab
Talktorial T007: This talktorial is part of the TeachOpenCADD pipeline described in the first TeachOpenCADD paper, comprising talktorials T001-T010.
Aim of this talktorial¶
Thanks to the growing amount of available bioactivity data, machine learning (ML) has gained momentum in drug discovery, especially in ligand-based virtual screening. In this talktorial, we learn how to use different supervised ML algorithms to predict the activity of novel compounds against our target of interest (EGFR).
Contents in Theory¶
Data preparation: Molecule encoding
Machine learning (ML)
Supervised learning
Model validation and evaluation
Validation strategy: K-fold cross-validation
Performance measures
Contents in Practical¶
Load compound and activity data
Data preparation
Data labeling
Molecule encoding
Machine learning
Helper functions
Random forest classifier
Support vector classifier
Neural network classifier
Cross-validation
References¶
“Fingerprints in the RDKit” slides, G. Landrum, RDKit UGM 2012
Extended-connectivity fingerprints (ECFPs): Rogers, David, and Mathew Hahn. “Extended-connectivity fingerprints.” Journal of Chemical Information and Modeling 50.5 (2010): 742-754.
Machine learning (ML):
Random forest (RF): Breiman, L. “Random Forests”. Machine Learning 45, 5–32 (2001).
Support vector machines (SVM): Cortes, C., Vapnik, V. “Support-vector networks”. Machine Learning 20, 273–297 (1995).
Artificial neural networks (ANN): Van Gerven, Marcel, and Sander Bohte. “Artificial neural networks as models of neural information processing.” Frontiers in Computational Neuroscience 11 (2017): 114.
Performance:
Activity cutoff \(pIC_{50} = 6.3\) used in this talktorial, taken from: “Profiling Prediction of Kinase Inhibitors: Toward the Virtual Assay”, J. Med. Chem. (2017), 60, 474-485
GitHub notebook by B. Merget accompanying this publication
Theory¶
To successfully apply ML, we need a large data set of molecules, a molecular encoding, a label per molecule in the data set, and a ML algorithm to train a model. Then, we can make predictions for new molecules.
Figure 1: Machine learning overview: Molecular encoding, label, ML algorithm, prediction. Figure by Andrea Volkamer.
Data preparation: Molecule encoding¶
For ML, molecules need to be converted into a list of features; molecular fingerprints are commonly used as such a representation.
The fingerprints used in this talktorial, as implemented in RDKit (more information can be found in a presentation by G. Landrum), are:
MACCS keys: ‘MACCS keys are 166 bit structural key descriptors in which each bit is associated with a SMARTS pattern.’ (see OpenEye’s MACCS docs)
Morgan fingerprints (and ECFP): ‘Extended-Connectivity Fingerprints (ECFPs) are circular topological fingerprints designed for molecular characterization, similarity searching, and structure-activity modeling.’ (see ChemAxon’s ECFP docs) The original implementation of the ECFPs was done in Pipeline Pilot, which is not open source. Instead, we use the RDKit implementation, called the Morgan fingerprint. Its two most important parameters are the radius and the fingerprint length: the radius specifies the size of the circular atom neighborhoods considered for each atom (here we use radii 2 and 3), and the length specifies the size of the bit string to which the fingerprint is hashed (the default is 2048). A short sketch of how both fingerprints are generated with RDKit follows below.
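To illustrate these two fingerprint types, the following minimal sketch shows how they can be generated with RDKit; the aspirin SMILES string is only a hypothetical example molecule, and the practical part below wraps the same calls in a reusable helper function.
import numpy as np
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator
# Hypothetical example molecule (aspirin)
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
# MACCS keys: fixed-length structural keys (167 bits in RDKit, bit 0 is unused)
maccs_fp = np.array(MACCSkeys.GenMACCSKeys(mol))
# Morgan fingerprint with radius 2, hashed to 2048 bits (comparable to ECFP4)
morgan_generator = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=2048)
morgan_fp = np.array(morgan_generator.GetFingerprint(mol))
print(len(maccs_fp), len(morgan_fp))  # 167 and 2048 bits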
Machine learning (ML)¶
ML can be applied for (text adapted from scikit-learn page):
Classification (supervised): Identify which category an object belongs to (e.g. nearest neighbors, Naive Bayes, RF, SVM, …)
Regression (supervised): Prediction of a continuous-valued attribute associated with an object
Clustering (unsupervised): Automated grouping of similar objects into sets (see also Talktorial T005)
Supervised learning¶
A learning algorithm creates rules by finding patterns in the training data.
Random Forest (RF): Ensemble of decision trees. A single decision tree splits the features of the input vector in a way that maximizes an objective function. In the random forest algorithm, the grown trees are de-correlated because the features considered for each split are drawn from a random subset of all features.
Support Vector Machines (SVMs): SVMs can efficiently perform a non-linear classification using what is called the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. The classifier is based on the idea of maximizing the margin as the objective function.
Artificial neural networks (ANNs): An ANN is based on a collection of connected units or nodes called artificial neurons which loosely model the neurons in a biological brain. Each connection, like the synapses in a biological brain, can transmit a signal from one artificial neuron to another. An artificial neuron that receives a signal can process it and then signal additional artificial neurons connected to it.
Figure 2: Example of a neural network with one hidden layer. Figure taken from Wikipedia.
Model validation and evaluation¶
Validation strategy: K-fold cross-validation¶
This model validation technique splits the dataset into two groups in an iterative manner:
Training data set: Considered as the known dataset on which the model is trained
Test dataset: Unknown dataset on which the model is then tested
The process is repeated k times, each time using a different fold as the test set
The goal is to test the ability of the model to predict data which it has never seen before in order to flag problems known as over-fitting and to assess the generalization ability of the model.
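The crossvalidation helper defined in the practical part implements this procedure for our models. As a bare illustration of the splitting logic itself, here is a minimal sketch using scikit-learn's KFold on a toy array (the data is purely illustrative).
import numpy as np
from sklearn.model_selection import KFold
# Toy data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
# 5-fold split with shuffled indices
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_index, test_index) in enumerate(kf.split(X)):
    # In each iteration, a model would be trained on the training indices
    # and evaluated on the held-out test indices
    print(f"Fold {fold}: train={train_index}, test={test_index}")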
Performance measures¶
Sensitivity, also true positive rate
TPR = TP/(FN + TP)
Intuitively: Out of all actual positives, how many were predicted as positive?
Specificity, also true negative rate
TNR = TN/(FP + TN)
Intuitively: Out of all actual negatives, how many were predicted as negative?
Accuracy, also the trueness
ACC = (TP + TN)/(TP + TN + FP + FN)
Intuitively: Proportion of correct predictions.
ROC-curve, receiver operating characteristic curve
A graphical plot that illustrates the diagnostic ability of our classifier
Plots the true positive rate (sensitivity) against the false positive rate (1 - specificity)
AUC, the area under the ROC curve:
Describes the probability that a classifier will rank a randomly chosen positive instance higher than a negative one
Values between 0 and 1, the higher the better
What the model predicts | True active | True inactive
---|---|---
active | True Positive (TP) | False Positive (FP)
inactive | False Negative (FN) | True Negative (TN)
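In the practical part, these measures are computed by the model_performance helper. As a stand-alone illustration, the following minimal sketch applies the corresponding scikit-learn functions to toy labels and scores (all values are made up for illustration).
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
y_true = [1, 1, 1, 0, 0, 0]  # actual labels (1 = active, 0 = inactive)
y_pred = [1, 1, 0, 0, 0, 1]  # predicted labels
y_score = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6]  # predicted probabilities for the active class
sensitivity = recall_score(y_true, y_pred)  # TP / (TP + FN)
specificity = recall_score(y_true, y_pred, pos_label=0)  # TN / (TN + FP)
accuracy = accuracy_score(y_true, y_pred)  # (TP + TN) / all predictions
auc = roc_auc_score(y_true, y_score)  # area under the ROC curve
print(f"Sensitivity: {sensitivity:.2f}, Specificity: {specificity:.2f}, "
      f"Accuracy: {accuracy:.2f}, AUC: {auc:.2f}")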
Practical¶
[1]:
from pathlib import Path
from warnings import filterwarnings
import time
import pandas as pd
import numpy as np
from sklearn import svm, metrics, clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import KFold, train_test_split
from sklearn.metrics import auc, accuracy_score, recall_score
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt
from rdkit import Chem
from rdkit.Chem import MACCSkeys, rdFingerprintGenerator
from teachopencadd.utils import seed_everything
# Silence some expected warnings
filterwarnings("ignore")
# Fix seed for reproducible results
SEED = 22
seed_everything(SEED)
[2]:
# Set path to this notebook
HERE = Path(_dh[-1])
DATA = HERE / "data"
Load compound and activity data¶
Let’s start by loading our data, which focuses on the epidermal growth factor receptor (EGFR) kinase. The CSV file from Talktorial T002 is loaded into a dataframe with the important columns:
CHEMBL-ID
SMILES string of the corresponding compound
Measured affinity: pIC50
[3]:
# Read data from previous talktorials
chembl_df = pd.read_csv(
HERE / "../T002_compound_adme/data/EGFR_compounds_lipinski.csv",
index_col=0,
)
# Look at head
print("Shape of dataframe : ", chembl_df.shape)
chembl_df.head()
# NBVAL_CHECK_OUTPUT
Shape of dataframe : (4635, 10)
[3]:
molecule_chembl_id | IC50 | units | smiles | pIC50 | molecular_weight | n_hba | n_hbd | logp | ro5_fulfilled | |
---|---|---|---|---|---|---|---|---|---|---|
0 | CHEMBL63786 | 0.003 | nM | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 | 349.021459 | 3 | 1 | 5.2891 | True |
1 | CHEMBL35820 | 0.006 | nM | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 | 387.058239 | 5 | 1 | 4.9333 | True |
2 | CHEMBL53711 | 0.006 | nM | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 | 343.043258 | 5 | 1 | 3.5969 | True |
3 | CHEMBL66031 | 0.008 | nM | Brc1cccc(Nc2ncnc3cc4[nH]cnc4cc23)c1 | 11.096910 | 339.011957 | 4 | 2 | 4.0122 | True |
4 | CHEMBL53753 | 0.008 | nM | CNc1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.096910 | 329.027607 | 5 | 2 | 3.5726 | True |
[4]:
# Keep only the columns we want
chembl_df = chembl_df[["molecule_chembl_id", "smiles", "pIC50"]]
chembl_df.head()
# NBVAL_CHECK_OUTPUT
[4]:
molecule_chembl_id | smiles | pIC50 | |
---|---|---|---|
0 | CHEMBL63786 | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 |
1 | CHEMBL35820 | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 |
2 | CHEMBL53711 | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 |
3 | CHEMBL66031 | Brc1cccc(Nc2ncnc3cc4[nH]cnc4cc23)c1 | 11.096910 |
4 | CHEMBL53753 | CNc1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.096910 |
Data preparation¶
Data labeling¶
We need to classify each compound as active or inactive. Therefore, we use the pIC50 value.
\(pIC_{50} = -\log_{10}(IC_{50})\), with the \(IC_{50}\) given as a molar concentration; for example, an \(IC_{50}\) of 0.003 nM \(= 3 \times 10^{-12}\) M corresponds to a \(pIC_{50}\) of about 11.52 (the first compound in the table above).
The IC50 value describes the concentration of a substance needed to inhibit a given process by 50% in vitro.
A common cut-off value to discretize pIC50 data is 6.3, which we will use for our experiment (refer to J. Med. Chem. (2017), 60, 474-485 and the corresponding notebook).
Note that the literature contains several other suggestions for an activity cut-off, ranging from a pIC50 value of 5 to 7, or even the definition of an exclusion range within which data points are discarded.
[5]:
# Add column for activity
chembl_df["active"] = np.zeros(len(chembl_df))
# Mark every molecule as active (1.0) if its pIC50 is >= 6.3, 0 otherwise
chembl_df.loc[chembl_df[chembl_df.pIC50 >= 6.3].index, "active"] = 1.0
# NBVAL_CHECK_OUTPUT
print("Number of active compounds:", int(chembl_df.active.sum()))
print("Number of inactive compounds:", len(chembl_df) - int(chembl_df.active.sum()))
Number of active compounds: 2631
Number of inactive compounds: 2004
[6]:
chembl_df.head()
# NBVAL_CHECK_OUTPUT
[6]:
molecule_chembl_id | smiles | pIC50 | active | |
---|---|---|---|---|
0 | CHEMBL63786 | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 | 1.0 |
1 | CHEMBL35820 | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 | 1.0 |
2 | CHEMBL53711 | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 | 1.0 |
3 | CHEMBL66031 | Brc1cccc(Nc2ncnc3cc4[nH]cnc4cc23)c1 | 11.096910 | 1.0 |
4 | CHEMBL53753 | CNc1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.096910 | 1.0 |
Molecule encoding¶
Now we define a function smiles_to_fp to generate fingerprints from SMILES strings. For now, the function supports the following fingerprint types:
maccs
morgan2 and morgan3
[7]:
def smiles_to_fp(smiles, method="maccs", n_bits=2048):
"""
Encode a molecule from a SMILES string into a fingerprint.
Parameters
----------
smiles : str
The SMILES string defining the molecule.
method : str
The type of fingerprint to use. Default is MACCS keys.
n_bits : int
The length of the fingerprint.
Returns
-------
array
The fingerprint array.
"""
# convert smiles to RDKit mol object
mol = Chem.MolFromSmiles(smiles)
if method == "maccs":
return np.array(MACCSkeys.GenMACCSKeys(mol))
if method == "morgan2":
fpg = rdFingerprintGenerator.GetMorganGenerator(radius=2, fpSize=n_bits)
return np.array(fpg.GetFingerprint(mol))
if method == "morgan3":
fpg = rdFingerprintGenerator.GetMorganGenerator(radius=3, fpSize=n_bits)
return np.array(fpg.GetFingerprint(mol))
else:
# NBVAL_CHECK_OUTPUT
print(f"Warning: Wrong method specified: {method}. Default will be used instead.")
return np.array(MACCSkeys.GenMACCSKeys(mol))
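As a quick sanity check, we can apply smiles_to_fp to a single molecule and compare the resulting fingerprint lengths (the aspirin SMILES below is a hypothetical example and not part of our EGFR data set):
# Hypothetical example molecule (aspirin), not part of the EGFR data set
example_smiles = "CC(=O)Oc1ccccc1C(=O)O"
maccs_fp = smiles_to_fp(example_smiles)  # default: MACCS keys
morgan_fp = smiles_to_fp(example_smiles, method="morgan3")  # 2048-bit Morgan fingerprint
print(len(maccs_fp), len(morgan_fp))  # 167 and 2048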
[8]:
compound_df = chembl_df.copy()
[9]:
# Add column for fingerprint
compound_df["fp"] = compound_df["smiles"].apply(smiles_to_fp)
compound_df.head(3)
# NBVAL_CHECK_OUTPUT
[9]:
molecule_chembl_id | smiles | pIC50 | active | fp | |
---|---|---|---|---|---|
0 | CHEMBL63786 | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1 | CHEMBL35820 | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
2 | CHEMBL53711 | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
Machine Learning (ML)¶
In the following, we will try several ML approaches to classify our molecules. We will use:
Random Forest (RF)
Support Vector Machine (SVM)
Artificial Neural Network (ANN)
Additionally, we will comment on the results.
The goal is to test the ability of the model to predict data which it has never seen before, in order to flag problems known as overfitting and to assess the generalization ability of the model.
We start by defining a function model_training_and_validation, which fits a model on a random train-test split of the data and returns measures such as accuracy, sensitivity, specificity, and AUC evaluated on the test set. We also plot the ROC curves using plot_roc_curves_for_models.
We then define a function named crossvalidation, which executes a cross-validation procedure and prints the statistics of the results over the folds.
Helper functions¶
Helper function to plot customized ROC curves. Code inspired by stackoverflow.
[10]:
def plot_roc_curves_for_models(models, test_x, test_y, save_png=False):
"""
Helper function to plot customized roc curve.
Parameters
----------
models: dict
Dictionary of pretrained machine learning models.
test_x: list
Molecular fingerprints for test set.
test_y: list
Associated activity labels for test set.
save_png: bool
Save image to disk (default = False)
Returns
-------
fig:
Figure.
"""
fig, ax = plt.subplots()
# Below for loop iterates through your models list
for model in models:
# Select the model
ml_model = model["model"]
# Prediction probability on test set
test_prob = ml_model.predict_proba(test_x)[:, 1]
# Prediction class on test set
test_pred = ml_model.predict(test_x)
# Compute False postive rate and True positive rate
fpr, tpr, thresholds = metrics.roc_curve(test_y, test_prob)
# Calculate Area under the curve to display on the plot
auc = roc_auc_score(test_y, test_prob)
# Plot the computed values
ax.plot(fpr, tpr, label=(f"{model['label']} AUC area = {auc:.2f}"))
# Custom settings for the plot
ax.plot([0, 1], [0, 1], "r--")
ax.set_xlabel("False Positive Rate")
ax.set_ylabel("True Positive Rate")
ax.set_title("Receiver Operating Characteristic")
ax.legend(loc="lower right")
# Save plot
if save_png:
fig.savefig(f"{DATA}/roc_auc", dpi=300, bbox_inches="tight", transparent=True)
return fig
Helper function to calculate model performance.
[11]:
def model_performance(ml_model, test_x, test_y, verbose=True):
"""
Helper function to calculate model performance
Parameters
----------
ml_model: sklearn model object
The machine learning model to train.
test_x: list
Molecular fingerprints for test set.
test_y: list
Associated activity labels for test set.
verbose: bool
Print performance measure (default = True)
Returns
-------
tuple:
Accuracy, sensitivity, specificity, auc on test set.
"""
# Prediction probability on test set
test_prob = ml_model.predict_proba(test_x)[:, 1]
# Prediction class on test set
test_pred = ml_model.predict(test_x)
# Performance of model on test set
accuracy = accuracy_score(test_y, test_pred)
sens = recall_score(test_y, test_pred)
spec = recall_score(test_y, test_pred, pos_label=0)
auc = roc_auc_score(test_y, test_prob)
if verbose:
# Print performance results
# NBVAL_CHECK_OUTPUT print(f"Accuracy: {accuracy:.2}")
print(f"Sensitivity: {sens:.2f}")
print(f"Specificity: {spec:.2f}")
print(f"AUC: {auc:.2f}")
return accuracy, sens, spec, auc
Helper function to fit a machine learning model on a random train-test split of the data and return the performance measures.
[12]:
def model_training_and_validation(ml_model, name, splits, verbose=True):
"""
Fit a machine learning model on a random train-test split of the data
and return the performance measures.
Parameters
----------
ml_model: sklearn model object
The machine learning model to train.
name: str
Name of machine learning algorithm: RF, SVM, ANN
splits: list
List of descriptor and label data: train_x, test_x, train_y, test_y.
verbose: bool
Print performance info (default = True)
Returns
-------
tuple:
Accuracy, sensitivity, specificity, auc on test set.
"""
train_x, test_x, train_y, test_y = splits
# Fit the model
ml_model.fit(train_x, train_y)
# Calculate model performance results
accuracy, sens, spec, auc = model_performance(ml_model, test_x, test_y, verbose)
return accuracy, sens, spec, auc
Preprocessing: Split the data (will be reused for the other models)
[13]:
fingerprint_to_model = compound_df.fp.tolist()
label_to_model = compound_df.active.tolist()
# Split data randomly in train and test set
# note that we use test/train_x for the respective fingerprint splits
# and test/train_y for the respective label splits
(
static_train_x,
static_test_x,
static_train_y,
static_test_y,
) = train_test_split(fingerprint_to_model, label_to_model, test_size=0.2, random_state=SEED)
splits = [static_train_x, static_test_x, static_train_y, static_test_y]
# NBVAL_CHECK_OUTPUT
print("Training data size:", len(static_train_x))
print("Test data size:", len(static_test_x))
Training data size: 3708
Test data size: 927
Random forest classifier¶
We start with a random forest classifier, where we first set the parameters.
We train the model on a random train-test split and plot the results.
[14]:
# Set model parameter for random forest
param = {
"n_estimators": 100, # number of trees to grows
"criterion": "entropy", # cost function to be optimized for a split
}
model_RF = RandomForestClassifier(**param)
[15]:
# Fit model on single split
performance_measures = model_training_and_validation(model_RF, "RF", splits)
Sensitivity: 0.90
Specificity: 0.77
AUC: 0.90
[16]:
# Initialize the list that stores all models. First one is RF.
models = [{"label": "Model_RF", "model": model_RF}]
# Plot roc curve
plot_roc_curves_for_models(models, static_test_x, static_test_y);
Support vector classifier¶
Here we train an SVM with a radial basis function (RBF) kernel (also known as the squared-exponential kernel). For more information, see sklearn RBF kernel.
[17]:
# Specify model
model_SVM = svm.SVC(kernel="rbf", C=1, gamma=0.1, probability=True)
# Fit model on single split
performance_measures = model_training_and_validation(model_SVM, "SVM", splits)
Sensitivity: 0.92
Specificity: 0.74
AUC: 0.90
[18]:
# Append SVM model
models.append({"label": "Model_SVM", "model": model_SVM})
# Plot roc curve
plot_roc_curves_for_models(models, static_test_x, static_test_y);
Neural network classifier¶
The last approach we try here is a neural network model. We train an MLPClassifier (multi-layer perceptron classifier) with two hidden layers: the first with 5 neurons and the second with 3 neurons. As before, we fit the model on a single train-test split and plot the results. For more information on MLPs, see sklearn MLPClassifier.
[19]:
# Specify model
model_ANN = MLPClassifier(hidden_layer_sizes=(5, 3), random_state=SEED)
# Fit model on single split
performance_measures = model_training_and_validation(model_ANN, "ANN", splits)
Sensitivity: 0.85
Specificity: 0.73
AUC: 0.86
[20]:
# Append ANN model
models.append({"label": "Model_ANN", "model": model_ANN})
# Plot roc curve
plot_roc_curves_for_models(models, static_test_x, static_test_y, True);
Our models show very good results on all measures (see the AUC values) and thus seem to be predictive.
Cross-validation¶
Next, we will perform cross-validation experiments with the three different models. For this purpose, we define a helper function for machine learning model training and validation in a cross-validation loop.
[21]:
def crossvalidation(ml_model, df, n_folds=5, verbose=False):
"""
Machine learning model training and validation in a cross-validation loop.
Parameters
----------
ml_model: sklearn model object
The machine learning model to train.
df: pd.DataFrame
Data set with SMILES and their associated activity labels.
n_folds: int, optional
Number of folds for cross-validation.
verbose: bool, optional
Performance measures are printed.
Returns
-------
None
"""
t0 = time.time()
# Shuffle the indices for the k-fold cross-validation
kf = KFold(n_splits=n_folds, shuffle=True, random_state=SEED)
# Results for each of the cross-validation folds
acc_per_fold = []
sens_per_fold = []
spec_per_fold = []
auc_per_fold = []
# Loop over the folds
for train_index, test_index in kf.split(df):
# clone model -- we want a fresh copy per fold!
fold_model = clone(ml_model)
# Training
# Convert the fingerprint and the label to a list
train_x = df.iloc[train_index].fp.tolist()
train_y = df.iloc[train_index].active.tolist()
# Fit the model
fold_model.fit(train_x, train_y)
# Testing
# Convert the fingerprint and the label to a list
test_x = df.iloc[test_index].fp.tolist()
test_y = df.iloc[test_index].active.tolist()
# Performance for each fold
accuracy, sens, spec, auc = model_performance(fold_model, test_x, test_y, verbose)
# Save results
acc_per_fold.append(accuracy)
sens_per_fold.append(sens)
spec_per_fold.append(spec)
auc_per_fold.append(auc)
# Print statistics of results
print(
f"Mean accuracy: {np.mean(acc_per_fold):.2f} \t"
f"and std : {np.std(acc_per_fold):.2f} \n"
f"Mean sensitivity: {np.mean(sens_per_fold):.2f} \t"
f"and std : {np.std(sens_per_fold):.2f} \n"
f"Mean specificity: {np.mean(spec_per_fold):.2f} \t"
f"and std : {np.std(spec_per_fold):.2f} \n"
f"Mean AUC: {np.mean(auc_per_fold):.2f} \t"
f"and std : {np.std(auc_per_fold):.2f} \n"
f"Time taken : {time.time() - t0:.2f}s\n"
)
return acc_per_fold, sens_per_fold, spec_per_fold, auc_per_fold
We now apply cross-validation and show the statistics for all three ML models. In real-world applications, cross-validation usually uses 5 or more folds, but for the sake of runtime we reduce the number to 3 here. You can change the value of N_FOLDS in the cell below.
[22]:
N_FOLDS = 3
Note: The next cell takes a while to execute.
[23]:
for model in models:
print("\n======= ")
print(f"{model['label']}")
crossvalidation(model["model"], compound_df, n_folds=N_FOLDS)
=======
Model_RF
Mean accuracy: 0.83 and std : 0.01
Mean sensitivity: 0.88 and std : 0.02
Mean specificity: 0.76 and std : 0.01
Mean AUC: 0.89 and std : 0.01
Time taken : 1.32s
=======
Model_SVM
Mean accuracy: 0.84 and std : 0.01
Mean sensitivity: 0.90 and std : 0.01
Mean specificity: 0.76 and std : 0.00
Mean AUC: 0.89 and std : 0.01
Time taken : 12.12s
=======
Model_ANN
Mean accuracy: 0.81 and std : 0.00
Mean sensitivity: 0.86 and std : 0.01
Mean specificity: 0.73 and std : 0.01
Mean AUC: 0.87 and std : 0.00
Time taken : 4.31s
Next, we look at the cross-validation performance for molecules encoded with the Morgan fingerprint instead of MACCS keys.
[24]:
# Reset data frame
compound_df = chembl_df.copy()
[25]:
# Use Morgan fingerprint with radius 3
compound_df["fp"] = compound_df["smiles"].apply(smiles_to_fp, args=("morgan3",))
compound_df.head(3)
# NBVAL_CHECK_OUTPUT
[25]:
molecule_chembl_id | smiles | pIC50 | active | fp | |
---|---|---|---|---|---|
0 | CHEMBL63786 | Brc1cccc(Nc2ncnc3cc4ccccc4cc23)c1 | 11.522879 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
1 | CHEMBL35820 | CCOc1cc2ncnc(Nc3cccc(Br)c3)c2cc1OCC | 11.221849 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
2 | CHEMBL53711 | CN(C)c1cc2c(Nc3cccc(Br)c3)ncnc2cn1 | 11.221849 | 1.0 | [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... |
Note: The next cell takes a while to execute.
[26]:
for model in models:
if model["label"] == "Model_SVM":
# SVM is super slow with long fingerprints
# and will have a performance similar to RF
# We can skip it in this test, but if you want
# to run it, feel free to replace `continue` with `pass`
continue
print("\n=======")
print(model["label"])
crossvalidation(model["model"], compound_df, n_folds=N_FOLDS)
=======
Model_RF
Mean accuracy: 0.86 and std : 0.01
Mean sensitivity: 0.90 and std : 0.01
Mean specificity: 0.81 and std : 0.00
Mean AUC: 0.92 and std : 0.01
Time taken : 5.21s
=======
Model_ANN
Mean accuracy: 0.81 and std : 0.01
Mean sensitivity: 0.84 and std : 0.01
Mean specificity: 0.77 and std : 0.00
Mean AUC: 0.88 and std : 0.01
Time taken : 12.77s
Discussion¶
Which model performed best on our data set and why?
All three models perform (very) well on our data set. The best models are the random forest and the support vector machine, which showed a mean AUC of about 0.9. Our neural network showed slightly lower results.
There might be several reasons why the random forest and support vector machine models performed best. Our data set might be separable into active/inactive compounds with relatively simple tree-based decision rules or with the radial basis function kernel, respectively; in other words, the classification may not require learning a very complex pattern in the fingerprints.
A cause for the slightly poorer performance of the ANN could be that there was simply too little data to train the model on.
Additionally, it is always advisable to have another external validation set for model evaluation.
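A minimal sketch of such a split is shown below; it was not performed in this talktorial, and the hold-out fraction and variable names are purely illustrative.
from sklearn.model_selection import train_test_split
# Reserve 10% of the compounds as an external validation set that is never
# touched during training, model selection, or cross-validation
development_df, external_df = train_test_split(
    compound_df, test_size=0.1, random_state=SEED, stratify=compound_df.active
)
print("Development set size:", len(development_df))
print("External validation set size:", len(external_df))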
Was MACCS the right choice?
MACCS keys were certainly a good choice to start training and validating models and to see whether a classification is possible at all.
However, MACCS keys are rather short (166 bits) compared to other fingerprints, such as the 2048-bit Morgan fingerprint. As shown in the last experiment, a longer fingerprint can help the learning process: all tested models performed slightly better with Morgan fingerprints (see the increase in mean AUC).
Where can we go from here?¶
We successfully trained several models.
The next step could be to use these models to classify an unseen screening data set and predict novel potential EGFR inhibitors.
An example of a large screening data set is MolPort, with over 7 million compounds.
Our models could be used to rank the MolPort compounds and then further study those with the highest predicted probability of being active; a minimal sketch of this workflow is shown below.
For such an application, see also the TDT Tutorial developed by S. Riniker and G. Landrum, where they trained a fusion model to screen eMolecules for new anti-malaria drugs.
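A minimal sketch of this screening workflow follows; the SMILES strings are placeholders rather than actual MolPort compounds, and we reuse smiles_to_fp and the random forest model model_RF trained earlier on MACCS keys.
# Placeholder screening compounds (hypothetical examples, not MolPort data)
screening_smiles = [
    "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
    "c1ccc2ncncc2c1",  # quinazoline core
]
# Encode with the same fingerprint used for training (MACCS keys by default)
screening_fps = [smiles_to_fp(smiles) for smiles in screening_smiles]
# Predicted probability of the "active" class for each compound
probabilities = model_RF.predict_proba(screening_fps)[:, 1]
# Rank compounds by decreasing predicted probability of activity
ranking = sorted(zip(screening_smiles, probabilities), key=lambda pair: pair[1], reverse=True)
for smiles, probability in ranking:
    print(f"{probability:.2f}  {smiles}")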
Quiz¶
How can you apply ML for virtual screening?
Which machine learning algorithms do you know?
What are necessary prerequisites to successfully apply ML?