# T025 · Kinase similarity: Kinase pocket (KiSSim fingerprint)¶

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

## Aim of this talktorial¶

We will assess the similarity between a set of kinases from a structural point of view using the KiSSim fingerprint. This fingerprint describes the physicochemical and spatial properties in structurally resolved kinases.

### Contents in Theory¶

• Kinase dataset

• Kinase similarity descriptor: Kinase pockets (KiSSim fingerprint)

• Fetching KLIFS data with opencadd.databases.klifs

### Contents in Practical¶

• Define the kinases of interest

• Retrieve and preprocess data

• Set up a remote KLIFS session

• Fetch all structures describing these kinases

• Filter structures

• Show kinase coverage

• Compare structures

• Map structure to kinase distance matrix

• Save kinase distance matrix

## Theory¶

### Kinase dataset¶

We use the kinase selection as defined in Talktorial T023.

### Kinase similarity descriptor: Kinase pockets (KiSSim fingerprint)¶

Side effects often occur when a drug binds not only to its designated target (on-target) but also to other targets (off-targets) that share similar binding sites to form similar protein-ligand interaction patterns. Many binding site comparison tools have been proposed to predict similarities and potential off-targets (see Curr. Comput. Aided Drug Des. (2008), 4, 209-20 and J. Med. Chem. (2016), 9, 4121-51).

Here we use the novel KiSSim (Kinase Structure Similarity) fingerprint, which is based on the KLIFS pocket definition and alignment of $$85$$ pocket residues (see more details in Talktorial T023).

The KiSSim fingerprint encodes each of the $$85$$ residues in the KLIFS binding site with respect to physicochemical and spatial properties (Figure 1). Since all KLIFS pockets are aligned across the structurally covered kinome, we can compare the KiSSim fingerprints bit by bit. Physicochemical properties of each residue include pharmacophoric features, size, side chain orientation, and solvent exposure. Spatial properties describe the distance of each residue to defined important regions in the kinase pocket, for example the hinge region, which forms crucial hydrogen bonds to ligands, or the DFG region, whose conformation determines the activity state of the kinase. See more details on structural motifs in kinases in Talktorial T023.

Figure 1: The KiSSim fingerprint encodes physicochemical and spatial properties of a kinase binding site. Figure taken from: https://github.com/volkamerlab/kissim

### Fetching KLIFS data with opencadd.databases.klifs¶

opencadd is a Python library for structural cheminformatics developed by the Volkamer lab at the Charité in Berlin. This library is a growing collection of modules that help facilitate and standardize common tasks in structural bioinformatics and cheminformatics. Today, we will use the module opencadd.databases.klifs, which allows us to fetch the KLIFS structures as a pandas DataFrame.

## Practical¶

[1]:

from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise
import matplotlib.pyplot as plt
import seaborn as sns

[2]:

HERE = Path(_dh[-1])
DATA = HERE / "data"


### Define the kinases of interest¶

Let’s load the kinase selection as defined in Talktorial T023.

[3]:

kinase_selection_df = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/kinase_selection.csv")
kinase_selection_df
# NBVAL_CHECK_OUTPUT

[3]:

kinase kinase_klifs uniprot_id group full_kinase_name
0 EGFR EGFR P00533 TK Epidermal growth factor receptor
1 ErbB2 ErbB2 P04626 TK Erythroblastic leukemia viral oncogene homolog 2
2 PI3K p110a P42336 Atypical Phosphatidylinositol-3-kinase
3 VEGFR2 KDR P35968 TK Vascular endothelial growth factor receptor 2
4 BRAF BRAF P15056 TKL Rapidly accelerated fibrosarcoma isoform B
5 CDK2 CDK2 P24941 CMGC Cyclic-dependent kinase 2
6 LCK LCK P06239 TK Lymphocyte-specific protein tyrosine kinase
7 MET MET P08581 TK Mesenchymal-epithelial transition factor
8 p38a p38a Q16539 CMGC p38 mitogen activated protein kinase alpha

### Retrieve and preprocess data¶

Now, we query the KLIFS database using the opencadd.databases.klifs module to fetch available structures in KLIFS.

#### Set up a remote KLIFS session¶

[4]:

from opencadd.databases.klifs import setup_remote

[5]:

klifs_session = setup_remote()


#### Fetch all structures describing these kinases¶

[6]:

# Get list of kinase names
kinase_names = kinase_selection_df["kinase_klifs"].to_list()

# Get all available structures for these kinases
structures_df = klifs_session.structures.by_kinase_name(kinase_names=kinase_names)
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())

Number of structures: 2521
Kinases: CDK2 p38a EGFR ErbB2 MET LCK KDR BRAF p110a


Let’s have a look at what is stored in the structures’ DataFrame:

[7]:

structures_df.columns

[7]:

Index(['structure.klifs_id', 'structure.pdb_id', 'structure.alternate_model',
'structure.chain', 'species.klifs', 'kinase.klifs_id',
'kinase.klifs_name', 'kinase.names', 'kinase.family', 'kinase.group',
'structure.pocket', 'ligand.expo_id', 'ligand_allosteric.expo_id',
'ligand.klifs_id', 'ligand_allosteric.klifs_id', 'ligand.name',
'ligand_allosteric.name', 'structure.dfg', 'structure.ac_helix',
'structure.resolution', 'structure.qualityscore',
'structure.missing_residues', 'structure.missing_atoms',
'structure.rmsd1', 'structure.rmsd2', 'interaction.fingerprint',
'structure.front', 'structure.gate', 'structure.back', 'structure.fp_i',
'structure.fp_ii', 'structure.bp_i_a', 'structure.bp_i_b',
'structure.bp_ii_in', 'structure.bp_ii_a_in', 'structure.bp_ii_b_in',
'structure.bp_ii_out', 'structure.bp_ii_b', 'structure.bp_iii',
'structure.bp_iv', 'structure.bp_v', 'structure.grich_distance',
'structure.grich_angle', 'structure.grich_rotation',
'structure.filepath', 'structure.curation_flag'],
dtype='object')

[8]:

structures_df.head()

[8]:

structure.klifs_id structure.pdb_id structure.alternate_model structure.chain species.klifs kinase.klifs_id kinase.klifs_name kinase.names kinase.family kinase.group ... structure.bp_ii_out structure.bp_ii_b structure.bp_iii structure.bp_iv structure.bp_v structure.grich_distance structure.grich_angle structure.grich_rotation structure.filepath structure.curation_flag
0 4304 4bcp B A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 15.5676 53.531101 72.388298 <NA> False
1 4204 3qqj B A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 12.3077 42.122501 61.961201 <NA> False
2 14136 7vdu A A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.4634 49.359100 66.803703 <NA> False
3 14137 7vdu B A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.4634 49.359100 66.803703 <NA> False
4 6571 5if1 B C Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 16.4771 55.897099 85.301102 <NA> False

5 rows × 46 columns

#### Filter structures¶

We filter the structures by different criteria:

• Species: human

• Conformation: DFG-in (the active kinase conformation)

• Resolution: $$\le 3$$ Angström

• Quality score*: $$\ge 6$$

* The KLIFS quality score takes into account the quality of the alignment, as well as the number of missing residues and atoms. A higher score indicates a better structure quality.

[9]:

structures_df = structures_df[
(structures_df["species.klifs"] == "Human")
& (structures_df["structure.dfg"] == "in")
& (structures_df["structure.resolution"] <= 3)
& (structures_df["structure.qualityscore"] >= 6)
]
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())

Number of structures: 1655
Kinases: CDK2 p38a EGFR ErbB2 MET LCK KDR BRAF p110a


Save the structure KLIFS IDs for the next step.

[10]:

structure_klifs_ids = structures_df["structure.klifs_id"].to_list()
print(f"Number of structures: {len(structure_klifs_ids)}")

Number of structures: 1655


To make it easier for us to maintain the talktorials, we will now load a set of frozen structure KLIFS IDs (2021-08-23) and continue to work with those. If you would like to work with the latest KLIFS data, please skip the cell below (or change cell type from “Code” > “Raw”).

[11]:

# Load frozen dataset
"structure.klifs_id"
].to_list()
structures_df = structures_df[structures_df["structure.klifs_id"].isin(structure_klifs_ids)].copy()
print(f"Number of structures: {structures_df.shape[0]}")
# NBVAL_CHECK_OUTPUT

Number of structures: 1620


### Show kinase coverage¶

Let’s get the number of structures that describe our kinases (kinase coverage).

[12]:

# Use pandas' groupby method to count the number of structures (rows) per kinase
n_structures_per_kinase = structures_df.groupby("kinase.klifs_name").size().sort_values()
n_structures_per_kinase
# NBVAL_CHECK_OUTPUT

[12]:

kinase.klifs_name
ErbB2      4
KDR        6
LCK       32
p110a     49
BRAF      72
MET       99
p38a     151
EGFR     357
CDK2     850
dtype: int64


Let’s plot the results.

[13]:

fig, ax = plt.subplots()
n_structures_per_kinase.plot(kind="barh", ax=ax)
ax.set_xlabel("Number of structures")
ax.set_ylabel("Kinase name")
for i, value in enumerate(n_structures_per_kinase):
ax.text(value, i, str(value), va="center")
# NBVAL_CHECK_OUTPUT


We see that our dataset is highly imbalanced. While some kinases are structurally resolved very often, other kinases are not. We will have to keep this in mind when interpreting our results later.

We will use in this notebook pre-calculated KiSSim fingerprints for our kinase set. If you wanted to calculate the fingerprints yourself, you could use the following kissim API:

# Generate fingerprints
from kissim.api import encode
kissim_fingerprints = encode(structure_klifs_ids, n_cores=1)

# Save fingerprints in csv file
structure_klifs_ids = list(kissim_fingerprints.data.keys())
kissim_fingerprints_array = [
fingerprint.values_array().tolist()
for structure_klifs_id, fingerprint
in kissim_fingerprints.data.items()
]
kissim_fingerprints_array = np.array(kissim_fingerprints_array)
kissim_fingerprints_df = pd.DataFrame(kissim_fingerprints_array, index=structure_klifs_ids)
kissim_fingerprints_df.to_csv(DATA / "kissim_fingerprints.csv")


Let’s load the pre-calculated KiSSim fingerprints!

[14]:

kissim_fingerprints_df = pd.read_csv(DATA / "kissim_fingerprints.csv", index_col=0)
print(f"Matrix shape: {kissim_fingerprints_df.shape}")
print(f"Number of fingerprints: {kissim_fingerprints_df.shape[0]}")
print(f"Number of fingerprint bits: {kissim_fingerprints_df.shape[1]}")
# NBVAL_CHECK_OUTPUT

Matrix shape: (1611, 1032)
Number of fingerprints: 1611
Number of fingerprint bits: 1032


You might notice that we have slightly fewer fingerprints than structures. This can happen during the kissim fingerprint generation, if a structure cannot be encoded.

Let’s have a look at the fingerprint DataFrame.

• Kinase structures are stored by KLIFS ID (index)

• Each kinase is represented by $$1032$$ features, i.e. $$85$$ residues * $$8$$ physicochem. features + $$85$$ residues * $$4$$ distances + $$12$$ moments

[15]:

kissim_fingerprints_df.head()
# NBVAL_CHECK_OUTPUT

[15]:

0 1 2 3 4 5 6 7 8 9 ... 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031
6285 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.150351 11.958837 4.717011 4.843444 4.655707 3.577213 2.771821 4.302192 3.583341 2.066700
10568 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.069152 11.883944 4.691527 5.006221 4.679352 3.531177 2.714736 4.165350 3.549843 2.138838
11187 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.297023 11.991511 4.590040 5.141397 4.699467 3.625989 2.549692 4.442117 3.699695 2.261646
4060 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 12.910837 11.775556 4.359330 4.844833 4.214195 3.383812 2.699580 3.860920 3.161863 2.185979
10566 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.196581 12.115342 4.701176 4.690081 4.683674 3.633418 2.489983 3.972552 3.692501 0.759234

5 rows × 1032 columns

### Compare structures¶

Let’s make a pairwise comparison of the KiSSim fingerprints. We use the nan_euclidean_distances method of sklearn, which calculates the Euclidean distance between all pairwise vectors.

If two vectors have NaN values (which can happen if residues are not resolved in a structure), the following procedure is applied:

When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates.

[16]:

structure_distance_matrix_array = pairwise.nan_euclidean_distances(kissim_fingerprints_df.values)

[17]:

# Create DataFrame with structure KLIFS IDs as index/columns
structure_klifs_ids = kissim_fingerprints_df.index.to_list()
structure_distance_matrix_df = pd.DataFrame(
structure_distance_matrix_array, index=structure_klifs_ids, columns=structure_klifs_ids
)
print(f"Structure distance matrix size: {structure_distance_matrix_df.shape}")
print("Show matrix subset:")
structure_distance_matrix_df.iloc[:5, :5]
# NBVAL_CHECK_OUTPUT

Structure distance matrix size: (1611, 1611)
Show matrix subset:

[17]:

6285 10568 11187 4060 10566
6285 0.000000 13.256941 14.001474 26.391543 14.307291
10568 13.256941 0.000000 10.379779 27.882193 16.833932
11187 14.001474 10.379779 0.000000 30.962221 18.338492
4060 26.391543 27.882193 30.962221 0.000000 28.905189
10566 14.307291 16.833932 18.338492 28.905189 0.000000

### Map structure to kinase distance matrix¶

Note: So far we compared individual structures, but we want to compare kinases (which can be represented by several structures, as plotted above).

First, as an intermediate step, we will use the structure distance matrix but — instead of labeling the data with structure KLIFS IDs — we use the corresponding kinase name.

[18]:

# Copy distance matrix to kinase matrix
kinase_distance_matrix_df = structure_distance_matrix_df.copy()
# Replace structure KLIFS IDs with the structures' kinase names
kinase_names = structures_df.set_index("structure.klifs_id").loc[
structure_klifs_ids, "kinase.klifs_name"
]
kinase_distance_matrix_df.index = kinase_names
kinase_distance_matrix_df.columns = kinase_names
print("Show matrix subset:")
kinase_distance_matrix_df.iloc[:5, :5]
# NBVAL_CHECK_OUTPUT

Show matrix subset:

[18]:

kinase.klifs_name CDK2 CDK2 CDK2 CDK2 CDK2
kinase.klifs_name
CDK2 0.000000 13.256941 14.001474 26.391543 14.307291
CDK2 13.256941 0.000000 10.379779 27.882193 16.833932
CDK2 14.001474 10.379779 0.000000 30.962221 18.338492
CDK2 26.391543 27.882193 30.962221 0.000000 28.905189
CDK2 14.307291 16.833932 18.338492 28.905189 0.000000

In this talktorial, we will consider per kinase pair the two structures that show the most similar pockets. Hence, we select the structure pair with the minimum distance as representative for a kinase pair.

[19]:

# We unstack the matrix (each pairwise comparison in a single row)
# We group by kinase names (level=[0, 1] ensures that the order of the kinases is ignored
# We take the minimum value in each kinase pair group
# We unstack the remaining data points
kinase_distance_matrix_df = (
kinase_distance_matrix_df.unstack().groupby(level=[0, 1]).min().unstack(level=1)
)
# Cosmetics: Remove the index and column names
kinase_distance_matrix_df.index.name = None
kinase_distance_matrix_df.columns.name = None

[20]:

print(
f"Structure matrix of shape {structure_distance_matrix_df.shape} "
f"reduced to kinase matrix of shape {kinase_distance_matrix_df.shape}."
)
# NBVAL_CHECK_OUTPUT

Structure matrix of shape (1611, 1611) reduced to kinase matrix of shape (9, 9).

[21]:

# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)

[21]:

BRAF CDK2 EGFR ErbB2 KDR LCK MET p110a p38a
BRAF 0.000 17.156 19.515 21.383 21.089 21.583 20.297 37.611 21.731
CDK2 17.156 0.000 18.147 21.106 19.880 17.973 18.304 36.780 19.481
EGFR 19.515 18.147 0.000 16.392 17.282 16.467 17.498 36.046 22.128
ErbB2 21.383 21.106 16.392 0.000 23.851 23.881 22.563 41.277 24.682
KDR 21.089 19.880 17.282 23.851 0.000 19.255 20.431 41.104 20.263
LCK 21.583 17.973 16.467 23.881 19.255 0.000 19.221 39.022 22.457
MET 20.297 18.304 17.498 22.563 20.431 19.221 0.000 39.414 21.983
p110a 37.611 36.780 36.046 41.277 41.104 39.022 39.414 0.000 38.530
p38a 21.731 19.481 22.128 24.682 20.263 22.457 21.983 38.530 0.000

Note: Since this is a distance matrix, lighter colors indicate similarity, darker colors dissimilarity.

### Save kinase distance matrix¶

[22]:

kinase_distance_matrix_df.to_csv(DATA / "kinase_distance_matrix.csv")


## Discussion¶

In this talktorial, we have assessed kinase similarity using the KiSSim fingerprints, which describe physicochemical and spatial properties of pockets in kinase structures. We have reduced a structure distance matrix to a kinase distance matrix by selecting for each kinase pair the corresponding closest structure pair.

We have to keep two elements in mind:

• Pocket fingerprints like KiSSim encode the full pocket although some residues might be more important for ligand binding than others. As an alternative, one could instead use a subset of residues that have been shown to frequently interact with co-crystallized ligands.

• We only compare here the two closest structures per kinase pair, although we have — at least for kinases such as EGFR and CDK2 — much more structural data available. Aggregating multiple structures is a reasonable strategy but has two disadvantages:

• Some kinases have much higher coverage than others, leading to an imbalance in information content.

• Aggregated kinase fingerprints may cause too much averaging, making pairwise kinase comparison noisier.

The kinase distance matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket structure perspective we have talked about in this talktorial.

## Quiz¶

1. Can you think of reasons why it is important to include physicochemical and spatial properties in a fingerprint?

2. Can you think of a reason why the side chain orientation and solvent exposure may be interesting features to consider in a fingerprint?

3. Can you think of research questions for which you could make use of the structure distance matrix?