T025 · Kinase similarity: Kinase pocket (KiSSim fingerprint)

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

Aim of this talktorial

We will assess the similarity between a set of kinases from a structural point of view using the KiSSim fingerprint. This fingerprint describes the physicochemical and spatial properties in structurally resolved kinases.

Note: We focus on similarities between orthosteric kinase binding sites; similarities to allosteric binding sites are not covered.

Contents in Theory

  • Kinase dataset

  • Kinase similarity descriptor: Kinase pockets (KiSSim fingerprint)

  • Fetching KLIFS data with opencadd.databases.klifs

Contents in Practical

  • Define the kinases of interest

  • Retrieve and preprocess data

    • Set up a remote KLIFS session

    • Fetch all structures describing these kinases

    • Filter structures

  • Show kinase coverage

  • Calculate KiSSim fingerprints

  • Compare structures

  • Map structure to kinase distance matrix

  • Save kinase distance matrix

References

Theory

Kinase dataset

We use the kinase selection as defined in Talktorial T023.

Kinase similarity descriptor: Kinase pockets (KiSSim fingerprint)

Side effects often occur when a drug binds not only to its designated target (on-target) but also to other targets (off-targets) that share similar binding sites to form similar protein-ligand interaction patterns. Many binding site comparison tools have been proposed to predict similarities and potential off-targets (see Curr. Comput. Aided Drug Des. (2008), 4, 209-20 and J. Med. Chem. (2016), 9, 4121-51).

Here we use the novel KiSSim (Kinase Structure Similarity) fingerprint, which is based on the KLIFS pocket definition and alignment of \(85\) pocket residues (see more details in Talktorial T023).

The KiSSim fingerprint encodes each of the \(85\) residues in the KLIFS binding site with respect to physicochemical and spatial properties (Figure 1). Since all KLIFS pockets are aligned across the structurally covered kinome, we can compare the KiSSim fingerprints bit by bit. Physicochemical properties of each residue include pharmacophoric features, size, side chain orientation, and solvent exposure. Spatial properties describe the distance of each residue to defined important regions in the kinase pocket, for example the hinge region, which forms crucial hydrogen bonds to ligands, or the DFG region, whose conformation determines the activity state of the kinase. See more details on structural motifs in kinases in Talktorial T023.

KiSSim fingerprint

Figure 1: The KiSSim fingerprint encodes physicochemical and spatial properties of a kinase binding site. Figure taken from: https://github.com/volkamerlab/kissim

Fetching KLIFS data with opencadd.databases.klifs

opencadd is a Python library for structural cheminformatics developed by the Volkamer lab at the Charité in Berlin. This library is a growing collection of modules that help facilitate and standardize common tasks in structural bioinformatics and cheminformatics. Today, we will use the module opencadd.databases.klifs, which allows us to fetch the KLIFS structures as a pandas DataFrame.

For more information about this library and the KLIFS OpenAPI, please refer to Talktorial T012.

Practical

[1]:
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import pairwise
import matplotlib.pyplot as plt
import seaborn as sns
from opencadd.databases.klifs import setup_remote
import kissim
[2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"
[3]:
configs = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/pipeline_configs.csv")
configs = configs.set_index("variable")["default_value"]

DEMO = bool(int(configs["DEMO"]))
N_STRUCTURES_PER_KINASE = int(configs["N_STRUCTURES_PER_KINASE"])
N_CORES = int(configs["N_CORES"])

print(f"Run in demo mode: {DEMO}")
if not DEMO:
    if N_STRUCTURES_PER_KINASE > 0:
        print(f"Number of structures per kinase: {N_STRUCTURES_PER_KINASE}")
    else:
        print(f"Number of structures per kinase: all available structures")
    print(f"Number of cores used: {N_CORES}")

# NBVAL_CHECK_OUTPUT
Run in demo mode: True

Define the kinases of interest

Let’s load the kinase selection as defined in Talktorial T023.

[4]:
kinase_selection_df = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/kinase_selection.csv")
kinase_selection_df
# NBVAL_CHECK_OUTPUT
[4]:
kinase kinase_klifs uniprot_id group full_kinase_name
0 EGFR EGFR P00533 TK Epidermal growth factor receptor
1 ErbB2 ErbB2 P04626 TK Erythroblastic leukemia viral oncogene homolog 2
2 PI3K p110a P42336 Atypical Phosphatidylinositol-3-kinase
3 VEGFR2 KDR P35968 TK Vascular endothelial growth factor receptor 2
4 BRAF BRAF P15056 TKL Rapidly accelerated fibrosarcoma isoform B
5 CDK2 CDK2 P24941 CMGC Cyclic-dependent kinase 2
6 LCK LCK P06239 TK Lymphocyte-specific protein tyrosine kinase
7 MET MET P08581 TK Mesenchymal-epithelial transition factor
8 p38a p38a Q16539 CMGC p38 mitogen activated protein kinase alpha

Retrieve and preprocess data

Now, we query the KLIFS database using the opencadd.databases.klifs module to fetch available structures in KLIFS.

Set up a remote KLIFS session

[5]:
from opencadd.databases.klifs import setup_remote
[6]:
klifs_session = setup_remote()

Fetch all structures describing these kinases

[7]:
# Get list of kinase names
kinase_names = kinase_selection_df["kinase_klifs"].to_list()

# Get all available structures for these kinases
structures_df = klifs_session.structures.by_kinase_name(kinase_names=kinase_names)
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())
Number of structures: 2523
Kinases: CDK2 p38a EGFR ErbB2 MET LCK KDR BRAF p110a

Let’s have a look at what is stored in the structures’ DataFrame:

[8]:
structures_df.columns
[8]:
Index(['structure.klifs_id', 'structure.pdb_id', 'structure.alternate_model',
       'structure.chain', 'species.klifs', 'kinase.klifs_id',
       'kinase.klifs_name', 'kinase.names', 'kinase.family', 'kinase.group',
       'structure.pocket', 'ligand.expo_id', 'ligand_allosteric.expo_id',
       'ligand.klifs_id', 'ligand_allosteric.klifs_id', 'ligand.name',
       'ligand_allosteric.name', 'structure.dfg', 'structure.ac_helix',
       'structure.resolution', 'structure.qualityscore',
       'structure.missing_residues', 'structure.missing_atoms',
       'structure.rmsd1', 'structure.rmsd2', 'interaction.fingerprint',
       'structure.front', 'structure.gate', 'structure.back', 'structure.fp_i',
       'structure.fp_ii', 'structure.bp_i_a', 'structure.bp_i_b',
       'structure.bp_ii_in', 'structure.bp_ii_a_in', 'structure.bp_ii_b_in',
       'structure.bp_ii_out', 'structure.bp_ii_b', 'structure.bp_iii',
       'structure.bp_iv', 'structure.bp_v', 'structure.grich_distance',
       'structure.grich_angle', 'structure.grich_rotation',
       'structure.filepath', 'structure.curation_flag'],
      dtype='object')
[9]:
structures_df.head()
[9]:
structure.klifs_id structure.pdb_id structure.alternate_model structure.chain species.klifs kinase.klifs_id kinase.klifs_name kinase.names kinase.family kinase.group ... structure.bp_ii_out structure.bp_ii_b structure.bp_iii structure.bp_iv structure.bp_v structure.grich_distance structure.grich_angle structure.grich_rotation structure.filepath structure.curation_flag
0 3978 1gy3 - A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 17.5289 58.732300 75.622704 <NA> False
1 4020 4fkg A A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.3067 49.563202 70.460602 <NA> False
2 4021 3ig7 - A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.5012 50.189201 77.397400 <NA> False
3 4540 1hcl B A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.9689 51.229698 63.422600 <NA> False
4 4022 2r3k C A Human 198 CDK2 <NA> <NA> <NA> ... False False False False False 14.3124 50.132500 73.646698 <NA> False

5 rows × 46 columns

Filter structures

We filter the structures by different criteria:

  • Species: human

  • Conformation: DFG-in (the active kinase conformation)

  • Resolution: \(\le 3\) Angström

  • Quality score*: \(\ge 6\)

* The KLIFS quality score takes into account the quality of the alignment, as well as the number of missing residues and atoms. A higher score indicates a better structure quality.

[10]:
structures_df = structures_df[
    (structures_df["species.klifs"] == "Human")
    & (structures_df["structure.dfg"] == "in")
    & (structures_df["structure.resolution"] <= 3)
    & (structures_df["structure.qualityscore"] >= 6)
]
print(f"Number of structures: {len(structures_df)}")
print("Kinases:", *structures_df["kinase.klifs_name"].unique())
Number of structures: 1657
Kinases: CDK2 p38a EGFR ErbB2 MET LCK KDR BRAF p110a

Save the structure KLIFS IDs for the next step.

[11]:
structure_klifs_ids = structures_df["structure.klifs_id"].to_list()
print(f"Number of structures: {len(structure_klifs_ids)}")
Number of structures: 1657

Note for demo mode: To make it easier for us to maintain the talktorials, we will now load a set of frozen structure KLIFS IDs (2021-08-23) and continue to work with those.

Note for non-demo mode: Did you specify N_STRUCTURES_PER_KINASE in the configuration file? If you e.g. set a value of 3, we will select in the following the top 3 structures per kinase in terms of resolution and KLIFS quality score.

[12]:
if DEMO:
    # Load frozen dataset
    print("Notebook is run in demo mode - load frozen structure set.")
    structure_klifs_ids = pd.read_csv(DATA / "frozen_structure_klifs_ids.csv")[
        "structure.klifs_id"
    ].to_list()
    structures_df = structures_df[
        structures_df["structure.klifs_id"].isin(structure_klifs_ids)
    ].copy()
else:
    if N_STRUCTURES_PER_KINASE > 0:
        print(f"Select {N_STRUCTURES_PER_KINASE} structures per kinase for downstream analysis.")
        # Sort structures by kinase and quality
        structures_df = structures_df.sort_values(
            by=["kinase.klifs_name", "structure.resolution", "structure.qualityscore"],
            ascending=[True, True, False],
        )
        # Reduce number of structures per kinase
        structures_df = structures_df.groupby("kinase.klifs_name").head(N_STRUCTURES_PER_KINASE)
        structure_klifs_ids = structures_df["structure.klifs_id"].to_list()
    else:
        print(f"Use all available structures per kinase for downstream analysis.")

print(f"Number of structures: {structures_df.shape[0]}")
# NBVAL_CHECK_OUTPUT
Notebook is run in demo mode - load frozen structure set.
Number of structures: 1620

Show kinase coverage

Let’s get the number of structures that describe our kinases (kinase coverage).

[13]:
# Use pandas' groupby method to count the number of structures (rows) per kinase
n_structures_per_kinase = structures_df.groupby("kinase.klifs_name").size().sort_values()
n_structures_per_kinase
# NBVAL_CHECK_OUTPUT
[13]:
kinase.klifs_name
ErbB2      4
KDR        6
LCK       32
p110a     49
BRAF      72
MET       99
p38a     151
EGFR     357
CDK2     850
dtype: int64

Let’s plot the results.

[14]:
fig, ax = plt.subplots()
n_structures_per_kinase.plot(kind="barh", ax=ax)
ax.set_xlabel("Number of structures")
ax.set_ylabel("Kinase name")
for i, value in enumerate(n_structures_per_kinase):
    ax.text(value, i, str(value), va="center")
../_images/talktorials_T025_kinase_similarity_kissim_37_0.png

We see that our dataset is highly imbalanced. While some kinases are structurally resolved very often, other kinases are not. We will have to keep this in mind when interpreting our results later.

Calculate KiSSim fingerprints

We use the kissim API to encode our structures as KiSSim fingerprints and save the fingerprints as CSV file.

Note for demo mode: We use pre-calculated KiSSim fingerprints for our kinase set (i.e. the next code cell will be skipped).

[15]:
if DEMO:
    print("Notebook is run in demo mode - we will use pre-calculated fingerprints.")
else:
    print("Calculate and save KiSSim fingerprints...")
    # Calculate fingerprints
    from kissim.api import encode

    kissim_fingerprints = encode(structure_klifs_ids, n_cores=N_CORES)

    # Save fingerprints in csv file
    structure_klifs_ids = list(kissim_fingerprints.data.keys())
    kissim_fingerprints_array = [
        fingerprint.values_array().tolist()
        for structure_klifs_id, fingerprint in kissim_fingerprints.data.items()
    ]
    kissim_fingerprints_array = np.array(kissim_fingerprints_array)
    kissim_fingerprints_df = pd.DataFrame(kissim_fingerprints_array, index=structure_klifs_ids)
    kissim_fingerprints_df.to_csv(DATA / "kissim_fingerprints.csv")

# NBVAL_CHECK_OUTPUT
Notebook is run in demo mode - we will use pre-calculated fingerprints.

Let’s load the KiSSim fingerprints from the CSV file.

[16]:
kissim_fingerprints_df = pd.read_csv(DATA / "kissim_fingerprints.csv", index_col=0)
print(f"Matrix shape: {kissim_fingerprints_df.shape}")
print(f"Number of fingerprints: {kissim_fingerprints_df.shape[0]}")
print(f"Number of fingerprint bits: {kissim_fingerprints_df.shape[1]}")
# NBVAL_CHECK_OUTPUT
Matrix shape: (1611, 1032)
Number of fingerprints: 1611
Number of fingerprint bits: 1032

You might notice that we have slightly fewer fingerprints than structures. This can happen during the kissim fingerprint generation, if a structure cannot be encoded.

Let’s have a look at the fingerprint DataFrame.

  • Kinase structures are stored by KLIFS ID (index)

  • Each kinase is represented by \(1032\) features, i.e. \(85\) residues * \(8\) physicochem. features + \(85\) residues * \(4\) distances + \(12\) moments

[17]:
kissim_fingerprints_df.head()
# NBVAL_CHECK_OUTPUT
[17]:
0 1 2 3 4 5 6 7 8 9 ... 1022 1023 1024 1025 1026 1027 1028 1029 1030 1031
6285 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.150351 11.958837 4.717011 4.843444 4.655707 3.577213 2.771821 4.302192 3.583341 2.066700
10568 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.069152 11.883944 4.691527 5.006221 4.679352 3.531177 2.714736 4.165350 3.549843 2.138838
11187 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.297023 11.991511 4.590040 5.141397 4.699467 3.625989 2.549692 4.442117 3.699695 2.261646
4060 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 12.910837 11.775556 4.359330 4.844833 4.214195 3.383812 2.699580 3.860920 3.161863 2.185979
10566 2.0 0.0 2.0 -1.0 0.0 0.0 2.0 3.0 2.0 1.0 ... 13.196581 12.115342 4.701176 4.690081 4.683674 3.633418 2.489983 3.972552 3.692501 0.759234

5 rows × 1032 columns

Compare structures

Let’s make a pairwise comparison of the KiSSim fingerprints. We use the nan_euclidean_distances method of sklearn, which calculates the Euclidean distance between all pairwise vectors.

If two vectors have NaN values (which can happen if residues are not resolved in a structure), the following procedure is applied:

When calculating the distance between a pair of samples, this formulation ignores feature coordinates with a missing value in either sample and scales up the weight of the remaining coordinates.

Please find more information in the sklearn documentation.

[18]:
structure_distance_matrix_array = pairwise.nan_euclidean_distances(kissim_fingerprints_df.values)
[19]:
# Create DataFrame with structure KLIFS IDs as index/columns
structure_klifs_ids = kissim_fingerprints_df.index.to_list()
structure_distance_matrix_df = pd.DataFrame(
    structure_distance_matrix_array, index=structure_klifs_ids, columns=structure_klifs_ids
)
print(f"Structure distance matrix size: {structure_distance_matrix_df.shape}")
print("Show matrix subset:")
structure_distance_matrix_df.iloc[:5, :5]
# NBVAL_CHECK_OUTPUT
Structure distance matrix size: (1611, 1611)
Show matrix subset:
[19]:
6285 10568 11187 4060 10566
6285 0.000000 13.256941 14.001474 26.391543 14.307291
10568 13.256941 0.000000 10.379779 27.882193 16.833932
11187 14.001474 10.379779 0.000000 30.962221 18.338492
4060 26.391543 27.882193 30.962221 0.000000 28.905189
10566 14.307291 16.833932 18.338492 28.905189 0.000000

Map structure to kinase distance matrix

Note: So far we compared individual structures, but we want to compare kinases (which can be represented by several structures, as plotted above).

First, as an intermediate step, we will use the structure distance matrix but — instead of labeling the data with structure KLIFS IDs — we use the corresponding kinase name.

[20]:
# Copy distance matrix to kinase matrix
kinase_distance_matrix_df = structure_distance_matrix_df.copy()
# Replace structure KLIFS IDs with the structures' kinase names
kinase_names = structures_df.set_index("structure.klifs_id").loc[
    structure_klifs_ids, "kinase.klifs_name"
]
kinase_distance_matrix_df.index = kinase_names
kinase_distance_matrix_df.columns = kinase_names
print("Show matrix subset:")
kinase_distance_matrix_df.iloc[:5, :5]
# NBVAL_CHECK_OUTPUT
Show matrix subset:
[20]:
kinase.klifs_name CDK2 CDK2 CDK2 CDK2 CDK2
kinase.klifs_name
CDK2 0.000000 13.256941 14.001474 26.391543 14.307291
CDK2 13.256941 0.000000 10.379779 27.882193 16.833932
CDK2 14.001474 10.379779 0.000000 30.962221 18.338492
CDK2 26.391543 27.882193 30.962221 0.000000 28.905189
CDK2 14.307291 16.833932 18.338492 28.905189 0.000000

In this talktorial, we will consider per kinase pair the two structures that show the most similar pockets. Hence, we select the structure pair with the minimum distance as representative for a kinase pair.

[21]:
# We unstack the matrix (each pairwise comparison in a single row)
# We group by kinase names (level=[0, 1] ensures that the order of the kinases is ignored
# We take the minimum value in each kinase pair group
# We unstack the remaining data points
kinase_distance_matrix_df = (
    kinase_distance_matrix_df.unstack().groupby(level=[0, 1]).min().unstack(level=1)
)
# Cosmetics: Remove the index and column names
kinase_distance_matrix_df.index.name = None
kinase_distance_matrix_df.columns.name = None
[22]:
print(
    f"Structure matrix of shape {structure_distance_matrix_df.shape} "
    f"reduced to kinase matrix of shape {kinase_distance_matrix_df.shape}."
)
# NBVAL_CHECK_OUTPUT
Structure matrix of shape (1611, 1611) reduced to kinase matrix of shape (9, 9).
[23]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")
[23]:
  BRAF CDK2 EGFR ErbB2 KDR LCK MET p110a p38a
BRAF 0.000 17.156 19.515 21.383 21.089 21.583 20.297 37.611 21.731
CDK2 17.156 0.000 18.147 21.106 19.880 17.973 18.304 36.780 19.481
EGFR 19.515 18.147 0.000 16.392 17.282 16.467 17.498 36.046 22.128
ErbB2 21.383 21.106 16.392 0.000 23.851 23.881 22.563 41.277 24.682
KDR 21.089 19.880 17.282 23.851 0.000 19.255 20.431 41.104 20.263
LCK 21.583 17.973 16.467 23.881 19.255 0.000 19.221 39.022 22.457
MET 20.297 18.304 17.498 22.563 20.431 19.221 0.000 39.414 21.983
p110a 37.611 36.780 36.046 41.277 41.104 39.022 39.414 0.000 38.530
p38a 21.731 19.481 22.128 24.682 20.263 22.457 21.983 38.530 0.000

Note: Since this is a distance matrix, lighter colors indicate similarity, darker colors dissimilarity.

Save kinase distance matrix

[24]:
kinase_distance_matrix_df.to_csv(DATA / "kinase_distance_matrix.csv")

Discussion

In this talktorial, we have assessed kinase similarity using the KiSSim fingerprints, which describe physicochemical and spatial properties of pockets in kinase structures. We have reduced a structure distance matrix to a kinase distance matrix by selecting for each kinase pair the corresponding closest structure pair.

We have to keep two elements in mind:

  • Pocket fingerprints like KiSSim encode the full pocket although some residues might be more important for ligand binding than others. As an alternative, one could instead use a subset of residues that have been shown to frequently interact with co-crystallized ligands.

  • We only compare here the two closest structures per kinase pair, although we have — at least for kinases such as EGFR and CDK2 — much more structural data available. Aggregating multiple structures is a reasonable strategy but has two disadvantages:

    • Some kinases have much higher coverage than others, leading to an imbalance in information content.

    • Aggregated kinase fingerprints may cause too much averaging, making pairwise kinase comparison noisier.

The kinase distance matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket structure perspective we have talked about in this talktorial.

Quiz

  1. Can you think of reasons why it is important to include physicochemical and spatial properties in a fingerprint?

  2. Can you think of a reason why the side chain orientation and solvent exposure may be interesting features to consider in a fingerprint?

  3. Can you think of research questions for which you could make use of the structure distance matrix?