T024 · Kinase similarity: Sequence¶
Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.
Authors:
Talia B. Kimber, 2021, Volkamer lab, Charité
Dominique Sydow, 2021, Volkamer lab, Charité
Andrea Volkamer, 2021, Volkamer lab, Charité
Aim of this talktorial¶
In this talktorial, we investigate sequence similarity for kinases of interest. KLIFS API is used to retrieve the \(85\) residues of the pocket sequence for each kinase.
Two similarity measures are implemented:
Sequence identity, i.e., the similarity which is based on character-wise discrepancy.
Sequence similarity, i.e., the similarity which is based on a substitution matrix, thus, reflecting similarities between amino acids.
Note: We focus on similarities between orthosteric kinase binding sites; similarities to allosteric binding sites are not covered.
Contents in Theory¶
Kinase dataset
Kinase similarity descriptor: Sequence
Identity score
Substitution score
From similarity matrix to distance matrix
Contents in Practical¶
Define the kinases of interest
Retrieve sequences from KLIFS
Sequence similarity
Identity score
Substitution score
Kinase similarity
Visualize similarity as kinase matrix
Save kinase similarity matrix
Kinase distance matrix
Save kinase distance matrix
References¶
Kinase dataset: Molecules (2021), 26(3), 629
KLIFS
KLIFS URL: https://klifs.net/
KLIFS database: Nucleic Acid Res. (2020), 49(D1), D562-D569
Sequence-based kinase clustering: Manning et al. Science (2002), 298(5600), 1912-1934
Substitution matrix: PNAS (1992), 89(22), 10915-10919
Biotite
Documentation: https://www.biotite-python.org/index.html
Sequence logo: http://www.cbs.dtu.dk/biotools/Seq2Logo/
Theory¶
Kinase dataset¶
We use the kinase selection as defined in Talktorial T023.
Kinase similarity descriptor: Sequence¶
As mentioned in the previous talktorial, sequence is often used to assess kinase similarity (see the phylogenetic tree developed by Manning et al. Science (2002), 298(5600), 1912-1934).
In this talktorial, the KLIFS pocket sequence is used for two main reasons:
The sequence is of fixed length (it contains \(85\) residues), which makes computation for pairwise similarity between two sequences straightforward.
The binding pocket is where the action takes place. Why consider the full kinase sequence when an \(85\) residues sequence contains most relevant information?
Figure 1: The sequence logo shows amino acid binding motifs and sequence profiles such as amino acid depletion. For example, the sequence logo easily visualizes the conserved G-rich loop (position 4-9) and the DFG motif (position 81-83), see Talktorial T023 for more details.
Note for reproducibility: The figure is generated using the Seq2Logo tool available at http://www.cbs.dtu.dk/biotools/Seq2Logo/. The input is the \(85\) residues KLIFS binding pocket for the query kinases. All parameters are the the default ones. For the graphical layout, the entry Stacks Per Line is set to \(85\) and Page size to \(1200 \times 480.\)
We now describe two ways to compare pocket sequences.
Identity score¶
A simple way of assessing the similarity between two sequences is to use the so-called identity score. First, a match vector is created: it checks whether for each position the characters from the two sequences are identical. If they are, the entry is set to \(1,\) and \(0\) otherwise.
The identity score is computed by summing the elements in the match vector and normalizing the entry by the length of the sequence, which is \(85\) in the case of KLIFS pocket sequence.
Let’s consider the identity matrix \(I\) below:
A |
C |
D |
E |
… |
|
---|---|---|---|---|---|
A |
1 |
0 |
0 |
0 |
… |
C |
0 |
1 |
0 |
0 |
… |
D |
0 |
0 |
1 |
0 |
… |
E |
0 |
0 |
0 |
1 |
… |
… |
… |
… |
… |
… |
… |
and let \(M = len(K_i) = len(K_j)\) for two kinases \(K_i\) and \(K_j\).
We use the following as similarity between kinases \(K_i\) and \(K_j\):
where \(K_i[n]\) represents the amino acid at position \(n\) of the sequence of kinase \(i\).
Substitution score¶
Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.
The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids. In this talktorial, we use the BLOSUM substitution matrix PNAS (1992), 89(22), 10915-10919, implemented in biotite
.
The BLOSUM substitution matrix \(SM\) is defined as below (the full matrix will be displayed in the Practical part):
A |
C |
D |
E |
… |
|
---|---|---|---|---|---|
A |
4 |
0 |
-2 |
-1 |
… |
C |
0 |
9 |
-3 |
-4 |
… |
D |
-2 |
-3 |
6 |
2 |
… |
E |
-1 |
-4 |
2 |
5 |
… |
… |
… |
… |
… |
… |
… |
The BLOSUM substitution matrix is symmetric, as shown in the practical part.
For convenience, we will translate and rescale the matrix using the following:
for translation, and
for all \(x, y\)
for rescaling, such that
and
We use the following as similarity between kinases \(K_i\) and \(K_j\):
where \(K_i[n]\) represents the amino acid at position \(n\) of the sequence of kinase \(i\) and \(M = len(K_i) = len(K_j)\).
From similarity matrix to distance matrix¶
In order to apply some clustering algorithm to assess the similarity between kinases, it is necessary to start with a distance matrix. A similarity matrix is not, by definition, a distance matrix. For example, the diagonal elements are not zero. For now, we map the similarity matrix \(SM\) to a distance matrix \(DM\) using
See Talktorial T028 for more details.
Practical¶
[1]:
from pathlib import Path
import requests
import pandas as pd
import numpy as np
import seaborn as sns
import biotite.sequence.align as align
[2]:
HERE = Path(_dh[-1])
DATA = HERE / "data"
[3]:
configs = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/pipeline_configs.csv")
configs = configs.set_index("variable")["default_value"]
DEMO = bool(int(configs["DEMO"]))
print(f"Run in demo mode: {DEMO}")
# NBVAL_CHECK_OUTPUT
Run in demo mode: True
Define the kinases of interest¶
Let’s load the kinase selection as defined in Talktorial T023.
[4]:
kinase_selection_df = pd.read_csv(
HERE
/ "../T023_what_is_a_kinase/\
data/kinase_selection.csv"
)
kinase_selection_df
# NBVAL_CHECK_OUTPUT
[4]:
kinase | kinase_klifs | uniprot_id | group | full_kinase_name | |
---|---|---|---|---|---|
0 | EGFR | EGFR | P00533 | TK | Epidermal growth factor receptor |
1 | ErbB2 | ErbB2 | P04626 | TK | Erythroblastic leukemia viral oncogene homolog 2 |
2 | PI3K | p110a | P42336 | Atypical | Phosphatidylinositol-3-kinase |
3 | VEGFR2 | KDR | P35968 | TK | Vascular endothelial growth factor receptor 2 |
4 | BRAF | BRAF | P15056 | TKL | Rapidly accelerated fibrosarcoma isoform B |
5 | CDK2 | CDK2 | P24941 | CMGC | Cyclic-dependent kinase 2 |
6 | LCK | LCK | P06239 | TK | Lymphocyte-specific protein tyrosine kinase |
7 | MET | MET | P08581 | TK | Mesenchymal-epithelial transition factor |
8 | p38a | p38a | Q16539 | CMGC | p38 mitogen activated protein kinase alpha |
Retrieve sequences from KLIFS¶
We use the KLIFS API to retrieve the \(85\)-long pocket sequence for each kinase.
[5]:
def klifs_pocket_sequence(klifs_kinase_name):
"""
Retrieves the pocket sequence from KLIFS using the API.
Parameters
----------
klifs_kinase_name : str
The KLIFS name of the kinase of interest.
Returns
-------
str :
The 85 residues pocket sequence from KLIFS,
if the kinase name is valid, None otherwise.
"""
response = requests.get(
f"https://klifs.net/api/kinase_ID?kinase_name={klifs_kinase_name}&species=HUMAN"
)
if response.status_code == 200:
return response.json()[0]["pocket"]
else:
print(f"KLIFS failed for kinase {klifs_kinase_name}")
return None
We create a dictionary made of the kinase name and its associated pocket sequence. This dictionary is used throughout this notebook.
[6]:
kinase_sequences_dict = {}
for kinase in kinase_selection_df["kinase_klifs"]:
kinase_sequences_dict[kinase] = klifs_pocket_sequence(kinase)
for name, sequence in kinase_sequences_dict.items():
print(f"{name:10}{sequence}")
# NBVAL_CHECK_OUTPUT
EGFR KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
ErbB2 KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA
p110a CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF
KDR KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA
BRAF QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA
CDK2 EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA
LCK ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA
MET EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
p38a SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA
As shown in the cell above, some sequences have missing residues, denoted by “-”. Let’s plot these sequences as heatmap for a quick visual on conserved regions.
[7]:
# Cast dict to DataFrame
kinase_sequences_df = pd.DataFrame(
[list(i) for i in kinase_sequences_dict.values()],
index=kinase_sequences_dict.keys(),
columns=range(1, 86),
)
# Cast letters to integers (gap is set to None)
letter_to_int = {letter: ix for ix, letter in enumerate(list("ACDEFGHIKLMNPQRSTVWY"), 1)}
letter_to_int["-"] = None
kinase_sequences_int_df = kinase_sequences_df.applymap(lambda x: letter_to_int[x])
# Show heatmap (qualitative colormap)
ax = sns.heatmap(kinase_sequences_int_df, cmap="tab20", cbar=False)
Sequence similarity¶
Given two kinases, we create functions which account for identity or substitution similarity, as described in the Theory part.
Identity score¶
We first define a function which compares element-wise characters in two sequences.
[8]:
def identity_score(sequence1, sequence2):
"""
Computes the element-wise binary similarity between two sequences.
Parameters
----------
sequence1 : np.array
An array of characters describing the first sequence.
sequence2 : np.array
An array of characters describing the second sequence.
Returns
-------
np.array
The bool array for each character.
1 if the elements are identical,
0 otherwise.
"""
# True is the character is the same, False otherwise
return np.compare_chararrays(sequence1, sequence2, cmp="==", rstrip=True)
Substitution score¶
We now define the function which is more specific to amino acids grouping and use the biotite
library to retrieve the BLOSUM substitution matrix.
The substitution matrix can be retrieve from biotite
using the following command:
[9]:
substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)
# NBVAL_CHECK_OUTPUT
A C D E F G H I K L M N P Q R S T V W Y B Z X *
A 4 0 -2 -1 -2 0 -2 -1 -1 -1 -1 -2 -1 -1 -1 1 0 0 -3 -2 -2 -1 0 -4
C 0 9 -3 -4 -2 -3 -3 -1 -3 -1 -1 -3 -3 -3 -3 -1 -1 -1 -2 -2 -3 -3 -2 -4
D -2 -3 6 2 -3 -1 -1 -3 -1 -4 -3 1 -1 0 -2 0 -1 -3 -4 -3 4 1 -1 -4
E -1 -4 2 5 -3 -2 0 -3 1 -3 -2 0 -1 2 0 0 -1 -2 -3 -2 1 4 -1 -4
F -2 -2 -3 -3 6 -3 -1 0 -3 0 0 -3 -4 -3 -3 -2 -2 -1 1 3 -3 -3 -1 -4
G 0 -3 -1 -2 -3 6 -2 -4 -2 -4 -3 0 -2 -2 -2 0 -2 -3 -2 -3 -1 -2 -1 -4
H -2 -3 -1 0 -1 -2 8 -3 -1 -3 -2 1 -2 0 0 -1 -2 -3 -2 2 0 0 -1 -4
I -1 -1 -3 -3 0 -4 -3 4 -3 2 1 -3 -3 -3 -3 -2 -1 3 -3 -1 -3 -3 -1 -4
K -1 -3 -1 1 -3 -2 -1 -3 5 -2 -1 0 -1 1 2 0 -1 -2 -3 -2 0 1 -1 -4
L -1 -1 -4 -3 0 -4 -3 2 -2 4 2 -3 -3 -2 -2 -2 -1 1 -2 -1 -4 -3 -1 -4
M -1 -1 -3 -2 0 -3 -2 1 -1 2 5 -2 -2 0 -1 -1 -1 1 -1 -1 -3 -1 -1 -4
N -2 -3 1 0 -3 0 1 -3 0 -3 -2 6 -2 0 0 1 0 -3 -4 -2 3 0 -1 -4
P -1 -3 -1 -1 -4 -2 -2 -3 -1 -3 -2 -2 7 -1 -2 -1 -1 -2 -4 -3 -2 -1 -2 -4
Q -1 -3 0 2 -3 -2 0 -3 1 -2 0 0 -1 5 1 0 -1 -2 -2 -1 0 3 -1 -4
R -1 -3 -2 0 -3 -2 0 -3 2 -2 -1 0 -2 1 5 -1 -1 -3 -3 -2 -1 0 -1 -4
S 1 -1 0 0 -2 0 -1 -2 0 -2 -1 1 -1 0 -1 4 1 -2 -3 -2 0 0 0 -4
T 0 -1 -1 -1 -2 -2 -2 -1 -1 -1 -1 0 -1 -1 -1 1 5 0 -2 -2 -1 -1 0 -4
V 0 -1 -3 -2 -1 -3 -3 3 -2 1 1 -3 -2 -2 -3 -2 0 4 -3 -1 -3 -2 -1 -4
W -3 -2 -4 -3 1 -2 -2 -3 -3 -2 -1 -4 -4 -2 -3 -3 -2 -3 11 2 -4 -3 -2 -4
Y -2 -2 -3 -2 3 -3 2 -1 -2 -1 -1 -2 -3 -1 -2 -2 -2 -1 2 7 -3 -2 -1 -4
B -2 -3 4 1 -3 -1 0 -3 0 -4 -3 3 -2 0 -1 0 -1 -3 -4 -3 4 1 -1 -4
Z -1 -3 1 4 -3 -2 0 -3 1 -3 -1 0 -1 3 0 0 -1 -2 -3 -2 1 4 -1 -4
X 0 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 -1 -1 0 0 -1 -2 -1 -1 -1 -1 -4
* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1
Check for symmetry:
[10]:
substitution_matrix.is_symmetric()
# NBVAL_CHECK_OUTPUT
[10]:
True
Let’s perform the translation-rescaling step we discussed in the Theory part.
[11]:
def _translate_rescale_substitution_matrix(
substitution_matrix=align.SubstitutionMatrix.std_protein_matrix(),
):
"""
Translate and rescale substitution matrix.
Parameters
----------
substitution_matrix : biotite.sequence.align.SubstitutionMatrix
A substitution matrix specific to amino acids.
The default is align.SubstitutionMatrix.std_protein_matrix()
from biotite, which represents BLOSUM62.
Returns
-------
pd.DataFrame
Translated and rescaled substitution matrix as DataFrame
(index/columns contain letters).
"""
# Retrieve np.array from substitution matrix
score_matrix = substitution_matrix.score_matrix()
# Translation of substitution matrix
translated_matrix = score_matrix - np.min(score_matrix)
# Rescaling
normalized_score_matrix = np.zeros(score_matrix.shape)
for i in range(score_matrix.shape[0]):
for j in range(score_matrix.shape[0]):
normalized_score_matrix[i, j] = translated_matrix[i, j] / np.sqrt(
translated_matrix[i, i] * translated_matrix[j, j]
)
# Create DataFrame from matrix with letters as index/column names
normalized_score_matrix = pd.DataFrame(
normalized_score_matrix,
columns=substitution_matrix.get_alphabet1(),
index=substitution_matrix.get_alphabet1(),
)
# Check for symmetry
symmetric = (
normalized_score_matrix.values == normalized_score_matrix.values.transpose()
).all()
if not symmetric:
raise ValueError(f"Translated/rescaled matrix is not symmetric.")
return normalized_score_matrix
Let’s take a look at the translated and rescaled version of the substitution matrix.
[12]:
_translate_rescale_substitution_matrix().style.format("{:.2f}")
[12]:
A | C | D | E | F | G | H | I | K | L | M | N | P | Q | R | S | T | V | W | Y | B | Z | X | * | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 1.00 | 0.39 | 0.22 | 0.35 | 0.22 | 0.45 | 0.20 | 0.38 | 0.35 | 0.38 | 0.35 | 0.22 | 0.32 | 0.35 | 0.35 | 0.62 | 0.47 | 0.50 | 0.09 | 0.21 | 0.25 | 0.38 | 0.82 | 0.00 |
C | 0.39 | 1.00 | 0.09 | 0.00 | 0.18 | 0.09 | 0.08 | 0.29 | 0.09 | 0.29 | 0.28 | 0.09 | 0.08 | 0.09 | 0.09 | 0.29 | 0.28 | 0.29 | 0.14 | 0.17 | 0.10 | 0.10 | 0.32 | 0.00 |
D | 0.22 | 0.09 | 1.00 | 0.63 | 0.10 | 0.30 | 0.27 | 0.11 | 0.32 | 0.00 | 0.11 | 0.50 | 0.29 | 0.42 | 0.21 | 0.45 | 0.32 | 0.11 | 0.00 | 0.10 | 0.89 | 0.56 | 0.55 | 0.00 |
E | 0.35 | 0.00 | 0.63 | 1.00 | 0.11 | 0.21 | 0.38 | 0.12 | 0.56 | 0.12 | 0.22 | 0.42 | 0.30 | 0.67 | 0.44 | 0.47 | 0.33 | 0.24 | 0.09 | 0.20 | 0.59 | 0.94 | 0.58 | 0.00 |
F | 0.22 | 0.18 | 0.10 | 0.11 | 1.00 | 0.10 | 0.27 | 0.45 | 0.11 | 0.45 | 0.42 | 0.10 | 0.00 | 0.11 | 0.11 | 0.22 | 0.21 | 0.34 | 0.41 | 0.67 | 0.11 | 0.11 | 0.55 | 0.00 |
G | 0.45 | 0.09 | 0.30 | 0.21 | 0.10 | 1.00 | 0.18 | 0.00 | 0.21 | 0.00 | 0.11 | 0.40 | 0.19 | 0.21 | 0.21 | 0.45 | 0.21 | 0.11 | 0.16 | 0.10 | 0.34 | 0.22 | 0.55 | 0.00 |
H | 0.20 | 0.08 | 0.27 | 0.38 | 0.27 | 0.18 | 1.00 | 0.10 | 0.29 | 0.10 | 0.19 | 0.46 | 0.17 | 0.38 | 0.38 | 0.31 | 0.19 | 0.10 | 0.15 | 0.52 | 0.41 | 0.41 | 0.50 | 0.00 |
I | 0.38 | 0.29 | 0.11 | 0.12 | 0.45 | 0.00 | 0.10 | 1.00 | 0.12 | 0.75 | 0.59 | 0.11 | 0.11 | 0.12 | 0.12 | 0.25 | 0.35 | 0.88 | 0.09 | 0.32 | 0.12 | 0.12 | 0.61 | 0.00 |
K | 0.35 | 0.09 | 0.32 | 0.56 | 0.11 | 0.21 | 0.29 | 0.12 | 1.00 | 0.24 | 0.33 | 0.42 | 0.30 | 0.56 | 0.67 | 0.47 | 0.33 | 0.24 | 0.09 | 0.20 | 0.47 | 0.59 | 0.58 | 0.00 |
L | 0.38 | 0.29 | 0.00 | 0.12 | 0.45 | 0.00 | 0.10 | 0.75 | 0.24 | 1.00 | 0.71 | 0.11 | 0.11 | 0.24 | 0.24 | 0.25 | 0.35 | 0.62 | 0.18 | 0.32 | 0.00 | 0.12 | 0.61 | 0.00 |
M | 0.35 | 0.28 | 0.11 | 0.22 | 0.42 | 0.11 | 0.19 | 0.59 | 0.33 | 0.71 | 1.00 | 0.21 | 0.20 | 0.44 | 0.33 | 0.35 | 0.33 | 0.59 | 0.26 | 0.30 | 0.12 | 0.35 | 0.58 | 0.00 |
N | 0.22 | 0.09 | 0.50 | 0.42 | 0.10 | 0.40 | 0.46 | 0.11 | 0.42 | 0.11 | 0.21 | 1.00 | 0.19 | 0.42 | 0.42 | 0.56 | 0.42 | 0.11 | 0.00 | 0.19 | 0.78 | 0.45 | 0.55 | 0.00 |
P | 0.32 | 0.08 | 0.29 | 0.30 | 0.00 | 0.19 | 0.17 | 0.11 | 0.30 | 0.11 | 0.20 | 0.19 | 1.00 | 0.30 | 0.20 | 0.32 | 0.30 | 0.21 | 0.00 | 0.09 | 0.21 | 0.32 | 0.35 | 0.00 |
Q | 0.35 | 0.09 | 0.42 | 0.67 | 0.11 | 0.21 | 0.38 | 0.12 | 0.56 | 0.24 | 0.44 | 0.42 | 0.30 | 1.00 | 0.56 | 0.47 | 0.33 | 0.24 | 0.17 | 0.30 | 0.47 | 0.82 | 0.58 | 0.00 |
R | 0.35 | 0.09 | 0.21 | 0.44 | 0.11 | 0.21 | 0.38 | 0.12 | 0.67 | 0.24 | 0.33 | 0.42 | 0.20 | 0.56 | 1.00 | 0.35 | 0.33 | 0.12 | 0.09 | 0.20 | 0.35 | 0.47 | 0.58 | 0.00 |
S | 0.62 | 0.29 | 0.45 | 0.47 | 0.22 | 0.45 | 0.31 | 0.25 | 0.47 | 0.25 | 0.35 | 0.56 | 0.32 | 0.47 | 0.35 | 1.00 | 0.59 | 0.25 | 0.09 | 0.21 | 0.50 | 0.50 | 0.82 | 0.00 |
T | 0.47 | 0.28 | 0.32 | 0.33 | 0.21 | 0.21 | 0.19 | 0.35 | 0.33 | 0.35 | 0.33 | 0.42 | 0.30 | 0.33 | 0.33 | 0.59 | 1.00 | 0.47 | 0.17 | 0.20 | 0.35 | 0.35 | 0.77 | 0.00 |
V | 0.50 | 0.29 | 0.11 | 0.24 | 0.34 | 0.11 | 0.10 | 0.88 | 0.24 | 0.62 | 0.59 | 0.11 | 0.21 | 0.24 | 0.12 | 0.25 | 0.47 | 1.00 | 0.09 | 0.32 | 0.12 | 0.25 | 0.61 | 0.00 |
W | 0.09 | 0.14 | 0.00 | 0.09 | 0.41 | 0.16 | 0.15 | 0.09 | 0.09 | 0.18 | 0.26 | 0.00 | 0.00 | 0.17 | 0.09 | 0.09 | 0.17 | 0.09 | 1.00 | 0.47 | 0.00 | 0.09 | 0.30 | 0.00 |
Y | 0.21 | 0.17 | 0.10 | 0.20 | 0.67 | 0.10 | 0.52 | 0.32 | 0.20 | 0.32 | 0.30 | 0.19 | 0.09 | 0.30 | 0.20 | 0.21 | 0.20 | 0.32 | 0.47 | 1.00 | 0.11 | 0.21 | 0.52 | 0.00 |
B | 0.25 | 0.10 | 0.89 | 0.59 | 0.11 | 0.34 | 0.41 | 0.12 | 0.47 | 0.00 | 0.12 | 0.78 | 0.21 | 0.47 | 0.35 | 0.50 | 0.35 | 0.12 | 0.00 | 0.11 | 1.00 | 0.62 | 0.61 | 0.00 |
Z | 0.38 | 0.10 | 0.56 | 0.94 | 0.11 | 0.22 | 0.41 | 0.12 | 0.59 | 0.12 | 0.35 | 0.45 | 0.32 | 0.82 | 0.47 | 0.50 | 0.35 | 0.25 | 0.09 | 0.21 | 0.62 | 1.00 | 0.61 | 0.00 |
X | 0.82 | 0.32 | 0.55 | 0.58 | 0.55 | 0.55 | 0.50 | 0.61 | 0.58 | 0.61 | 0.58 | 0.55 | 0.35 | 0.58 | 0.58 | 0.82 | 0.77 | 0.61 | 0.30 | 0.52 | 0.61 | 0.61 | 1.00 | 0.00 |
* | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
Let’s define a function that calculates the substitution scores between two sequences (will make use of the previously defined function).
[13]:
def substitution_score(
sequence1, sequence2, substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()
):
"""
Retrieve the match score given the substitution matrix.
Parameters
----------
sequence1 : np.array
An array of characters describing the first sequence.
sequence2 : np.array
An array of characters describing the second sequence.
substitution_matrix : biotite.sequence.align.SubstitutionMatrix
A substitution matrix specific to amino acids.
The default is align.SubstitutionMatrix.std_protein_matrix()
from biotite, which represents BLOSUM62.
Returns
-------
np.array
The vector of match score
using the normalized substitution matrix.
"""
substitution_matrix_df = _translate_rescale_substitution_matrix(substitution_matrix)
match_score_array = np.zeros(len(sequence1))
for i, (character_seq1, character_seq2) in enumerate(zip(sequence1, sequence2)):
match_score_array[i] = substitution_matrix_df.loc[character_seq1, character_seq2]
return match_score_array
Kinase similarity¶
Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution.
[14]:
def sequence_similarity(sequence_1, sequence_2, type_="identity"):
"""
Compares two sequences using a given metric.
Parameters
----------
sequence_1, sequence_2 : str
The two sequences of strings for comparison.
type_ : str
The type of metric to compute the similarity.
The default is `identity`.
Returns
-------
float :
The similarity between the pocket sequences of the two kinases.
"""
# Replace possible unavailable residue
# noted in KLIFS with "-"
# by the symbol "*" for biotite
sequence_1 = sequence_1.replace("-", "*")
sequence_2 = sequence_2.replace("-", "*")
if len(sequence_1) != len(sequence_1):
raise ValueError(f"Mismatch in sequence lengths.")
else:
seq_array1 = np.array(list(sequence_1))
seq_array2 = np.array(list(sequence_2))
if type_ == "identity":
is_match_array = identity_score(seq_array1, seq_array2)
similarity_normed = sum(is_match_array) / len(sequence_1)
elif type_ == "substitution":
match_score_array = substitution_score(seq_array1, seq_array2)
similarity_normed = sum(match_score_array) / len(sequence_1)
else:
raise ValueError(f"Type {type_} not defined.")
return similarity_normed
Let’s look at the sequence similarity between two kinases (see also Figure 2):
[15]:
if DEMO:
example1 = "EGFR"
example2 = "MET"
else:
example1 = kinase_selection_df["kinase_klifs"][0]
example2 = kinase_selection_df["kinase_klifs"][1]
print("The sequences are:\n")
for key in (example1, example2):
print(f"{key:5s}: {kinase_sequences_dict[key]}")
# NBVAL_CHECK_OUTPUT
The sequences are:
EGFR : KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
MET : EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
[16]:
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example2], "identity"
)
print(
f"Pocket sequence similarity between {example1} and {example2} kinases: "
f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.
[17]:
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example2], "substitution"
)
print(
f"Pocket sequence similarity between {example1} and {example2} kinases: "
f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR and MET kinases: 0.71 using substitution.
Figure 2: Sequences and sequence similarity between the kinases EGFR and MET.
We can also look at self-similarity:
[18]:
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="identity"
)
print(
f"Pocket sequence similarity between {example1} itself: "
f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR itself: 1.00 using identity.
[19]:
example_seq_similarity = sequence_similarity(
kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="substitution"
)
print(
f"Pocket sequence similarity between {example1} itself: "
f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT
Pocket sequence similarity between EGFR itself: 1.00 using substitution.
As expected, the similarity between a kinase and itself leads to the highest possible score:
Visualize similarity as kinase matrix¶
[20]:
def pairwise_kinase_similarities(kinase_sequence_dictionary, similarity_measure):
"""
Calculate pairwise similarities between a set of kinases.
Parameters
----------
kinase_sequence_dictionary : dict
A dictionary containing the kinase name as key
and the kinase sequence as value.
similarity_measure : str
Select similarity measure: "identity" or "substitution".
Returns
-------
pd.DataFrame
All-against-all similarities between input kinases.
"""
# Initialize matrix with 0
kinase_similarity_matrix = np.zeros(
(len(kinase_sequence_dictionary), len(kinase_sequence_dictionary))
)
# Iterate over matrix and fill with similarity values
for i, kinase_sequence1 in enumerate(kinase_sequence_dictionary.values()):
for j, kinase_sequence2 in enumerate(kinase_sequence_dictionary.values()):
kinase_similarity_matrix[i, j] = sequence_similarity(
kinase_sequence1, kinase_sequence2, similarity_measure
)
# Cast matrix to DataFrame
kinase_similarity_matrix_df = pd.DataFrame(
data=kinase_similarity_matrix,
index=kinase_sequences_dict.keys(),
columns=kinase_sequences_dict.keys(),
)
kinase_similarity_matrix_df.index.name = None
kinase_similarity_matrix_df.columns.name = None
# Check for symmetry
symmetric = (
kinase_similarity_matrix_df.values == kinase_similarity_matrix_df.values.transpose()
).all()
if not symmetric:
raise ValueError(f"Matrix is not symmetric.")
return kinase_similarity_matrix_df
We visualize the similarity matrices using identity and substitution:
Kinase similarity matrix: Identity¶
[21]:
kinase_similarity_matrix_identity_df = pairwise_kinase_similarities(
kinase_sequences_dict, similarity_measure="identity"
)
kinase_similarity_matrix_identity_df
# NBVAL_CHECK_OUTPUT
[21]:
EGFR | ErbB2 | p110a | KDR | BRAF | CDK2 | LCK | MET | p38a | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 1.000000 | 0.894118 | 0.117647 | 0.470588 | 0.376471 | 0.317647 | 0.447059 | 0.458824 | 0.388235 |
ErbB2 | 0.894118 | 1.000000 | 0.117647 | 0.435294 | 0.400000 | 0.329412 | 0.423529 | 0.470588 | 0.400000 |
p110a | 0.117647 | 0.117647 | 1.000000 | 0.152941 | 0.152941 | 0.105882 | 0.141176 | 0.105882 | 0.141176 |
KDR | 0.470588 | 0.435294 | 0.152941 | 1.000000 | 0.400000 | 0.341176 | 0.435294 | 0.470588 | 0.388235 |
BRAF | 0.376471 | 0.400000 | 0.152941 | 0.400000 | 1.000000 | 0.329412 | 0.388235 | 0.376471 | 0.376471 |
CDK2 | 0.317647 | 0.329412 | 0.105882 | 0.341176 | 0.329412 | 1.000000 | 0.376471 | 0.364706 | 0.470588 |
LCK | 0.447059 | 0.423529 | 0.141176 | 0.435294 | 0.388235 | 0.376471 | 1.000000 | 0.400000 | 0.388235 |
MET | 0.458824 | 0.470588 | 0.105882 | 0.470588 | 0.376471 | 0.364706 | 0.400000 | 1.000000 | 0.364706 |
p38a | 0.388235 | 0.400000 | 0.141176 | 0.388235 | 0.376471 | 0.470588 | 0.388235 | 0.364706 | 1.000000 |
[22]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_identity_df.style.background_gradient(cmap=cm).format("{:.3f}")
[22]:
EGFR | ErbB2 | p110a | KDR | BRAF | CDK2 | LCK | MET | p38a | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 1.000 | 0.894 | 0.118 | 0.471 | 0.376 | 0.318 | 0.447 | 0.459 | 0.388 |
ErbB2 | 0.894 | 1.000 | 0.118 | 0.435 | 0.400 | 0.329 | 0.424 | 0.471 | 0.400 |
p110a | 0.118 | 0.118 | 1.000 | 0.153 | 0.153 | 0.106 | 0.141 | 0.106 | 0.141 |
KDR | 0.471 | 0.435 | 0.153 | 1.000 | 0.400 | 0.341 | 0.435 | 0.471 | 0.388 |
BRAF | 0.376 | 0.400 | 0.153 | 0.400 | 1.000 | 0.329 | 0.388 | 0.376 | 0.376 |
CDK2 | 0.318 | 0.329 | 0.106 | 0.341 | 0.329 | 1.000 | 0.376 | 0.365 | 0.471 |
LCK | 0.447 | 0.424 | 0.141 | 0.435 | 0.388 | 0.376 | 1.000 | 0.400 | 0.388 |
MET | 0.459 | 0.471 | 0.106 | 0.471 | 0.376 | 0.365 | 0.400 | 1.000 | 0.365 |
p38a | 0.388 | 0.400 | 0.141 | 0.388 | 0.376 | 0.471 | 0.388 | 0.365 | 1.000 |
Kinase similarity matrix: Substitution¶
[23]:
kinase_similarity_matrix_substitution_df = pairwise_kinase_similarities(
kinase_sequences_dict, similarity_measure="substitution"
)
kinase_similarity_matrix_substitution_df
# NBVAL_CHECK_OUTPUT
[23]:
EGFR | ErbB2 | p110a | KDR | BRAF | CDK2 | LCK | MET | p38a | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 1.000000 | 0.940963 | 0.427062 | 0.716047 | 0.655028 | 0.648447 | 0.711321 | 0.711258 | 0.644644 |
ErbB2 | 0.940963 | 1.000000 | 0.413638 | 0.702118 | 0.654600 | 0.630308 | 0.685075 | 0.697967 | 0.635173 |
p110a | 0.427062 | 0.413638 | 1.000000 | 0.422705 | 0.436459 | 0.423994 | 0.451336 | 0.393699 | 0.431194 |
KDR | 0.716047 | 0.702118 | 0.422705 | 1.000000 | 0.671268 | 0.653379 | 0.687648 | 0.713806 | 0.653444 |
BRAF | 0.655028 | 0.654600 | 0.436459 | 0.671268 | 1.000000 | 0.646755 | 0.672933 | 0.638158 | 0.637912 |
CDK2 | 0.648447 | 0.630308 | 0.423994 | 0.653379 | 0.646755 | 1.000000 | 0.681253 | 0.656025 | 0.723093 |
LCK | 0.711321 | 0.685075 | 0.451336 | 0.687648 | 0.672933 | 0.681253 | 1.000000 | 0.690879 | 0.662581 |
MET | 0.711258 | 0.697967 | 0.393699 | 0.713806 | 0.638158 | 0.656025 | 0.690879 | 1.000000 | 0.629355 |
p38a | 0.644644 | 0.635173 | 0.431194 | 0.653444 | 0.637912 | 0.723093 | 0.662581 | 0.629355 | 1.000000 |
[24]:
# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_substitution_df.style.background_gradient(cmap=cm).format("{:.3f}")
[24]:
EGFR | ErbB2 | p110a | KDR | BRAF | CDK2 | LCK | MET | p38a | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 1.000 | 0.941 | 0.427 | 0.716 | 0.655 | 0.648 | 0.711 | 0.711 | 0.645 |
ErbB2 | 0.941 | 1.000 | 0.414 | 0.702 | 0.655 | 0.630 | 0.685 | 0.698 | 0.635 |
p110a | 0.427 | 0.414 | 1.000 | 0.423 | 0.436 | 0.424 | 0.451 | 0.394 | 0.431 |
KDR | 0.716 | 0.702 | 0.423 | 1.000 | 0.671 | 0.653 | 0.688 | 0.714 | 0.653 |
BRAF | 0.655 | 0.655 | 0.436 | 0.671 | 1.000 | 0.647 | 0.673 | 0.638 | 0.638 |
CDK2 | 0.648 | 0.630 | 0.424 | 0.653 | 0.647 | 1.000 | 0.681 | 0.656 | 0.723 |
LCK | 0.711 | 0.685 | 0.451 | 0.688 | 0.673 | 0.681 | 1.000 | 0.691 | 0.663 |
MET | 0.711 | 0.698 | 0.394 | 0.714 | 0.638 | 0.656 | 0.691 | 1.000 | 0.629 |
p38a | 0.645 | 0.635 | 0.431 | 0.653 | 0.638 | 0.723 | 0.663 | 0.629 | 1.000 |
When we compare the matrices calculated based on the identity and substitution score, the overall pattern is similar, while the values are generally higher using the substitution score.
Note: For all downstream analysis, we will only consider the kinase similarity matrix calculated based on the substitution matrix.
[25]:
kinase_similarity_matrix_df = kinase_similarity_matrix_substitution_df
Save kinase similarity matrix¶
[26]:
file_name = f"kinase_similarity_matrix.csv"
kinase_similarity_matrix_df.to_csv(DATA / file_name)
Kinase distance matrix¶
Since all entries are between \(0\) and \(1\), the similarity matrix \(SM\) is mapped to a distance matrix:
[27]:
print(
f"The values of the similarity matrix lie between: "
f"{kinase_similarity_matrix_df.min().min():.2f}"
f" and {kinase_similarity_matrix_df.max().max():.2f}"
)
# NBVAL_CHECK_OUTPUT
The values of the similarity matrix lie between: 0.39 and 1.00
[28]:
kinase_distance_matrix_df = 1 - kinase_similarity_matrix_df
[29]:
kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")
[29]:
EGFR | ErbB2 | p110a | KDR | BRAF | CDK2 | LCK | MET | p38a | |
---|---|---|---|---|---|---|---|---|---|
EGFR | 0.000 | 0.059 | 0.573 | 0.284 | 0.345 | 0.352 | 0.289 | 0.289 | 0.355 |
ErbB2 | 0.059 | 0.000 | 0.586 | 0.298 | 0.345 | 0.370 | 0.315 | 0.302 | 0.365 |
p110a | 0.573 | 0.586 | 0.000 | 0.577 | 0.564 | 0.576 | 0.549 | 0.606 | 0.569 |
KDR | 0.284 | 0.298 | 0.577 | 0.000 | 0.329 | 0.347 | 0.312 | 0.286 | 0.347 |
BRAF | 0.345 | 0.345 | 0.564 | 0.329 | 0.000 | 0.353 | 0.327 | 0.362 | 0.362 |
CDK2 | 0.352 | 0.370 | 0.576 | 0.347 | 0.353 | 0.000 | 0.319 | 0.344 | 0.277 |
LCK | 0.289 | 0.315 | 0.549 | 0.312 | 0.327 | 0.319 | 0.000 | 0.309 | 0.337 |
MET | 0.289 | 0.302 | 0.606 | 0.286 | 0.362 | 0.344 | 0.309 | 0.000 | 0.371 |
p38a | 0.355 | 0.365 | 0.569 | 0.347 | 0.362 | 0.277 | 0.337 | 0.371 | 0.000 |
Save kinase distance matrix¶
[30]:
file_name = f"kinase_distance_matrix.csv"
kinase_distance_matrix_df.to_csv(DATA / file_name)
Discussion¶
In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is set on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: 1. the identity, which treats all amino acids uniformly, and 2. the substitution, which takes into account the rate of change of residues over evolutionary time.
The kinase similarity matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.
Quiz¶
Should the full kinase sequence be used instead of the pocket sequence?
How does the similarity using identity behave with respect to mutations?
How does similarity using identity compare to similarity using substitution?