T024 · Kinase similarity: Sequence¶

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

Talia B. Kimber, 2021, Volkamer lab, Charité
Dominique Sydow, 2021, Volkamer lab, Charité
Andrea Volkamer, 2021, Volkamer lab, Charité

Aim of this talktorial¶

In this talktorial, we investigate sequence similarity for kinases of interest. KLIFS API is used to retrieve the \(85\) residues of the pocket sequence for each kinase.

Two similarity measures are implemented:

Sequence identity, i.e., the similarity which is based on character-wise discrepancy.
Sequence similarity, i.e., the similarity which is based on a substitution matrix, thus, reflecting similarities between amino acids.

Note: We focus on similarities between orthosteric kinase binding sites; similarities to allosteric binding sites are not covered.

Contents in Theory¶

Kinase dataset
Kinase similarity descriptor: Sequence
- Identity score
- Substitution score
From similarity matrix to distance matrix

Contents in Practical¶

Define the kinases of interest
Retrieve sequences from KLIFS
Sequence similarity
- Identity score
- Substitution score
Kinase similarity
- Visualize similarity as kinase matrix
- Save kinase similarity matrix
Kinase distance matrix
- Save kinase distance matrix

References¶

Kinase dataset: Molecules (2021), 26(3), 629
KLIFS
- KLIFS URL: https://klifs.net/
- KLIFS database: Nucleic Acid Res. (2020), 49(D1), D562-D569
Sequence-based kinase clustering: Manning et al. Science (2002), 298(5600), 1912-1934
Substitution matrix: PNAS (1992), 89(22), 10915-10919
Biotite
- Documentation: https://www.biotite-python.org/index.html
- Blosum matrix
Sequence logo: http://www.cbs.dtu.dk/biotools/Seq2Logo/

Theory¶

Kinase dataset¶

We use the kinase selection as defined in Talktorial T023.

Kinase similarity descriptor: Sequence¶

As mentioned in the previous talktorial, sequence is often used to assess kinase similarity (see the phylogenetic tree developed by Manning et al. Science (2002), 298(5600), 1912-1934).

In this talktorial, the KLIFS pocket sequence is used for two main reasons:

The sequence is of fixed length (it contains \(85\) residues), which makes computation for pairwise similarity between two sequences straightforward.
The binding pocket is where the action takes place. Why consider the full kinase sequence when an \(85\) residues sequence contains most relevant information?

Figure 1: The sequence logo shows amino acid binding motifs and sequence profiles such as amino acid depletion. For example, the sequence logo easily visualizes the conserved G-rich loop (position 4-9) and the DFG motif (position 81-83), see Talktorial T023 for more details.

Note for reproducibility: The figure is generated using the Seq2Logo tool available at http://www.cbs.dtu.dk/biotools/Seq2Logo/. The input is the \(85\) residues KLIFS binding pocket for the query kinases. All parameters are the the default ones. For the graphical layout, the entry Stacks Per Line is set to \(85\) and Page size to \(1200 \times 480.\)

We now describe two ways to compare pocket sequences.

Identity score¶

A simple way of assessing the similarity between two sequences is to use the so-called identity score. First, a match vector is created: it checks whether for each position the characters from the two sequences are identical. If they are, the entry is set to \(1,\) and \(0\) otherwise.

The identity score is computed by summing the elements in the match vector and normalizing the entry by the length of the sequence, which is \(85\) in the case of KLIFS pocket sequence.

Let’s consider the identity matrix \(I\) below:

	A	C	D	E	…
A	1	0	0	0	…
C	0	1	0	0	…
D	0	0	1	0	…
E	0	0	0	1	…
…	…	…	…	…	…

and let \(M = len(K_i) = len(K_j)\) for two kinases \(K_i\) and \(K_j\).

We use the following as similarity between kinases \(K_i\) and \(K_j\):

\[\text{similarity}(K_i, K_j) = \frac{1}{M} \sum_{n=1}^{M} I (K_i[n], K_j[n]),\]

where \(K_i[n]\) represents the amino acid at position \(n\) of the sequence of kinase \(i\).

Substitution score¶

Although the identity score is an easy measure of similarity, it does not take into account the rate at which an amino acid may change into another and treats all residues uniformly.

The substitution score takes the changes of the amino acids over evolutionary time into account. It makes use of a substitution matrix, where each entry gives a score between two amino acids. In this talktorial, we use the BLOSUM substitution matrix PNAS (1992), 89(22), 10915-10919, implemented in biotite.

The BLOSUM substitution matrix \(SM\) is defined as below (the full matrix will be displayed in the Practical part):

	A	C	D	E	…
A	4	0	-2	-1	…
C	0	9	-3	-4	…
D	-2	-3	6	2	…
E	-1	-4	2	5	…
…	…	…	…	…	…

The BLOSUM substitution matrix is symmetric, as shown in the practical part.

For convenience, we will translate and rescale the matrix using the following:

\[SM' = SM - min(SM),\]

for translation, and

for all \(x, y\)

\[SM''(x, y) = \frac{SM'[x,y]}{\sqrt{SM'[x,x]*SM'[y,y]}},\]

for rescaling, such that

\[SM''[x, x] = 1 \text{ for all } x,\]

and

\[SM''[x, y] \in [0, 1] \text{ for all } x,y.\]

We use the following as similarity between kinases \(K_i\) and \(K_j\):

\[\text{similarity}(K_i, K_j) = \frac{1}{M} \sum_{n=1}^{M} SM''\big(K_i[n], K_j[n]\big),\]

where \(K_i[n]\) represents the amino acid at position \(n\) of the sequence of kinase \(i\) and \(M = len(K_i) = len(K_j)\).

From similarity matrix to distance matrix¶

In order to apply some clustering algorithm to assess the similarity between kinases, it is necessary to start with a distance matrix. A similarity matrix is not, by definition, a distance matrix. For example, the diagonal elements are not zero. For now, we map the similarity matrix \(SM\) to a distance matrix \(DM\) using

\[DM = 1-SM.\]

See Talktorial T028 for more details.

Practical¶

[1]:

from pathlib import Path
import requests

import pandas as pd
import numpy as np
import seaborn as sns
import biotite.sequence.align as align

[2]:

HERE = Path(_dh[-1])
DATA = HERE / "data"

[3]:

configs = pd.read_csv(HERE / "../T023_what_is_a_kinase/data/pipeline_configs.csv")
configs = configs.set_index("variable")["default_value"]
DEMO = bool(int(configs["DEMO"]))
print(f"Run in demo mode: {DEMO}")
# NBVAL_CHECK_OUTPUT

Run in demo mode: True

Define the kinases of interest¶

Let’s load the kinase selection as defined in Talktorial T023.

[4]:

kinase_selection_df = pd.read_csv(
    HERE
    / "../T023_what_is_a_kinase/\
data/kinase_selection.csv"
)
kinase_selection_df
# NBVAL_CHECK_OUTPUT

[4]:

	kinase	kinase_klifs	uniprot_id	group	full_kinase_name
0	EGFR	EGFR	P00533	TK	Epidermal growth factor receptor
1	ErbB2	ErbB2	P04626	TK	Erythroblastic leukemia viral oncogene homolog 2
2	PI3K	p110a	P42336	Atypical	Phosphatidylinositol-3-kinase
3	VEGFR2	KDR	P35968	TK	Vascular endothelial growth factor receptor 2
4	BRAF	BRAF	P15056	TKL	Rapidly accelerated fibrosarcoma isoform B
5	CDK2	CDK2	P24941	CMGC	Cyclic-dependent kinase 2
6	LCK	LCK	P06239	TK	Lymphocyte-specific protein tyrosine kinase
7	MET	MET	P08581	TK	Mesenchymal-epithelial transition factor
8	p38a	p38a	Q16539	CMGC	p38 mitogen activated protein kinase alpha

Retrieve sequences from KLIFS¶

We use the KLIFS API to retrieve the \(85\)-long pocket sequence for each kinase.

[5]:

def klifs_pocket_sequence(klifs_kinase_name):
    """
    Retrieves the pocket sequence from KLIFS using the API.

    Parameters
    ----------
    klifs_kinase_name : str
        The KLIFS name of the kinase of interest.

    Returns
    -------
    str :
        The 85 residues pocket sequence from KLIFS,
        if the kinase name is valid, None otherwise.
    """
    response = requests.get(
        f"https://klifs.net/api/kinase_ID?kinase_name={klifs_kinase_name}&species=HUMAN"
    )

    if response.status_code == 200:
        return response.json()[0]["pocket"]
    else:
        print(f"KLIFS failed for kinase {klifs_kinase_name}")
        return None

We create a dictionary made of the kinase name and its associated pocket sequence. This dictionary is used throughout this notebook.

[6]:

kinase_sequences_dict = {}
for kinase in kinase_selection_df["kinase_klifs"]:
    kinase_sequences_dict[kinase] = klifs_pocket_sequence(kinase)

for name, sequence in kinase_sequences_dict.items():
    print(f"{name:10}{sequence}")
# NBVAL_CHECK_OUTPUT

EGFR      KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
ErbB2     KVLGSGAFGTVYKVAIKVLEILDEAYVMAGVGPYVSRLLGIQLVTQLMPYGCLLDHVREYLEDVRLVHRDLAARNVLVITDFGLA
p110a     CRIMSSAKRPLWLIIFKNGDLRQDMLTLQIIRLRMLPYGCLVGLIEVVRSHTIMQIQCKATFI--LGIGDRHNSNIMVHIDFGHF
KDR       KPLGRGAFGQVIEVAVKMLALMSELKILIHIGLNVVNLLGAMVIVEFCKFGNLSTYLRSFLASRKCIHRDLAARNILLICDFGLA
BRAF      QRIGSGSFGTVYKVAVKMLAFKNEVGVLRKTRVNILLFMGYAIVTQWCEGSSLYHHLHIYLHAKSIIHRDLKSNNIFLIGDFGLA
CDK2      EKIGEGTYGVVYKVALKKITAIREISLLKELNPNIVKLLDVYLVFEFLH-QDLKKFMDAFCHSHRVLHRDLKPQNLLILADFGLA
LCK       ERLGAGQFGEVWMVAVKSLAFLAEANLMKQLQQRLVRLYAVYIITEYMENGSLVDFLKTFIEERNYIHRDLRAANILVIADFGLA
MET       EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA
p38a      SPVGSGAYGSVCAVAVKKLRTYRELRLLKHMKENVIGLLDVYLVTHLMG-ADLNNIVKCYIHSADIIHRDLKPSNLAVILDFGLA

As shown in the cell above, some sequences have missing residues, denoted by “-”. Let’s plot these sequences as heatmap for a quick visual on conserved regions.

[7]:

# Cast dict to DataFrame
kinase_sequences_df = pd.DataFrame(
    [list(i) for i in kinase_sequences_dict.values()],
    index=kinase_sequences_dict.keys(),
    columns=range(1, 86),
)
# Cast letters to integers (gap is set to None)
letter_to_int = {letter: ix for ix, letter in enumerate(list("ACDEFGHIKLMNPQRSTVWY"), 1)}
letter_to_int["-"] = None
kinase_sequences_int_df = kinase_sequences_df.applymap(lambda x: letter_to_int[x])
# Show heatmap (qualitative colormap)
ax = sns.heatmap(kinase_sequences_int_df, cmap="tab20", cbar=False)

../_images/talktorials_T024_kinase_similarity_sequence_37_0.png

Sequence similarity¶

Given two kinases, we create functions which account for identity or substitution similarity, as described in the Theory part.

Identity score¶

We first define a function which compares element-wise characters in two sequences.

[8]:

def identity_score(sequence1, sequence2):
    """
    Computes the element-wise binary similarity between two sequences.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.

    Returns
    -------
    np.array
        The bool array for each character.
        1 if the elements are identical,
        0 otherwise.
    """
    # True is the character is the same, False otherwise
    return np.compare_chararrays(sequence1, sequence2, cmp="==", rstrip=True)

Substitution score¶

We now define the function which is more specific to amino acids grouping and use the biotite library to retrieve the BLOSUM substitution matrix.

The substitution matrix can be retrieve from biotite using the following command:

[9]:

substitution_matrix = align.SubstitutionMatrix.std_protein_matrix()
print(substitution_matrix)
# NBVAL_CHECK_OUTPUT

    A   C   D   E   F   G   H   I   K   L   M   N   P   Q   R   S   T   V   W   Y   B   Z   X   *
A   4   0  -2  -1  -2   0  -2  -1  -1  -1  -1  -2  -1  -1  -1   1   0   0  -3  -2  -2  -1   0  -4
C   0   9  -3  -4  -2  -3  -3  -1  -3  -1  -1  -3  -3  -3  -3  -1  -1  -1  -2  -2  -3  -3  -2  -4
D  -2  -3   6   2  -3  -1  -1  -3  -1  -4  -3   1  -1   0  -2   0  -1  -3  -4  -3   4   1  -1  -4
E  -1  -4   2   5  -3  -2   0  -3   1  -3  -2   0  -1   2   0   0  -1  -2  -3  -2   1   4  -1  -4
F  -2  -2  -3  -3   6  -3  -1   0  -3   0   0  -3  -4  -3  -3  -2  -2  -1   1   3  -3  -3  -1  -4
G   0  -3  -1  -2  -3   6  -2  -4  -2  -4  -3   0  -2  -2  -2   0  -2  -3  -2  -3  -1  -2  -1  -4
H  -2  -3  -1   0  -1  -2   8  -3  -1  -3  -2   1  -2   0   0  -1  -2  -3  -2   2   0   0  -1  -4
I  -1  -1  -3  -3   0  -4  -3   4  -3   2   1  -3  -3  -3  -3  -2  -1   3  -3  -1  -3  -3  -1  -4
K  -1  -3  -1   1  -3  -2  -1  -3   5  -2  -1   0  -1   1   2   0  -1  -2  -3  -2   0   1  -1  -4
L  -1  -1  -4  -3   0  -4  -3   2  -2   4   2  -3  -3  -2  -2  -2  -1   1  -2  -1  -4  -3  -1  -4
M  -1  -1  -3  -2   0  -3  -2   1  -1   2   5  -2  -2   0  -1  -1  -1   1  -1  -1  -3  -1  -1  -4
N  -2  -3   1   0  -3   0   1  -3   0  -3  -2   6  -2   0   0   1   0  -3  -4  -2   3   0  -1  -4
P  -1  -3  -1  -1  -4  -2  -2  -3  -1  -3  -2  -2   7  -1  -2  -1  -1  -2  -4  -3  -2  -1  -2  -4
Q  -1  -3   0   2  -3  -2   0  -3   1  -2   0   0  -1   5   1   0  -1  -2  -2  -1   0   3  -1  -4
R  -1  -3  -2   0  -3  -2   0  -3   2  -2  -1   0  -2   1   5  -1  -1  -3  -3  -2  -1   0  -1  -4
S   1  -1   0   0  -2   0  -1  -2   0  -2  -1   1  -1   0  -1   4   1  -2  -3  -2   0   0   0  -4
T   0  -1  -1  -1  -2  -2  -2  -1  -1  -1  -1   0  -1  -1  -1   1   5   0  -2  -2  -1  -1   0  -4
V   0  -1  -3  -2  -1  -3  -3   3  -2   1   1  -3  -2  -2  -3  -2   0   4  -3  -1  -3  -2  -1  -4
W  -3  -2  -4  -3   1  -2  -2  -3  -3  -2  -1  -4  -4  -2  -3  -3  -2  -3  11   2  -4  -3  -2  -4
Y  -2  -2  -3  -2   3  -3   2  -1  -2  -1  -1  -2  -3  -1  -2  -2  -2  -1   2   7  -3  -2  -1  -4
B  -2  -3   4   1  -3  -1   0  -3   0  -4  -3   3  -2   0  -1   0  -1  -3  -4  -3   4   1  -1  -4
Z  -1  -3   1   4  -3  -2   0  -3   1  -3  -1   0  -1   3   0   0  -1  -2  -3  -2   1   4  -1  -4
X   0  -2  -1  -1  -1  -1  -1  -1  -1  -1  -1  -1  -2  -1  -1   0   0  -1  -2  -1  -1  -1  -1  -4
*  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4  -4   1

Check for symmetry:

[10]:

substitution_matrix.is_symmetric()
# NBVAL_CHECK_OUTPUT

[10]:

True

Let’s perform the translation-rescaling step we discussed in the Theory part.

[11]:

def _translate_rescale_substitution_matrix(
    substitution_matrix=align.SubstitutionMatrix.std_protein_matrix(),
):
    """
    Translate and rescale substitution matrix.

    Parameters
    ----------
    substitution_matrix : biotite.sequence.align.SubstitutionMatrix
        A substitution matrix specific to amino acids.
        The default is align.SubstitutionMatrix.std_protein_matrix()
        from biotite, which represents BLOSUM62.

    Returns
    -------
    pd.DataFrame
        Translated and rescaled substitution matrix as DataFrame
        (index/columns contain letters).
    """
    # Retrieve np.array from substitution matrix
    score_matrix = substitution_matrix.score_matrix()

    # Translation of substitution matrix
    translated_matrix = score_matrix - np.min(score_matrix)

    # Rescaling
    normalized_score_matrix = np.zeros(score_matrix.shape)
    for i in range(score_matrix.shape[0]):
        for j in range(score_matrix.shape[0]):
            normalized_score_matrix[i, j] = translated_matrix[i, j] / np.sqrt(
                translated_matrix[i, i] * translated_matrix[j, j]
            )

    # Create DataFrame from matrix with letters as index/column names
    normalized_score_matrix = pd.DataFrame(
        normalized_score_matrix,
        columns=substitution_matrix.get_alphabet1(),
        index=substitution_matrix.get_alphabet1(),
    )

    # Check for symmetry
    symmetric = (
        normalized_score_matrix.values == normalized_score_matrix.values.transpose()
    ).all()
    if not symmetric:
        raise ValueError(f"Translated/rescaled matrix is not symmetric.")

    return normalized_score_matrix

Let’s take a look at the translated and rescaled version of the substitution matrix.

[12]:

_translate_rescale_substitution_matrix().style.format("{:.2f}")

[12]:

	A	C	D	E	F	G	H	I	K	L	M	N	P	Q	R	S	T	V	W	Y	B	Z	X	*
A	1.00	0.39	0.22	0.35	0.22	0.45	0.20	0.38	0.35	0.38	0.35	0.22	0.32	0.35	0.35	0.62	0.47	0.50	0.09	0.21	0.25	0.38	0.82	0.00
C	0.39	1.00	0.09	0.00	0.18	0.09	0.08	0.29	0.09	0.29	0.28	0.09	0.08	0.09	0.09	0.29	0.28	0.29	0.14	0.17	0.10	0.10	0.32	0.00
D	0.22	0.09	1.00	0.63	0.10	0.30	0.27	0.11	0.32	0.00	0.11	0.50	0.29	0.42	0.21	0.45	0.32	0.11	0.00	0.10	0.89	0.56	0.55	0.00
E	0.35	0.00	0.63	1.00	0.11	0.21	0.38	0.12	0.56	0.12	0.22	0.42	0.30	0.67	0.44	0.47	0.33	0.24	0.09	0.20	0.59	0.94	0.58	0.00
F	0.22	0.18	0.10	0.11	1.00	0.10	0.27	0.45	0.11	0.45	0.42	0.10	0.00	0.11	0.11	0.22	0.21	0.34	0.41	0.67	0.11	0.11	0.55	0.00
G	0.45	0.09	0.30	0.21	0.10	1.00	0.18	0.00	0.21	0.00	0.11	0.40	0.19	0.21	0.21	0.45	0.21	0.11	0.16	0.10	0.34	0.22	0.55	0.00
H	0.20	0.08	0.27	0.38	0.27	0.18	1.00	0.10	0.29	0.10	0.19	0.46	0.17	0.38	0.38	0.31	0.19	0.10	0.15	0.52	0.41	0.41	0.50	0.00
I	0.38	0.29	0.11	0.12	0.45	0.00	0.10	1.00	0.12	0.75	0.59	0.11	0.11	0.12	0.12	0.25	0.35	0.88	0.09	0.32	0.12	0.12	0.61	0.00
K	0.35	0.09	0.32	0.56	0.11	0.21	0.29	0.12	1.00	0.24	0.33	0.42	0.30	0.56	0.67	0.47	0.33	0.24	0.09	0.20	0.47	0.59	0.58	0.00
L	0.38	0.29	0.00	0.12	0.45	0.00	0.10	0.75	0.24	1.00	0.71	0.11	0.11	0.24	0.24	0.25	0.35	0.62	0.18	0.32	0.00	0.12	0.61	0.00
M	0.35	0.28	0.11	0.22	0.42	0.11	0.19	0.59	0.33	0.71	1.00	0.21	0.20	0.44	0.33	0.35	0.33	0.59	0.26	0.30	0.12	0.35	0.58	0.00
N	0.22	0.09	0.50	0.42	0.10	0.40	0.46	0.11	0.42	0.11	0.21	1.00	0.19	0.42	0.42	0.56	0.42	0.11	0.00	0.19	0.78	0.45	0.55	0.00
P	0.32	0.08	0.29	0.30	0.00	0.19	0.17	0.11	0.30	0.11	0.20	0.19	1.00	0.30	0.20	0.32	0.30	0.21	0.00	0.09	0.21	0.32	0.35	0.00
Q	0.35	0.09	0.42	0.67	0.11	0.21	0.38	0.12	0.56	0.24	0.44	0.42	0.30	1.00	0.56	0.47	0.33	0.24	0.17	0.30	0.47	0.82	0.58	0.00
R	0.35	0.09	0.21	0.44	0.11	0.21	0.38	0.12	0.67	0.24	0.33	0.42	0.20	0.56	1.00	0.35	0.33	0.12	0.09	0.20	0.35	0.47	0.58	0.00
S	0.62	0.29	0.45	0.47	0.22	0.45	0.31	0.25	0.47	0.25	0.35	0.56	0.32	0.47	0.35	1.00	0.59	0.25	0.09	0.21	0.50	0.50	0.82	0.00
T	0.47	0.28	0.32	0.33	0.21	0.21	0.19	0.35	0.33	0.35	0.33	0.42	0.30	0.33	0.33	0.59	1.00	0.47	0.17	0.20	0.35	0.35	0.77	0.00
V	0.50	0.29	0.11	0.24	0.34	0.11	0.10	0.88	0.24	0.62	0.59	0.11	0.21	0.24	0.12	0.25	0.47	1.00	0.09	0.32	0.12	0.25	0.61	0.00
W	0.09	0.14	0.00	0.09	0.41	0.16	0.15	0.09	0.09	0.18	0.26	0.00	0.00	0.17	0.09	0.09	0.17	0.09	1.00	0.47	0.00	0.09	0.30	0.00
Y	0.21	0.17	0.10	0.20	0.67	0.10	0.52	0.32	0.20	0.32	0.30	0.19	0.09	0.30	0.20	0.21	0.20	0.32	0.47	1.00	0.11	0.21	0.52	0.00
B	0.25	0.10	0.89	0.59	0.11	0.34	0.41	0.12	0.47	0.00	0.12	0.78	0.21	0.47	0.35	0.50	0.35	0.12	0.00	0.11	1.00	0.62	0.61	0.00
Z	0.38	0.10	0.56	0.94	0.11	0.22	0.41	0.12	0.59	0.12	0.35	0.45	0.32	0.82	0.47	0.50	0.35	0.25	0.09	0.21	0.62	1.00	0.61	0.00
X	0.82	0.32	0.55	0.58	0.55	0.55	0.50	0.61	0.58	0.61	0.58	0.55	0.35	0.58	0.58	0.82	0.77	0.61	0.30	0.52	0.61	0.61	1.00	0.00
*	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	0.00	1.00

Let’s define a function that calculates the substitution scores between two sequences (will make use of the previously defined function).

[13]:

def substitution_score(
    sequence1, sequence2, substitution_matrix=align.SubstitutionMatrix.std_protein_matrix()
):
    """
    Retrieve the match score given the substitution matrix.

    Parameters
    ----------
    sequence1 : np.array
        An array of characters describing the first sequence.
    sequence2 : np.array
        An array of characters describing the second sequence.
    substitution_matrix : biotite.sequence.align.SubstitutionMatrix
        A substitution matrix specific to amino acids.
        The default is align.SubstitutionMatrix.std_protein_matrix()
        from biotite, which represents BLOSUM62.

    Returns
    -------
    np.array
        The vector of match score
        using the normalized substitution matrix.
    """
    substitution_matrix_df = _translate_rescale_substitution_matrix(substitution_matrix)

    match_score_array = np.zeros(len(sequence1))
    for i, (character_seq1, character_seq2) in enumerate(zip(sequence1, sequence2)):
        match_score_array[i] = substitution_matrix_df.loc[character_seq1, character_seq2]
    return match_score_array

Kinase similarity¶

Given two kinases, we create a function which computes the sequence similarity between them using one of the two measures, the identity or the substitution.

[14]:

def sequence_similarity(sequence_1, sequence_2, type_="identity"):
    """
    Compares two sequences using a given metric.

    Parameters
    ----------
    sequence_1, sequence_2 : str
        The two sequences of strings for comparison.

    type_ : str
        The type of metric to compute the similarity.
        The default is `identity`.

    Returns
    -------
    float :
        The similarity between the pocket sequences of the two kinases.
    """

    # Replace possible unavailable residue
    # noted in KLIFS with "-"
    # by the symbol "*" for biotite
    sequence_1 = sequence_1.replace("-", "*")
    sequence_2 = sequence_2.replace("-", "*")

    if len(sequence_1) != len(sequence_1):
        raise ValueError(f"Mismatch in sequence lengths.")
    else:
        seq_array1 = np.array(list(sequence_1))
        seq_array2 = np.array(list(sequence_2))

        if type_ == "identity":
            is_match_array = identity_score(seq_array1, seq_array2)
            similarity_normed = sum(is_match_array) / len(sequence_1)
        elif type_ == "substitution":
            match_score_array = substitution_score(seq_array1, seq_array2)
            similarity_normed = sum(match_score_array) / len(sequence_1)
        else:
            raise ValueError(f"Type {type_} not defined.")

        return similarity_normed

Let’s look at the sequence similarity between two kinases (see also Figure 2):

[15]:

if DEMO:
    example1 = "EGFR"
    example2 = "MET"
else:
    example1 = kinase_selection_df["kinase_klifs"][0]
    example2 = kinase_selection_df["kinase_klifs"][1]

print("The sequences are:\n")
for key in (example1, example2):
    print(f"{key:5s}: {kinase_sequences_dict[key]}")
# NBVAL_CHECK_OUTPUT

The sequences are:

EGFR : KVLGSGAFGTVYKVAIKELEILDEAYVMASVDPHVCRLLGIQLITQLMPFGCLLDYVREYLEDRRLVHRDLAARNVLVITDFGLA
MET  : EVIGRGHFGCVYHCAVKSLQFLTEGIIMKDFSPNVLSLLGILVVLPYMKHGDLRNFIRNYLASKKFVHRDLAARNCMLVADFGLA

[16]:

example_seq_similarity = sequence_similarity(
    kinase_sequences_dict[example1], kinase_sequences_dict[example2], "identity"
)

print(
    f"Pocket sequence similarity between {example1} and {example2} kinases: "
    f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR and MET kinases: 0.46 using identity.

[17]:

example_seq_similarity = sequence_similarity(
    kinase_sequences_dict[example1], kinase_sequences_dict[example2], "substitution"
)
print(
    f"Pocket sequence similarity between {example1} and {example2} kinases: "
    f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR and MET kinases: 0.71 using substitution.

Figure 2: Sequences and sequence similarity between the kinases EGFR and MET.

We can also look at self-similarity:

[18]:

example_seq_similarity = sequence_similarity(
    kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="identity"
)
print(
    f"Pocket sequence similarity between {example1} itself: "
    f"{example_seq_similarity:.2f} using identity."
)
# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR itself: 1.00 using identity.

[19]:

example_seq_similarity = sequence_similarity(
    kinase_sequences_dict[example1], kinase_sequences_dict[example1], type_="substitution"
)
print(
    f"Pocket sequence similarity between {example1} itself: "
    f"{example_seq_similarity:.2f} using substitution."
)
# NBVAL_CHECK_OUTPUT

Pocket sequence similarity between EGFR itself: 1.00 using substitution.

As expected, the similarity between a kinase and itself leads to the highest possible score:

Visualize similarity as kinase matrix¶

[20]:

def pairwise_kinase_similarities(kinase_sequence_dictionary, similarity_measure):
    """
    Calculate pairwise similarities between a set of kinases.

    Parameters
    ----------
    kinase_sequence_dictionary : dict
        A dictionary containing the kinase name as key
        and the kinase sequence as value.
    similarity_measure : str
        Select similarity measure: "identity" or "substitution".

    Returns
    -------
    pd.DataFrame
        All-against-all similarities between input kinases.
    """
    # Initialize matrix with 0
    kinase_similarity_matrix = np.zeros(
        (len(kinase_sequence_dictionary), len(kinase_sequence_dictionary))
    )
    # Iterate over matrix and fill with similarity values
    for i, kinase_sequence1 in enumerate(kinase_sequence_dictionary.values()):
        for j, kinase_sequence2 in enumerate(kinase_sequence_dictionary.values()):
            kinase_similarity_matrix[i, j] = sequence_similarity(
                kinase_sequence1, kinase_sequence2, similarity_measure
            )

    # Cast matrix to DataFrame
    kinase_similarity_matrix_df = pd.DataFrame(
        data=kinase_similarity_matrix,
        index=kinase_sequences_dict.keys(),
        columns=kinase_sequences_dict.keys(),
    )
    kinase_similarity_matrix_df.index.name = None
    kinase_similarity_matrix_df.columns.name = None

    # Check for symmetry
    symmetric = (
        kinase_similarity_matrix_df.values == kinase_similarity_matrix_df.values.transpose()
    ).all()
    if not symmetric:
        raise ValueError(f"Matrix is not symmetric.")

    return kinase_similarity_matrix_df

We visualize the similarity matrices using identity and substitution:

Kinase similarity matrix: Identity¶

[21]:

kinase_similarity_matrix_identity_df = pairwise_kinase_similarities(
    kinase_sequences_dict, similarity_measure="identity"
)
kinase_similarity_matrix_identity_df
# NBVAL_CHECK_OUTPUT

[21]:

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000000	0.894118	0.117647	0.470588	0.376471	0.317647	0.447059	0.458824	0.388235
ErbB2	0.894118	1.000000	0.117647	0.435294	0.400000	0.329412	0.423529	0.470588	0.400000
p110a	0.117647	0.117647	1.000000	0.152941	0.152941	0.105882	0.141176	0.105882	0.141176
KDR	0.470588	0.435294	0.152941	1.000000	0.400000	0.341176	0.435294	0.470588	0.388235
BRAF	0.376471	0.400000	0.152941	0.400000	1.000000	0.329412	0.388235	0.376471	0.376471
CDK2	0.317647	0.329412	0.105882	0.341176	0.329412	1.000000	0.376471	0.364706	0.470588
LCK	0.447059	0.423529	0.141176	0.435294	0.388235	0.376471	1.000000	0.400000	0.388235
MET	0.458824	0.470588	0.105882	0.470588	0.376471	0.364706	0.400000	1.000000	0.364706
p38a	0.388235	0.400000	0.141176	0.388235	0.376471	0.470588	0.388235	0.364706	1.000000

[22]:

# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_identity_df.style.background_gradient(cmap=cm).format("{:.3f}")

[22]:

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000	0.894	0.118	0.471	0.376	0.318	0.447	0.459	0.388
ErbB2	0.894	1.000	0.118	0.435	0.400	0.329	0.424	0.471	0.400
p110a	0.118	0.118	1.000	0.153	0.153	0.106	0.141	0.106	0.141
KDR	0.471	0.435	0.153	1.000	0.400	0.341	0.435	0.471	0.388
BRAF	0.376	0.400	0.153	0.400	1.000	0.329	0.388	0.376	0.376
CDK2	0.318	0.329	0.106	0.341	0.329	1.000	0.376	0.365	0.471
LCK	0.447	0.424	0.141	0.435	0.388	0.376	1.000	0.400	0.388
MET	0.459	0.471	0.106	0.471	0.376	0.365	0.400	1.000	0.365
p38a	0.388	0.400	0.141	0.388	0.376	0.471	0.388	0.365	1.000

Kinase similarity matrix: Substitution¶

[23]:

kinase_similarity_matrix_substitution_df = pairwise_kinase_similarities(
    kinase_sequences_dict, similarity_measure="substitution"
)
kinase_similarity_matrix_substitution_df
# NBVAL_CHECK_OUTPUT

[23]:

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000000	0.940963	0.427062	0.716047	0.655028	0.648447	0.711321	0.711258	0.644644
ErbB2	0.940963	1.000000	0.413638	0.702118	0.654600	0.630308	0.685075	0.697967	0.635173
p110a	0.427062	0.413638	1.000000	0.422705	0.436459	0.423994	0.451336	0.393699	0.431194
KDR	0.716047	0.702118	0.422705	1.000000	0.671268	0.653379	0.687648	0.713806	0.653444
BRAF	0.655028	0.654600	0.436459	0.671268	1.000000	0.646755	0.672933	0.638158	0.637912
CDK2	0.648447	0.630308	0.423994	0.653379	0.646755	1.000000	0.681253	0.656025	0.723093
LCK	0.711321	0.685075	0.451336	0.687648	0.672933	0.681253	1.000000	0.690879	0.662581
MET	0.711258	0.697967	0.393699	0.713806	0.638158	0.656025	0.690879	1.000000	0.629355
p38a	0.644644	0.635173	0.431194	0.653444	0.637912	0.723093	0.662581	0.629355	1.000000

[24]:

# Show matrix with background gradient
cm = sns.light_palette("green", as_cmap=True)
kinase_similarity_matrix_substitution_df.style.background_gradient(cmap=cm).format("{:.3f}")

[24]:

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	1.000	0.941	0.427	0.716	0.655	0.648	0.711	0.711	0.645
ErbB2	0.941	1.000	0.414	0.702	0.655	0.630	0.685	0.698	0.635
p110a	0.427	0.414	1.000	0.423	0.436	0.424	0.451	0.394	0.431
KDR	0.716	0.702	0.423	1.000	0.671	0.653	0.688	0.714	0.653
BRAF	0.655	0.655	0.436	0.671	1.000	0.647	0.673	0.638	0.638
CDK2	0.648	0.630	0.424	0.653	0.647	1.000	0.681	0.656	0.723
LCK	0.711	0.685	0.451	0.688	0.673	0.681	1.000	0.691	0.663
MET	0.711	0.698	0.394	0.714	0.638	0.656	0.691	1.000	0.629
p38a	0.645	0.635	0.431	0.653	0.638	0.723	0.663	0.629	1.000

When we compare the matrices calculated based on the identity and substitution score, the overall pattern is similar, while the values are generally higher using the substitution score.

Note: For all downstream analysis, we will only consider the kinase similarity matrix calculated based on the substitution matrix.

[25]:

kinase_similarity_matrix_df = kinase_similarity_matrix_substitution_df

Save kinase similarity matrix¶

[26]:

file_name = f"kinase_similarity_matrix.csv"
kinase_similarity_matrix_df.to_csv(DATA / file_name)

Kinase distance matrix¶

Since all entries are between \(0\) and \(1\), the similarity matrix \(SM\) is mapped to a distance matrix:

[27]:

print(
    f"The values of the similarity matrix lie between: "
    f"{kinase_similarity_matrix_df.min().min():.2f}"
    f" and {kinase_similarity_matrix_df.max().max():.2f}"
)
# NBVAL_CHECK_OUTPUT

The values of the similarity matrix lie between: 0.39 and 1.00

[28]:

kinase_distance_matrix_df = 1 - kinase_similarity_matrix_df

[29]:

kinase_distance_matrix_df.style.background_gradient(cmap=cm).format("{:.3f}")

[29]:

	EGFR	ErbB2	p110a	KDR	BRAF	CDK2	LCK	MET	p38a
EGFR	0.000	0.059	0.573	0.284	0.345	0.352	0.289	0.289	0.355
ErbB2	0.059	0.000	0.586	0.298	0.345	0.370	0.315	0.302	0.365
p110a	0.573	0.586	0.000	0.577	0.564	0.576	0.549	0.606	0.569
KDR	0.284	0.298	0.577	0.000	0.329	0.347	0.312	0.286	0.347
BRAF	0.345	0.345	0.564	0.329	0.000	0.353	0.327	0.362	0.362
CDK2	0.352	0.370	0.576	0.347	0.353	0.000	0.319	0.344	0.277
LCK	0.289	0.315	0.549	0.312	0.327	0.319	0.000	0.309	0.337
MET	0.289	0.302	0.606	0.286	0.362	0.344	0.309	0.000	0.371
p38a	0.355	0.365	0.569	0.347	0.362	0.277	0.337	0.371	0.000

Save kinase distance matrix¶

[30]:

file_name = f"kinase_distance_matrix.csv"
kinase_distance_matrix_df.to_csv(DATA / file_name)

Discussion¶

In this talktorial, we investigate how sequences can be used to measure similarity between kinases. The focus is set on the pocket sequence, which is retrieved from KLIFS. Sequence similarity can be assessed using two scores: 1. the identity, which treats all amino acids uniformly, and 2. the substitution, which takes into account the rate of change of residues over evolutionary time.

The kinase similarity matrix above will be reloaded in Talktorial T028, where we compare kinase similarities from different perspectives, including the pocket sequence perspective we have talked about in this talktorial.

Quiz¶

Should the full kinase sequence be used instead of the pocket sequence?
How does the similarity using identity behave with respect to mutations?
How does similarity using identity compare to similarity using substitution?