# T011 · Querying online API webservices

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

## Aim of this talktorial

In this notebook, you will learn how to use online web services programmatically from Python, in the context of drug design. By the end of this talktorial, you will be familiar with REST services and web scraping.

### Contents in Theory

• Data access from a server-side perspective

### Contents in Practical

• Accessing dynamically generated content

• Programmatic interfaces

• Document parsing

• Browser remote control

### References

This guide is very practical and omits some technical definitions for the sake of clarity. However, you should also pick up some basic terminology to fully understand what is going on behind the scenes.

## Theory

The internet is a collection of connected computers that exchange data. In a way, you essentially query machines (servers) with certain parameters to retrieve specific data. That data will be either:

• A. Served straight away, since the server is simply a repository of files. E.g. you can download the ChEMBL database dump from their servers.

• B. Retrieved from a database and formatted in a particular way. The result you see on your browser is either:

• B1. Pre-processed on the server, e.g. the HTML page you see when you visit any article in Wikipedia.

• B2. Dynamically generated on the client (your browser) as you use the website, e.g. Twitter, Facebook, or any modern web-app.

• C. Computed through the execution of one or more programs on the server-side, e.g. estimating the protonation states of a protein-ligand complex using Protoss.

In a way, configuration C is a special type of B1. You are just replacing the type of task that runs on the server: database querying and HTML rendering vs. computations that process your query and return data formatted in a domain-specific way.

Another way of categorizing online services is by the format of the returned data. Most pages you see in your browser use HTML, which focuses on presenting data in a human-readable way. However, some servers structure that data in a machine-readable way. Such data can be processed reliably because it is formatted using a consistent set of rules that can be easily encoded in a program; such programs are usually called parsers. HTML can be labeled so that data can be obtained reliably, but it is not designed with that purpose in mind. As a result, we will usually prefer services that provide machine-readable formats, like JSON, CSV or XML.
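To make the contrast concrete, here is a tiny sketch (with made-up kinase data) comparing how you would extract the same value from JSON with a standard parser versus scraping it out of HTML by hand:

```python
import json

# The same record, once as machine-readable JSON, once as presentational HTML
as_json = '{"kinase": "ABL1", "uniprot": "P00519"}'
as_html = "<tr><td>ABL1</td><td>P00519</td></tr>"

# JSON: one call to a standard parser, robust to formatting changes
record = json.loads(as_json)
print(record["uniprot"])  # P00519

# HTML: brittle string surgery that breaks as soon as the markup changes
cells = as_html.replace("<tr>", "").replace("</tr>", "").split("</td>")
print(cells[1].replace("<td>", ""))  # P00519
```

Both reach the same value, but only the JSON route survives cosmetic changes to the page.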

In practice, both ways of data presentation (should) coexist in harmony. Modern web architecture strives to separate data retrieval tasks from end-user presentation. One popular implementation consists of using a programmatic endpoint that returns machine-readable JSON data, which is then consumed by the user-facing web application. The latter renders HTML, either on the server -option B1-, or on the user’s browser -option B2. Unfortunately, unlike the user-facing application, the programmatic endpoint (API) is not guaranteed to be publicly available, and is sometimes restricted to internal usage on the server side.

In the following sections, we will discuss how to make the most out of each type of online service using Python and some libraries!

## Practical

[1]:

from pathlib import Path

HERE = Path(_dh[-1])
DATA = HERE / "data"
TMPDATA = DATA / "_tmp"  # this dir is gitignored
TMPDATA.mkdir(parents=True, exist_ok=True)


In the simplest case, the web server is hosting files that you will download and consume right away. All you need to do is query the server for the right address or URL (Uniform Resource Locator). You do this all the time when you browse the internet, and you can also do it with Python!

For example, let’s get this kinase-related CSV dataset from GitHub, which contains a list of kinases and their identifiers.

Tip: Whenever you want to download a file hosted in GitHub, use the Raw button to obtain the downloadable URL!

While Python provides a standard library to deal with HTTP queries (urllib), people often prefer the third-party requests library because its usage is much simpler.

[2]:

import requests

url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
response = requests.get(url)
response.raise_for_status()
response

# NBVAL_CHECK_OUTPUT

[2]:

<Response [200]>


When you use requests.get(...) you obtain a Response object. This is not the file you want to download, but an object that wraps the HTTP query and the response the server gave you. Before we inspect the content, we always call .raise_for_status(), which will raise an exception if the server told us that the request could not be fulfilled. How does the server do that? With HTTP status codes, a 3-digit number. There are several, but the most common ones are:

• 200: Everything OK!

• 404: Not found. The requested URL does not exist on the server.

• 500: Server error.

.raise_for_status() will raise an exception if your response came back with an error code (4xx or 5xx). As such, it's a good practice to call it after every query!
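If you prefer handling the failure yourself instead of letting the exception propagate, wrap the call in try/except. A minimal offline sketch, building a bare Response object by hand (in real code it comes from requests.get) just to trigger the error path:

```python
import requests
from requests.models import Response

# Build a Response by hand so the example runs without a network connection
response = Response()
response.status_code = 404

try:
    response.raise_for_status()
except requests.HTTPError as error:
    print("The request failed:", error)
```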

See this example of a bad URL: there's no TXT file at that location, just a CSV, so the server responds with an error.

[3]:

# NBVAL_RAISES_EXCEPTION

# Same repository as before, but asking for a .txt file that does not exist there
response = requests.get(
"https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
)
response.raise_for_status()

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/tmp/ipykernel_22567/1358255193.py in <module>

~/.local/miniconda/envs/toc/lib/python3.9/site-packages/requests/models.py in raise_for_status(self)
951
952         if http_error_msg:
--> 953             raise HTTPError(http_error_msg, response=self)
954
955     def close(self):



Ok, now let’s get to the contents of the CSV file! Depending on what you are looking for, you will need one of these attributes:

• response.content: The bytes returned by the server.

• response.text: The contents of the file, as a string, if possible.

• response.json(): If the server returns JSON data (more on this later), this method will parse it and return the corresponding dictionary.

Which one should you use? If you want to display some text in the Notebook output, then go for .text. Everything that involves binary files (images, archives, PDFs…) or downloading to disk should use .content.
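As an illustration of the three attributes, here is an offline sketch using a hand-built Response whose internal payload we set directly (an assumption for demonstration only; in real usage requests fills it in from the server):

```python
from requests.models import Response

response = Response()
response.status_code = 200
# requests stores the raw payload internally; we set it by hand here
response._content = b'{"kinase": "ABL1"}'
response.encoding = "utf-8"

print(response.content)  # b'{"kinase": "ABL1"}'  (raw bytes)
print(response.text)     # '{"kinase": "ABL1"}'   (decoded string)
print(response.json())   # {'kinase': 'ABL1'}     (parsed dictionary)
```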

Since this is a CSV file, we know it's plain text, so we can use the usual Python methods on it! Let's print the first 10 lines:

[4]:

print(*response.text.splitlines()[:10], sep="\n")

﻿xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
Trb1,Trb1,TRIB1,Tribbles homolog 1,CAMK,Trbl,,Q96RU8
BRSK2,BRSK2,BRSK2,Serine/threonine-protein kinase BRSK2,CAMK,CAMKL,BRSK,Q8IWQ3
Wnk2,Wnk2,WNK2,Serine/threonine-protein kinase WNK2,Other,WNK,,Q9Y3S1
AKT1,AKT1,AKT1,RAC-alpha serine/threonine-protein kinase,AGC,Akt,,P31749


Of course, you can save this to disk using the usual Python constructs. Since we are downloading, it’s recommended to use the raw bytes contents, not the text version! This means you should use response.content and open your file in bytes mode (the b in wb):

[5]:

with open(TMPDATA / "kinhub.csv", "wb") as f:
    f.write(response.content)


Open it again to check we wrote something.

[6]:

# We need the encoding="utf-8-sig" to ensure correct encoding
# under all platforms
with open(TMPDATA / "kinhub.csv", encoding="utf-8-sig") as f:
    # zip will stop iterating with the shortest iterator:
    # passing range(5) allows us to get just five lines ;)
    for _, line in zip(range(5), f):
        print(line.rstrip())

# NBVAL_CHECK_OUTPUT

xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705


Tip: If all you want is to download a CSV file and open it with pandas, just pass the raw URL to pandas.read_csv. It will download the file for you!

[7]:

import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
)
df.head(4)
# NBVAL_CHECK_OUTPUT

[7]:

xName Manning Name HGNC Name Kinase Name Group Family SubFamily UniprotID
0 ABL1 ABL ABL1 Tyrosine-protein kinase ABL1 TK Abl NaN P00519
1 ACK ACK TNK2 Activated CDC42 kinase 1 TK Ack NaN Q07912
2 ACTR2 ACTR2 ACVR2A Activin receptor type-2A TKL STKR STKR2 P27037
3 ACTR2B ACTR2B ACVR2B Activin receptor type-2B TKL STKR STKR2 Q13705

One note about file downloads. The method above downloads the whole file into memory, which can be a problem for very big files. If you intend to download a very large file, you can push it to disk directly using streaming requests and raw responses. As an example, let’s pretend this 1MB video is too big to fit in memory:

[8]:

from IPython.display import Video, display

# NOTE: the URL of the ~1MB example video was not preserved in this copy of
# the notebook; point `video_url` at any small MP4 served over HTTP.
video_url = "https://example.com/video.mp4"  # placeholder, not a real video

response = requests.get(video_url, stream=True)  # stream=True avoids loading it all into memory
response.raise_for_status()

with open(TMPDATA / "video.mp4", "wb") as tmp:
    for chunk in response.iter_content(chunk_size=8192):
        tmp.write(chunk)

# Let's play the movie in Jupyter!
# Paths passed to widgets need to be relative to notebook or they will 404 :)
display(Video(Path(tmp.name).relative_to(HERE)))


### Accessing dynamically generated content

So far, we have been able to retrieve files that were present on a remote server. To do that, we used requests.get and a URL that points to the file.

Well, it turns out that the same technique works for many more types of content! What the server does with the URL is not our concern: whether it simply serves a file from disk or queries a database and assembles different parts into the returned content does not matter at all.

That concept alone is extremely powerful, as you will see now. Remember: We just need to make sure we request the correct URL!

Let's work on something fun now! The spike protein of SARS-CoV-2 has lately been one of the most studied proteins. Can we get some information about it from UniProt using requests? Its UniProt ID is P0DTC2. Go check the entry with your browser first.

One of the things UniProt provides is the amino acid sequence of the listed protein. Scroll down the entry page until you reach the sequence section.

Do you think we can get only the sequence using Python? Let’s see!

To query a protein, you simply need to add its UniProt ID to the URL.

[9]:

r = requests.get("https://www.uniprot.org/uniprot/P0DTC2")
r.raise_for_status()
print(r.text[:5000])

<!DOCTYPE html SYSTEM "about:legacy-compat">
var BASE = '/';
</script><script src="/js-compr.js2021_03" type="text/javascript"></script><script type="text/javascript">
uniprot.isInternal = false;
uniprot.namespace = 'uniprot';
uniprot.releasedate = '2021_03';
</script><script type="text/javascript">
;
// variable to store annotation data
var annotations = [];
var entryId = 'P0DTC2';
var isObsolete = false || !true;
</script><meta content="attaches the virion to the cell membrane by interacting with host receptor, initiating the infection. Binding to human ACE2 receptor and internalization of the virus into the endosomes of the host cell induces conformational changes in the Spike glycoprotein (PubMed:32142651, PubMed:32221306, PubMed:32075877, PubMed:32155444). Binding to host NRP1 and NRP2 via C-terminal polybasic sequence enhances virion entry into host cell (PubMed:33082294, PubMed:33082293). This interaction may explain virus tropism of human olfactory epithelium cells, which express high level of NRP1 and NRP2 but low level of ACE2 (PubMed:33082293). The stalk domain of S contains three hinges, giving the head unexpected orientational freedom (PubMed:32817270). Uses human TMPRSS2 for priming in human lung cells which is an essential step for viral entry (PubMed:32142651). Can be alternatively processed by host furin (PubMed:32362314). Proteolysis by cathepsin CTSL may unmask the fusion peptide of S2 and activate membranes fusion within endosomes." name="description"/><meta content="nositelinkssearchbox" name="google"/><script defer="defer" src="https://d3js.org/d3.v4.min.js"></script><script defer="defer" src="https://cdn.jsdelivr.net/npm/protvista-uniprot@latest/dist/protvista-uniprot.js"></script><script defer="defer" src="https://cdn.jsdelivr.net/npm/interaction-viewer@latest/dist/interaction-viewer.js"></script></head><body class="namespace-uniprot" typeof="WebPage" prefix="up: http://purl.uniprot.org/core/" vocab="http://schema.org/"><span id="evidenceToolTip" style="display:none">&#xd;
&lt;p>An evidence describes the source of an annotation, e.g. an experiment that has been published in the scientific literature, an orthologous protein, a record from another database, etc.&lt;/p>&#xd;
&#xd;
&lt;p>&lt;a href="/manual/evidences">More...&lt;/a>&lt;/p>&#xd;
</span><p style="display:none"><a accesskey="2" href="#content">Skip Header</a></p><div id="masthead-container"><div class="masthead" id="local-masthead"><div id="local-title"><a id="logo" accesskey="1" href="/"><img alt="" src="/images/logos/Logo_medium.png" title="UniProt home"/></a></div><div class="namespace-uniprot" id="local-search"><form method="get" action="/uniprot" id="search-form"><div id="namespace-background"><div class="searchBoxIndicator" style="display:none" id="searchBoxIndicator1"> </div><div onclick="location.href=&apos;/help/text-search&apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator2"> </div><div onclick="location.href=&apos;/help/advanced_search &apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator3"> </div><a class="namespace-select" id="select-namespace" onclick="return false;" href=""><span class="caret_white" id="selected-namespace">UniProtKB</span></a><ul style="display:none" class="select-namespace-options"><a href="#" class="closeBox" id="closeNamespaceOptions">x</a><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniprot"><a class="namespace-option uniprot" href="#" id="uniprot">UniProtKB</a></h3><p>Protein knowledgebase</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_uniparc"><a class="namespace-option uniparc" href="#" id="uniparc">UniParc</a></h3><p>Sequence archive</p></li><li class="fixedHeight_namespaces"><h3 class="


Wow, what is all that noise? You are seeing the HTML content of the webpage! That’s the markup language web developers use to write webpages.

There are libraries to process HTML and extract the actual content (like BeautifulSoup; more below), but we will not need them here yet. Fortunately, UniProt provides alternative representations of the data.

Some formats are more convenient for programmatic use. If you click on Text you will see something different in your browser: just plain text! Also, notice how the URL is now different.

Just adding the .txt extension was enough to change the style. This is a nice feature UniProt provides. It mimics a file system, but it’s actually changing the representation of the returned content. Elegant! And more important, easier to use programmatically! Check it:

[10]:

r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.txt")
r.raise_for_status()
print(r.text[:1000])

ID   SPIKE_SARS2             Reviewed;        1273 AA.
AC   P0DTC2;
DT   22-APR-2020, integrated into UniProtKB/Swiss-Prot.
DT   22-APR-2020, sequence version 1.
DT   02-JUN-2021, entry version 8.
DE   RecName: Full=Spike glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE            Short=S glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE   AltName: Full=E2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   AltName: Full=Peplomer protein {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S1 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S2' {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Flags: Precursor;
GN   Name=S {ECO:0000255|HAMAP-Rule:MF_04099}; ORFNames=2;
OS   Severe acute respiratory syndrome coronavirus 2 (2019-nCoV) (SARS-CoV-2).
OC   Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
OC   Nidovirales; Cornidovirineae;


This is exactly what we see on our browser! Plain text is nice for these things. However, the sequence is all the way at the end of the file. To retrieve it, you need to get creative and analyze those little tags each line has. See how it begins with SQ and finishes with //:

SQ   SEQUENCE   1273 AA;  141178 MW;  B17BE6D9F1C4EA34 CRC64;
MFVFLVLLPL VSSQCVNLTT RTQLPPAYTN SFTRGVYYPD KVFRSSVLHS TQDLFLPFFS
NVTWFHAIHV SGTNGTKRFD NPVLPFNDGV YFASTEKSNI IRGWIFGTTL DSKTQSLLIV
NNATNVVIKV CEFQFCNDPF LGVYYHKNNK SWMESEFRVY SSANNCTFEY VSQPFLMDLE
GKQGNFKNLR EFVFKNIDGY FKIYSKHTPI NLVRDLPQGF SALEPLVDLP IGINITRFQT
LLALHRSYLT PGDSSSGWTA GAAAYYVGYL QPRTFLLKYN ENGTITDAVD CALDPLSETK
CTLKSFTVEK GIYQTSNFRV QPTESIVRFP NITNLCPFGE VFNATRFASV YAWNRKRISN
YNYKLPDDFT GCVIAWNSNN LDSKVGGNYN YLYRLFRKSN LKPFERDIST EIYQAGSTPC
NGVEGFNCYF PLQSYGFQPT NGVGYQPYRV VVLSFELLHA PATVCGPKKS TNLVKNKCVN
FNFNGLTGTG VLTESNKKFL PFQQFGRDIA DTTDAVRDPQ TLEILDITPC SFGGVSVITP
GTNTSNQVAV LYQDVNCTEV PVAIHADQLT PTWRVYSTGS NVFQTRAGCL IGAEHVNNSY
ECDIPIGAGI CASYQTQTNS PRRARSVASQ SIIAYTMSLG AENSVAYSNN SIAIPTNFTI
SVTTEILPVS MTKTSVDCTM YICGDSTECS NLLLQYGSFC TQLNRALTGI AVEQDKNTQE
VFAQVKQIYK TPPIKDFGGF NFSQILPDPS KPSKRSFIED LLFNKVTLAD AGFIKQYGDC
LGDIAARDLI CAQKFNGLTV LPPLLTDEMI AQYTSALLAG TITSGWTFGA GAALQIPFAM
QMAYRFNGIG VTQNVLYENQ KLIANQFNSA IGKIQDSLSS TASALGKLQD VVNQNAQALN
TLVKQLSSNF GAISSVLNDI LSRLDKVEAE VQIDRLITGR LQSLQTYVTQ QLIRAAEIRA
SANLAATKMS ECVLGQSKRV DFCGKGYHLM SFPQSAPHGV VFLHVTYVPA QEKNFTTAPA
ICHDGKAHFP REGVFVSNGT HWFVTQRNFY EPQIITTDNT FVSGNCDVVI GIVNNTVYDP
LQPELDSFKE ELDKYFKNHT SPDVDLGDIS GINASVVNIQ KEIDRLNEVA KNLNESLIDL
QELGKYEQYI KWPWYIWLGF IAGLIAIVMV TIMLCCMTSC CSCLKGCCSC GSCCKFDEDD
SEPVLKGVKL HYT
//


Hence, you could do something like this:

[11]:

sequence_block = False
lines = []
for line in r.text.splitlines():
    if line.startswith("SQ"):
        sequence_block = True
        continue  # skip the SQ header line itself
    elif line.startswith("//"):
        sequence_block = False

    if sequence_block:
        line = line.strip()  # delete spaces and newlines at the beginning and end of the line
        line = line.replace(" ", "")  # delete spaces in the middle of the line
        lines.append(line)

print("This is your sequence:", "".join(lines))

# NBVAL_CHECK_OUTPUT

This is your sequence: MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT


Ta-da! We got it! It required some processing, but it works… However, you should always wonder if there's an easier way. Given that UniProt has a nice way of providing the text representation, how come they don't offer a URL that returns only the sequence for a given UniProt ID? Well, they do! Just change .txt to .fasta: https://www.uniprot.org/uniprot/P0DTC2.fasta

[12]:

r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

>sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT



This is returned in FASTA format, a common syntax in bioinformatics. You could use established libraries like BioPython to parse it too!

[13]:

from Bio import SeqIO
from tempfile import NamedTemporaryFile
import os

# Write response into a temporary text file
with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
    tmp.write(r.text)

# Create the BioPython object for sequence data
sequence = SeqIO.read(tmp.name, "fasta")

# Delete temporary file now that we have read it
os.remove(tmp.name)

print(sequence.description)
print(sequence.seq)

# NBVAL_CHECK_OUTPUT

sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1


All these ways to access different representations or sections of the data contained in UniProt constitute a URL-based API (Application Programming Interface). The foundational principle is that the URL contains all the parameters needed to ask the server for a specific type of content. Yes, you read that correctly: parameters. If you think about it, a URL specifies two parts: the machine you are connecting to and the page on that machine you want to access. When the page part is missing, the server assumes you are asking for index.html or equivalent.
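The standard library can split a URL into exactly those two parts; a quick sketch with urllib.parse:

```python
from urllib.parse import urlparse

parts = urlparse("https://www.uniprot.org/uniprot/P0DTC2.fasta")
print(parts.netloc)  # www.uniprot.org -> the machine you are connecting to
print(parts.path)    # /uniprot/P0DTC2.fasta -> the "page", which here acts as a parameter
```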

Let’s compare it to a command-line interface:

# this is your browser
https://www.uniprot.org/uniprot/P0DTC2.fasta

# this is your terminal
$ uniprot --id=P0DTC2 --format=FASTA


Each part of the URL can be considered a positional argument! So, if you want the sequence of a different protein, just put its UniProt ID in the URL and you are done! For example, P00519 is the ID for the ABL1 kinase.

[14]:

r = requests.get("https://www.uniprot.org/uniprot/P00519.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT

>sp|P00519|ABL1_HUMAN Tyrosine-protein kinase ABL1 OS=Homo sapiens OX=9606 GN=ABL1 PE=1 SV=4
MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSE
NDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVN
SLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTAS
DITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
LLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFI
DVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNP
SDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAE
HRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLF
KPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSAS
CVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTV
ALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPP
PPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVL
PATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPSSTAFIPLISTRVSLRKTRQPPE
RIASGAITKGVVLDSTEALCLAISRNSEQMASHSAVLEAGKNLYTFCVSYVDSIQQMRNK
FAFREAINKLENNLRELQICPATAGSGPAATQDFSKLLSSVKEISDIVQR



What if we parameterize the URL with an f-string and provide a function to make it super Pythonic? Even better, what if we provide the Bio.SeqIO parsing functionality too?

[15]:

def sequence_for_uniprot_id(uniprot_id):
    """
    Returns the FASTA sequence of a given UniProt ID using
    the UniProt URL-based API.

    Parameters
    ----------
    uniprot_id : str

    Returns
    -------
    Bio.SeqIO.SeqRecord
    """
    #                                                  ⬇ this is the key part!
    r = requests.get(f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta")
    r.raise_for_status()

    with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
        tmp.write(r.text)

    sequence = SeqIO.read(tmp.name, "fasta")
    os.remove(tmp.name)

    return sequence


Now you can use it for any UniProt ID. This is for the Src kinase:

[16]:

sequence = sequence_for_uniprot_id("P12931")
print(sequence)

# NBVAL_CHECK_OUTPUT

ID: sp|P12931|SRC_HUMAN
Name: sp|P12931|SRC_HUMAN
Description: sp|P12931|SRC_HUMAN Proto-oncogene tyrosine-protein kinase Src OS=Homo sapiens OX=9606 GN=SRC PE=1 SV=3
Number of features: 0


Congratulations! You have used your first online API in Python and adapted it to a workflow!

### Programmatic interfaces

What UniProt does with their URLs is one way of providing access to their database, i.e., through specific URL schemes. However, if each web service came up with its own scheme, developers would need to figure out which scheme each website uses and then implement, adapt, or customize their scripts on a case-by-case basis. Fortunately, there are some standardized ways of providing programmatic access to online resources. Some of them include:

• HTTP-based RESTful APIs (wiki)

• GraphQL

• SOAP

• gRPC

In this talktorial, we will focus on the first one, REST.

#### HTTP-based RESTful APIs

This type of programmatic access defines a specific entry point for clients (scripts, libraries, programs) that require programmatic access, something like api.webservice.com. This is usually different from the website itself (webservice.com). They can be versioned, so the provider can update the scheme without disrupting existing implementations (api.webservice.com/v1 will still work even when api.webservice.com/v2 has been deployed).

This kind of API is usually accompanied by well-written documentation explaining all the available actions in the platform. For example, look at the KLIFS API documentation. KLIFS is a database of kinase targets and small compound inhibitors. You can see how every argument and option is documented, along with usage examples.

If you want to list all the kinase groups available in KLIFS, you need to access this URL:

https://klifs.net/api/kinase_groups


[
"AGC",
"CAMK",
"CK1",
"CMGC",
"Other",
"STE",
"TK",
"TKL"
]


This response happens to be JSON-formatted! This is easily parsed into a Python object using the json library. The best news is that you don’t even need that. Using requests, the following operation can be done in three lines thanks to the .json() method:

[17]:

import requests

response = requests.get("https://klifs.net/api/kinase_groups")
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

[17]:

['AGC', 'CAMK', 'CK1', 'CMGC', 'Other', 'STE', 'TK', 'TKL']


That’s a Python list!

[18]:

result[0]

[18]:

'AGC'


Let's see if we can get all the kinase families contained in a specific group. Reading the documentation, it looks like we need this kind of URL:

https://klifs.net/api/kinase_families?kinase_group={{ NAME }}


What follows after the ? symbol is the query. It’s formatted with a key-value syntax like this: key=value. Multiple parameters can be expressed with &:

https://api.webservice.com/some/endpoint?parameter1=value1&parameter2=value2
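The standard library can build and parse such query strings for you; a small sketch with urllib.parse:

```python
from urllib.parse import urlencode, parse_qs

# Build a query string from key-value pairs
query = urlencode({"parameter1": "value1", "parameter2": "value2"})
print(query)  # parameter1=value1&parameter2=value2

# ...and parse one back into a dictionary (values come back as lists,
# because a key may appear more than once in a query string)
print(parse_qs(query))  # {'parameter1': ['value1'], 'parameter2': ['value2']}
```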


Let's see the returned object for the tyrosine kinase (TK) group: kinase_group=TK

[19]:

response = requests.get("https://klifs.net/api/kinase_families?kinase_group=TK")
response.raise_for_status()
result = response.json()
result

[19]:

['ALK',
'Abl',
'Ack',
'Alk',
'Axl',
'CCK4',
'Csk',
'DDR',
'EGFR',
'Eph',
'FAK',
'FGFR',
'Fer',
'InsR',
'JakA',
'JakB',
'Lmr',
'Met',
'Musk',
'PDGFR',
'Ret',
'Ror',
'Ryk',
'Sev',
'Src',
'Syk',
'TK-Unique',
'Tec',
'Tie',
'Trk',
'VEGFR']


Since passing parameters in the URL is a very common task, requests provides a more convenient way. This will save you from building the URLs manually or URL-escaping the values yourself. The key idea is to pass the key-value pairs as a dictionary. The previous query can be (and should be, if you ask us) done like this:

[20]:

response = requests.get("https://klifs.net/api/kinase_families", params={"kinase_group": "TK"})
# You can see how requests formatted the URL for you
print("Queried", response.url)
response.raise_for_status()
result = response.json()
result

Queried https://klifs.net/api/kinase_families?kinase_group=TK

[20]:

['ALK',
'Abl',
'Ack',
'Alk',
'Axl',
'CCK4',
'Csk',
'DDR',
'EGFR',
'Eph',
'FAK',
'FGFR',
'Fer',
'InsR',
'JakA',
'JakB',
'Lmr',
'Met',
'Musk',
'PDGFR',
'Ret',
'Ror',
'Ryk',
'Sev',
'Src',
'Syk',
'TK-Unique',
'Tec',
'Tie',
'Trk',
'VEGFR']


Sometimes the returned JSON object is not a list, but a dict. Or a combination of dictionaries and lists. Maybe even nested! You can still access them using the Python tools you already know.
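As a sketch with made-up data (loosely modeled on the KLIFS response shown below), nested JSON structures are navigated with ordinary indexing once parsed:

```python
import json

# Hypothetical nested response: a list containing a dict that contains
# another dict, which in turn holds a list
payload = json.loads(
    '[{"kinase_ID": 22, "uniprot": "Q96GX5", "structures": {"pdb": ["1ABC"]}}]'
)

print(payload[0]["uniprot"])               # Q96GX5
print(payload[0]["structures"]["pdb"][0])  # 1ABC
```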

For example, the kinase_information endpoint requires a numeric ID, and will return a lot of information on a single kinase:

[21]:

response = requests.get("https://klifs.net/api/kinase_information", params={"kinase_ID": 22})
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT

[21]:

[{'kinase_ID': 22,
'name': 'MASTL',
'HGNC': 'MASTL',
'family': 'MAST',
'group': 'AGC',
'kinase_class': 'MASTL',
'species': 'Human',
'full_name': 'microtubule associated serine/threonine kinase like',
'uniprot': 'Q96GX5',
'iuphar': 0,
'pocket': 'KPISRGAFGKVYLYAVKVVQVQAERDALALSKPFIVHLYYSYLVMEYLIGGDVKSLLHIYLHRHGIIHRDLKPDNMLILTDFGLS'}]


If you want to know the UniProt ID for this kinase, you will need to access the first (and only) element in the returned list, and ask for the value of the uniprot key:

[22]:

result[0]["uniprot"]

[22]:

'Q96GX5'


It turns out we can use this to get the full sequence of the protein (and not just the pocket sequence) using our UniProt function from before!

[23]:

mastl = sequence_for_uniprot_id(result[0]["uniprot"])
print(mastl.seq)

# NBVAL_CHECK_OUTPUT

MDPTAGSKKEPGGGAATEEGVNRIAVPKPPSIEEFSIVKPISRGAFGKVYLGQKGGKLYAVKVVKKADMINKNMTHQVQAERDALALSKSPFIVHLYYSLQSANNVYLVMEYLIGGDVKSLLHIYGYFDEEMAVKYISEVALALDYLHRHGIIHRDLKPDNMLISNEGHIKLTDFGLSKVTLNRDINMMDILTTPSMAKPRQDYSRTPGQVLSLISSLGFNTPIAEKNQDPANILSACLSETSQLSQGLVCPMSVDQKDTTPYSSKLLKSCLETVASNPGMPVKCLTSNLLQSRKRLATSSASSQSHTFISSVESECHSSPKWEKDCQESDEALGPTMMSWNAVEKLCAKSANAIETKGFNKKDLELALSPIHNSSALPTTGRSCVNLAKKCFSGEVSWEAVELDVNNINMDTDTSQLGFHQSNQWAVDSGGISEEHLGKRSLKRNFELVDSSPCKKIIQNKKTCVEYKHNEMTNCYTNQNTGLTVEVQDLKLSVHKSQQNDCANKENIVNSFTDKQQTPEKLPIPMIAKNLMCELDEDCEKNSKRDYLSSSFLCSDDDRASKNISMNSDSSFPGISIMESPLESQPLDSDRSIKESSFEESNIEDPLIVTPDCQEKTSPKGVENPAVQESNQKMLGPPLEVLKTLASKRNAVAFRSFNSHINASNNSEPSRMNMTSLDAMDISCAYSGSYPMAITPTQKRRSCMPHQQTPNQIKSGTPYRTPKSVRRGVAPVDDGRILGTPDYLAPELLLGRAHGPAVDWWALGVCLFEFLTGIPPFNDETPQQVFQNILKRDIPWPEGEEKLSDNAQSAVEILLTIDDTKRAGMKELKRHPLFSDVDWENLQHQTMPFIPQPDDETDTSYFEARNTAQHLTVSGFSL


We are using two webservices together, awesome!

#### Generating a client for any API

Did you find that convenient? Well, we are not done yet! You might have noticed that all the endpoints in the KLIFS API have a similar pattern. You specify the name of the endpoint (kinase_groups, kinase_families, kinase_information, …), pass some (optional) parameters if needed, and then get a JSON-formatted response. Is there a way you can avoid having to format the URLs yourself? The answer is… yes!

The REST API schema can be expressed in a machine-readable document called a Swagger/OpenAPI definition, which makes it possible to dynamically generate a Python client for any REST API that implements the Swagger/OpenAPI specification. This is the one for KLIFS.
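To give you an idea of what such a definition looks like, here is a heavily abridged, hypothetical excerpt in the style of a Swagger 2.0 document (the real KLIFS `swagger.json` is much larger and its exact contents may differ):

```json
{
  "swagger": "2.0",
  "basePath": "/api",
  "paths": {
    "/kinase_information": {
      "get": {
        "operationId": "get_kinase_information",
        "parameters": [
          {
            "name": "kinase_ID",
            "in": "query",
            "type": "array",
            "items": {"type": "integer"}
          }
        ],
        "responses": {
          "200": {"description": "An array of kinase details"}
        }
      }
    }
  }
}
```

A client generator only needs this document to know which endpoints exist, which parameters they take, and what they return.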

Of course, there are libraries for doing that in Python, like bravado.

[24]:

from bravado.client import SwaggerClient

KLIFS_SWAGGER = "https://klifs.net/swagger/swagger.json"
client = SwaggerClient.from_url(KLIFS_SWAGGER, config={"validate_responses": False})
client

[24]:

SwaggerClient(https://klifs.net/api)


Then, you can have fun inspecting the client object for all the API actions as methods.

Tip: Type client. and press Tab to inspect the client in this notebook.

[25]:

client.Information.get_kinase_names?

Signature:      client.Information.get_kinase_names(**op_kwargs)
Type:           CallableOperation
String form:    <bravado.client.CallableOperation object at 0x7f6ffebd6ca0>
Docstring:
[GET] Kinase names

The Kinase names endpoint returns a list of all available kinases in KLIFS according using the HGNC gene symbols. When a kinase group or kinase family is specified only those kinase names that are within that kinase group or kinase family are returned. When both a group and a family are specified, only the family is used to process the request.

:param kinase_group: Optional: Name (or multiple names separated by a comma) of the kinase group for which the kinase families are requested (e.g. TKL,STE). (optional)
:type kinase_group: string
:param kinase_family: Optional: Name (or multiple names separated by a comma) of the kinase family for which the kinase names are requested (e.g. AUR,WEE). (optional)
:type kinase_family: string
:param species: Optional: Species for which the kinase names are requested (e.g. HUMAN OR MOUSE). (optional)
:type species: string
:returns: 200: An array of IDs and kinase names
:rtype: array:#/definitions/IDlist
:returns: default: Unexpected error
:rtype: #/definitions/Error
Call docstring:
Invoke the actual HTTP request and return a future.

:rtype: :class:bravado.http_future.HTTPFuture



bravado is auto-generating classes and functions that mirror the API we were using before! How cool is that? The same query can now be done without requests.

[26]:

client.Information.get_kinase_information(kinase_ID=[22])

[26]:

<bravado.http_future.HttpFuture at 0x7f6ffeb36670>


Note that bravado does not return the response right away. Instead, it returns a future: a promise to deliver the result when you ask for it. This makes it usable in asynchronous programming, but for our purposes it means that you need to call .result() to obtain the actual response.
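The future/promise pattern itself is not bravado-specific. Python's standard library offers the same idea, which is a handy way to understand what .result() does. The sketch below uses a hypothetical slow_query function as a stand-in for a network call; it is not a real KLIFS request:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_query(kinase_id):
    """Stand-in for a slow network call (hypothetical, not a real KLIFS request)."""
    time.sleep(0.1)
    return {"kinase_ID": kinase_id, "name": "MASTL"}


with ThreadPoolExecutor() as executor:
    future = executor.submit(slow_query, 22)  # returns immediately, work runs in the background
    # ... other work could happen here while the query is in flight ...
    result = future.result()  # blocks until the value is ready

print(result["name"])  # MASTL
```

Just like the bravado HttpFuture, the work starts immediately, but you only block when you call `.result()`.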

[27]:

results = client.Information.get_kinase_information(kinase_ID=[22]).result()
result = results[0]
result

[27]:

KinaseInformation(HGNC='MASTL', family='MAST', full_name='microtubule associated serine/threonine kinase like', group='AGC', iuphar=0, kinase_ID=22, kinase_class='MASTL', name='MASTL', pocket='KPISRGAFGKVYLYAVKVVQVQAERDALALSKPFIVHLYYSYLVMEYLIGGDVKSLLHIYLHRHGIIHRDLKPDNMLILTDFGLS', species='Human', uniprot='Q96GX5')

[28]:

result.uniprot

# NBVAL_CHECK_OUTPUT

[28]:

'Q96GX5'


bravado also builds result objects for you, so you can use the result.property syntax instead of result["property"]. Some more convenience for the end user ;)

### Document parsing¶

Sometimes the web service will not provide a standardized API that produces machine-readable documents. Instead, you will have to use the regular webpage and parse through the HTML code to obtain the information you need. This is called (web) scraping, which usually involves finding the right HTML tags and IDs that contain the valuable data (ignoring things such as the sidebars, top menus, footers, ads, etc).

In scraping, you basically do two things:

1. Access the webpage with requests and obtain the HTML contents.

2. Parse the HTML string with BeautifulSoup or requests-html.
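These two steps can be sketched on a tiny inline document, with no network access needed. The HTML snippet below stands in for the page you would download in step 1 (the tags and the id="data" attribute are made up for illustration):

```python
from bs4 import BeautifulSoup

# A minimal HTML snippet standing in for step 1's downloaded page
html_doc = """
<html><body>
  <div id="sidebar">ignore me</div>
  <table id="data">
    <tr><td>Ala</td><td>89.09</td></tr>
    <tr><td>Gly</td><td>75.07</td></tr>
  </table>
</body></html>
"""

# Step 2: parse the string and pull out only the table cells we care about,
# ignoring everything else on the page (here, the sidebar)
soup = BeautifulSoup(html_doc, "html.parser")
rows = [
    [cell.text for cell in row.find_all("td")]
    for row in soup.find("table", id="data").find_all("tr")
]
print(rows)  # [['Ala', '89.09'], ['Gly', '75.07']]
```

The real-world version below works exactly the same way, except that step 1 fetches the HTML from Wikipedia with requests.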

Let’s parse the proteinogenic amino acids table in this Wikipedia article:

[29]:

import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

r = requests.get("https://en.wikipedia.org/wiki/Proteinogenic_amino_acid")
r.raise_for_status()

# To guess the correct steps here, you will have to inspect the HTML code by hand
# Tip: use right-click + inspect content in any webpage to land in the HTML definition ;)
html = BeautifulSoup(r.text, "html.parser")
table = html.find("table")  # the amino acid table is the first table on the page
table_body = table.find("tbody")

data = []
for row in table_body.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        data.append([])
        for cell in cells:
            cell_content = cell.text.strip()
            try:  # convert to float if possible
                cell_content = float(cell_content)
            except ValueError:
                pass
            data[-1].append(cell_content)

# Empty fields are denoted with "?" which casts respective columns to object types
# (here mix of strings and floats) but we want float64, therefore replace "?" with NaN values
pd.DataFrame.from_records(data).replace("?", np.nan)

# NBVAL_CHECK_OUTPUT

[29]:

0 1 2 3 4 5
0 A Ala 89.09404 6.01 2.35 9.87
1 C Cys 121.15404 5.05 1.92 10.70
2 D Asp 133.10384 2.85 1.99 9.90
3 E Glu 147.13074 3.15 2.10 9.47
4 F Phe 165.19184 5.49 2.20 9.31
5 G Gly 75.06714 6.06 2.35 9.78
6 H His 155.15634 7.60 1.80 9.33
7 I Ile 131.17464 6.05 2.32 9.76
8 K Lys 146.18934 9.60 2.16 9.06
9 L Leu 131.17464 6.01 2.33 9.74
10 M Met 149.20784 5.74 2.13 9.28
11 N Asn 132.11904 5.41 2.14 8.72
12 O Pyl 255.31000 NaN NaN NaN
13 P Pro 115.13194 6.30 1.95 10.64
14 Q Gln 146.14594 5.65 2.17 9.13
15 R Arg 174.20274 10.76 1.82 8.99
16 S Ser 105.09344 5.68 2.19 9.21
17 T Thr 119.12034 5.60 2.09 9.10
18 U Sec 168.05300 5.47 1.91 10.00
19 V Val 117.14784 6.00 2.39 9.74
20 W Trp 204.22844 5.89 2.46 9.41
21 Y Tyr 181.19124 5.64 2.20 9.21
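The scraped DataFrame uses numeric column labels (0–5). If you want named columns, you can assign them yourself. The names below are an assumption based on my reading of the Wikipedia table layout at the time of writing (one-letter code, three-letter code, molecular mass, pI, pKa values), so double-check them against the live page, which may change:

```python
import numpy as np
import pandas as pd

# Two illustrative rows in the same shape as the scraped `data` list above
data = [
    ["A", "Ala", 89.09404, 6.01, 2.35, 9.87],
    ["O", "Pyl", 255.31, "?", "?", "?"],
]

df = pd.DataFrame.from_records(data).replace("?", np.nan)
# Assumed column names, read off the Wikipedia table by hand -- verify against
# the live page before relying on them
df.columns = ["one_letter", "three_letter", "mass", "pI", "pKa1", "pKa2"]
```

Named columns make downstream selections such as `df["pI"]` much more readable than `df[3]`.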

If you want to get an image, you need to find img tags and retrieve their src attribute.

[30]:

from IPython.display import Image

display(Image(f'https:{html.find("img")["src"]}'))


### Browser remote control¶

The trend some years ago was to build servers that dynamically generate HTML documents with some JavaScript here and there (such as Wikipedia). In other words, the HTML is built on the server and sent to the client (your browser).

However, the latest trend is towards applications built entirely with client-side JavaScript frameworks. This means that the HTML content is generated dynamically in the client. Traditional parsing will not work: you would only download the placeholder HTML that bootstraps the JavaScript application. To work around this, the page must first be rendered with a JavaScript engine.

We won’t cover this in the current notebook, but if you are interested, check out projects such as Selenium (browser automation) or requests-html (which can render JavaScript).

## Discussion¶

In this talktorial you have seen several methods to programmatically access online web services from a Python interpreter. Leveraging these techniques, you can build automated pipelines inside Jupyter notebooks. In the end, querying a database or downloading a file involves the same kind of tooling.

Unfortunately, there is too much material to cover about web APIs in a single lesson. For example, how do you send or upload content from Python? Can you submit forms? If you are interested in knowing more, the requests documentation should be your go-to resource; its sections on POST requests, custom headers, and authentication are a good starting point.
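As a small taste of the kind of thing the requests documentation covers, here is how a form submission can be built. The sketch uses requests.Request and .prepare() so that the request can be inspected without actually contacting a server; the httpbin.org URL is just a placeholder echo service:

```python
import requests

# Build (but do not send) a form-encoded POST request
req = requests.Request(
    "POST",
    "https://httpbin.org/post",  # placeholder URL, not a real drug design service
    data={"kinase": "MASTL", "species": "Human"},  # form fields
)
prepared = req.prepare()

print(prepared.method)                   # POST
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
print(prepared.body)                     # kinase=MASTL&species=Human
```

To actually send it, you would pass the prepared request to a session, e.g. `requests.Session().send(prepared)`; in everyday code the shortcut `requests.post(url, data=...)` does all of this in one call.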

## Quiz¶

• Use the KLIFS API (with or without bravado, up to you) to find all kinases that can bind staurosporine (ligand code STU).

• How can you find the correct HTML tags and identifiers to scrape a specific part of a website? Can it be automated?

• Would you rather use programmatic APIs or manually crafted scrapers?