T011 · Querying online API webservices

Note: This talktorial is a part of TeachOpenCADD, a platform that aims to teach domain-specific skills and to provide pipeline templates as starting points for research projects.

Authors:

Aim of this talktorial

In this notebook, you will learn how to programmatically use online web-services from Python, in the context of drug design. By the end of this talktorial, you will be familiar with REST services and web scraping.

Contents in Theory

  • Data access from a server-side perspective

Contents in Practical

  • Downloading static files

  • Accessing dynamically generated content

  • Programmatic interfaces

  • Document parsing

  • Browser remote control

References

This guide is very practical and omits some technical definitions for the sake of clarity. However, you should be familiar with some basic terminology to fully understand what is going on behind the scenes.

Theory

The internet is a collection of connected computers that exchange data. In practice, you query machines (servers) with certain parameters to retrieve specific data. That data will be either:

  • A. Served straight away, since the server is simply a repository of files. E.g. you can download the ChEMBL database dump from their servers.

  • B. Retrieved from a database and formatted in a particular way. The result you see on your browser is either:

    • B1. Pre-processed on the server, e.g. the HTML page you see when you visit any article in Wikipedia.

    • B2. Dynamically generated on the client (your browser) as you use the website, e.g. Twitter, Facebook, or any modern web-app.

  • C. Computed through the execution of one or more programs on the server-side, e.g. estimating the protonation states of a protein-ligand complex using Protoss.

In a way, configuration C is a special type of B1. You are just replacing the type of task that runs on the server: database querying and HTML rendering vs. computations that process your query and return data formatted in a domain-specific way.

Another way of categorizing online services is by the format of the returned data. Most pages you see in your browser use HTML, which focuses on presenting data in a human-readable way. However, some servers structure that data in a machine-readable way. Such data can be processed reliably because it’s formatted using a consistent set of rules that can be easily encoded in a program. Such programs are usually called parsers. HTML can be labeled in such a way that data can be obtained reliably, but it is not designed with that purpose in mind. As a result, we will usually prefer services that provide machine-readable formats, like JSON, CSV or XML.
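To make the notion of a parser concrete, here is a minimal sketch that uses only the Python standard library. The two small payloads are written by hand for illustration (they reuse kinase names and UniProt IDs that appear later in this notebook) and are not real server responses.

```python
import csv
import io
import json

# Hypothetical payloads, written by hand for illustration only
json_payload = '[{"name": "ABL1", "uniprot": "P00519"}, {"name": "SRC", "uniprot": "P12931"}]'
csv_payload = "name,uniprot\nABL1,P00519\nSRC,P12931\n"

# JSON maps directly onto Python lists and dictionaries
records_from_json = json.loads(json_payload)

# CSV rows become dictionaries keyed by the header line
records_from_csv = list(csv.DictReader(io.StringIO(csv_payload)))

print(records_from_json[0]["uniprot"])
print(records_from_csv[1]["name"])
```

Both formats end up as the same list of dictionaries, which is the point of machine-readable data: the rules are strict enough that a generic parser can do all the work.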

In practice, both ways of data presentation (should) coexist in harmony. Modern web architecture strives to separate data retrieval tasks from end-user presentation. One popular implementation consists of using a programmatic endpoint that returns machine-readable JSON data, which is then consumed by the user-facing web application. The latter renders HTML, either on the server -option B1-, or on the user’s browser -option B2. Unfortunately, unlike the user-facing application, the programmatic endpoint (API) is not guaranteed to be publicly available, and is sometimes restricted to internal usage on the server side.

In the following sections, we will discuss how to make the most out of each type of online service using Python and some libraries!

Practical

[1]:
from pathlib import Path

HERE = Path(_dh[-1])
DATA = HERE / "data"
TMPDATA = DATA / "_tmp"  # this dir is gitignored
TMPDATA.mkdir(parents=True, exist_ok=True)

Downloading static files

In this case, the web server is hosting files that you will download and consume right away. All you need to do is to query the server for the right address or URL (Uniform Resource Locator). You do this all the time when you browse the internet, and you can also do it with Python!

For example, let’s get this kinase-related CSV dataset from GitHub, which contains a list of kinases and their identifiers.

Tip: Whenever you want to download a file hosted in GitHub, use the Raw button to obtain the downloadable URL!


While Python provides a library to deal with HTTP queries (urllib), people often prefer the third-party requests library because it is much simpler to use.

[2]:
import requests

url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
response = requests.get(url)
response.raise_for_status()
response

# NBVAL_CHECK_OUTPUT
[2]:
<Response [200]>

When you use requests.get(...) you obtain a Response object. This is not the file you want to download, but an object that wraps the HTTP query and the response the server gave you. Before we inspect the content, we always call .raise_for_status(), which will raise an exception if the server told us that the request could not be fulfilled. How does the server do that? With an HTTP status code, a 3-digit number. There are several, but the most common ones are:

  • 200: Everything OK!

  • 404: File not found.

  • 500: Server error.

.raise_for_status() will raise an exception if the response carries a 4xx (client error) or 5xx (server error) status code. As such, it’s good practice to call it after every query!
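The first digit of a status code tells you its family, and the 4xx and 5xx families are the ones that signal a failed request. As a small offline sketch, the standard library’s http.HTTPStatus can be used to look up the official phrase for each code. The helper status_family is a hypothetical name introduced here for illustration; it is not part of requests.

```python
from http import HTTPStatus


def status_family(code):
    """Classify an HTTP status code by its first digit."""
    families = {
        1: "informational",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return families[code // 100]


# The three codes listed above
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase, "->", status_family(code))
```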

See this example of a bad URL. It contains an error: there’s no TXT file at that address, just a CSV.

[3]:
# NBVAL_RAISES_EXCEPTION
bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
bad_response = requests.get(bad_url)
bad_response.raise_for_status()
bad_response
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
/tmp/ipykernel_22567/1358255193.py in <module>
      2 bad_url = "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt"
      3 bad_response = requests.get(bad_url)
----> 4 bad_response.raise_for_status()
      5 bad_response

~/.local/miniconda/envs/toc/lib/python3.9/site-packages/requests/models.py in raise_for_status(self)
    951
    952         if http_error_msg:
--> 953             raise HTTPError(http_error_msg, response=self)
    954
    955     def close(self):

HTTPError: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.txt

Ok, now let’s get to the contents of the CSV file! Depending on what you are looking for, you will need one of these attributes:

  • response.content: The bytes returned by the server.

  • response.text: The contents of the file, as a string, if possible.

  • response.json(): If the server returns JSON data (more on this later), this method will parse it and return the corresponding Python object (usually a dictionary or a list).

Which one should you use? If you want to display some text in the Notebook output, then go for .text. Everything that involves binary files (images, archives, PDFs…) or downloading to disk should use .content.

Since this is a CSV file, we know it’s plain text, so we can use the usual Python string methods on it! Let’s print the first 10 lines:

[4]:
print(*response.text.splitlines()[:10], sep="\n")
xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
ADCK4,ADCK4,ADCK4,Uncharacterized aarF domain-containing protein kinase 4,Atypical,ABC1,ABC1-A,Q96D53
Trb1,Trb1,TRIB1,Tribbles homolog 1,CAMK,Trbl,,Q96RU8
BRSK2,BRSK2,BRSK2,Serine/threonine-protein kinase BRSK2,CAMK,CAMKL,BRSK,Q8IWQ3
Wnk2,Wnk2,WNK2,Serine/threonine-protein kinase WNK2,Other,WNK,,Q9Y3S1
AKT1,AKT1,AKT1,RAC-alpha serine/threonine-protein kinase,AGC,Akt,,P31749

Of course, you can save this to disk using the usual Python constructs. Since we are downloading, it’s recommended to use the raw bytes contents, not the text version! This means you should use response.content and open your file in bytes mode (the b in wb):

[5]:
with open(TMPDATA / "kinhub.csv", "wb") as f:
    f.write(response.content)

Open it again to check we wrote something.

[6]:
# We need the encoding="utf-8-sig" to ensure correct encoding
# under all platforms
with open(TMPDATA / "kinhub.csv", encoding="utf-8-sig") as f:
    # zip stops iterating as soon as the shortest iterator is exhausted;
    # passing `range(5)` allows us to get just five lines ;)
    for _, line in zip(range(5), f):
        print(line.rstrip())

# NBVAL_CHECK_OUTPUT
xName,Manning Name,HGNC Name,Kinase Name,Group,Family,SubFamily,UniprotID
ABL1,ABL,ABL1,Tyrosine-protein kinase ABL1,TK,Abl,,P00519
ACK,ACK,TNK2,Activated CDC42 kinase 1,TK,Ack,,Q07912
ACTR2,ACTR2,ACVR2A,Activin receptor type-2A,TKL,STKR,STKR2,P27037
ACTR2B,ACTR2B,ACVR2B,Activin receptor type-2B,TKL,STKR,STKR2,Q13705
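Why utf-8-sig? Some tools save UTF-8 files with a leading byte order mark (BOM). The sketch below shows the difference between the two codecs on a hand-made byte string. It assumes a file that starts with a BOM; whether a given download actually does depends on how the file was saved. The utf-8-sig codec reads both variants correctly.

```python
import codecs

# Hypothetical file contents: a UTF-8 BOM followed by a CSV header
raw = codecs.BOM_UTF8 + "xName,Manning Name\n".encode("utf-8")

with_bom = raw.decode("utf-8")         # the BOM survives as '\ufeff'
without_bom = raw.decode("utf-8-sig")  # the BOM is stripped

print(repr(with_bom[:6]))
print(repr(without_bom[:6]))
```

With plain utf-8, the invisible BOM character would become part of the first column name, which is an easy source of confusing lookup errors.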

Tip: If all you want to do is download a CSV file to open it with Pandas, then just pass the raw URL to pandas.read_csv. It will download the file for you!

[7]:
import pandas as pd

df = pd.read_csv(
    "https://raw.githubusercontent.com/openkinome/kinodata/master/data/KinHubKinaseList.csv"
)
df.head()
# NBVAL_CHECK_OUTPUT
[7]:
xName Manning Name HGNC Name Kinase Name Group Family SubFamily UniprotID
0 ABL1 ABL ABL1 Tyrosine-protein kinase ABL1 TK Abl NaN P00519
1 ACK ACK TNK2 Activated CDC42 kinase 1 TK Ack NaN Q07912
2 ACTR2 ACTR2 ACVR2A Activin receptor type-2A TKL STKR STKR2 P27037
3 ACTR2B ACTR2B ACVR2B Activin receptor type-2B TKL STKR STKR2 Q13705
4 ADCK4 ADCK4 ADCK4 Uncharacterized aarF domain-containing protein... Atypical ABC1 ABC1-A Q96D53

One note about file downloads. The method above downloads the whole file into memory, which can be a problem for very big files. If you intend to download a very large file, you can write it to disk directly using a streaming request and consuming the response in chunks. As an example, let’s pretend this 1MB video is too big to fit in memory:

[8]:
import shutil
from IPython.display import Video

response = requests.get(
    "https://archive.org/download/SlowMotionFlame/slomoflame_512kb.mp4", stream=True
)
response.raise_for_status()

# Write the response to disk chunk by chunk instead of loading it all into memory
with open(TMPDATA / "video.mp4", "wb") as tmp:
    for chunk in response.iter_content(chunk_size=8192):
        tmp.write(chunk)

# Let's play the movie in Jupyter, now that the file has been written and closed!
# Paths passed to widgets need to be relative to notebook or they will 404 :)
display(Video(Path(tmp.name).relative_to(HERE)))
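The chunk-by-chunk idea is not specific to requests: it applies to any pair of file-like objects. In the following sketch, two in-memory buffers stand in for the network response and the file on disk, so it runs offline. The payload is made-up data, and shutil.copyfileobj is used here as one possible way to do the copy, not as a replacement for the cell above.

```python
import io
import shutil

# Stand-in for a large remote payload (hypothetical data: 1 MiB of zero bytes)
source = io.BytesIO(bytes(1024 * 1024))
destination = io.BytesIO()

# Copy in 8 KiB chunks, so the copy loop only holds one chunk at a time
shutil.copyfileobj(source, destination, 8192)

print(len(destination.getvalue()))
```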

Accessing dynamically generated content

So far, we have been able to retrieve files that were present on a remote server. To do that, we used requests.get and a URL that points to the file.

Well, it turns out that the same technique will work for many more types of content! What the server does with the URL is not our concern! Whether the server only needs to give you a file on disk or query a database and assemble different parts into the returned content does not matter at all.

That concept alone is extremely powerful, as you will see now. Remember: We just need to make sure we request the correct URL!

Let’s work on something fun now! The spike protein in SARS-CoV-2 is one of the most popular proteins lately. Can we get some information about it from UniProt using requests? Its UniProt ID is P0DTC2. Go check with your browser first. You should see something like this:

[Figure: UniProt entry for SARS-CoV-2]

One of the things UniProt provides is the amino acid sequence of the listed protein. Scroll down until you see this part:

[Figure: Sequence for SARS-CoV-2]

Do you think we can get only the sequence using Python? Let’s see!

To query a protein, you simply need to add its UniProt ID to the URL.

[9]:
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2")
r.raise_for_status()
print(r.text[:5000])
<!DOCTYPE html SYSTEM "about:legacy-compat">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en"><head><title>S - Spike glycoprotein precursor - Severe acute respiratory syndrome coronavirus 2 (2019-nCoV) - S gene &amp; protein</title><meta content="IE=edge" http-equiv="X-UA-Compatible"/><meta content="text/html; charset=UTF-8" http-equiv="Content-Type"/><meta content="width=device-width, initial-scale=1" name="viewport"/><link href="/" rel="home"/><link href="https://creativecommons.org/licenses/by/4.0/" rel="license"/><link type="image/vnd.microsoft.icon" href="/favicon.ico" rel="shortcut icon"/><link href="/uniprot.min.css2021_03" type="text/css" rel="stylesheet"/><link href="/tippy.css" type="text/css" rel="stylesheet"/><script type="text/javascript">
                        var BASE = '/';
                </script><script src="/js-compr.js2021_03" type="text/javascript"></script><script type="text/javascript">
                                uniprot.isInternal = false;
                                uniprot.namespace = 'uniprot';
                                uniprot.releasedate = '2021_03';
                        </script><script type="text/javascript">
                        ;
                </script><link href="opensearch.xml" title="UniProtKB" type="application/opensearchdescription+xml" rel="search"/><link href="https://www.uniprot.org/uniprot/P0DTC2" rel="canonical"/><link href="P0DTC2.rdf" title="RDF" type="application/rdf+xml" rel="alternate"/><link href="P0DTC2.rss?version=*" type="application/rss+xml" title="RSS" rel="alternate"/><script type="text/javascript">
                                // variable to store annotation data
                                var annotations = [];
                                var entryId = 'P0DTC2';
                                var isObsolete = false || !true;
                        </script><meta content="attaches the virion to the cell membrane by interacting with host receptor, initiating the infection. Binding to human ACE2 receptor and internalization of the virus into the endosomes of the host cell induces conformational changes in the Spike glycoprotein (PubMed:32142651, PubMed:32221306, PubMed:32075877, PubMed:32155444). Binding to host NRP1 and NRP2 via C-terminal polybasic sequence enhances virion entry into host cell (PubMed:33082294, PubMed:33082293). This interaction may explain virus tropism of human olfactory epithelium cells, which express high level of NRP1 and NRP2 but low level of ACE2 (PubMed:33082293). The stalk domain of S contains three hinges, giving the head unexpected orientational freedom (PubMed:32817270). Uses human TMPRSS2 for priming in human lung cells which is an essential step for viral entry (PubMed:32142651). Can be alternatively processed by host furin (PubMed:32362314). Proteolysis by cathepsin CTSL may unmask the fusion peptide of S2 and activate membranes fusion within endosomes." name="description"/><meta content="nositelinkssearchbox" name="google"/><script defer="defer" src="https://d3js.org/d3.v4.min.js"></script><script defer="defer" src="https://cdn.jsdelivr.net/npm/protvista-uniprot@latest/dist/protvista-uniprot.js"></script><script defer="defer" src="https://cdn.jsdelivr.net/npm/interaction-viewer@latest/dist/interaction-viewer.js"></script></head><body class="namespace-uniprot" typeof="WebPage" prefix="up: http://purl.uniprot.org/core/" vocab="http://schema.org/"><span id="evidenceToolTip" style="display:none">&#xd;
                                    &lt;p>An evidence describes the source of an annotation, e.g. an experiment that has been published in the scientific literature, an orthologous protein, a record from another database, etc.&lt;/p>&#xd;
&#xd;
&lt;p>&lt;a href="/manual/evidences">More...&lt;/a>&lt;/p>&#xd;
                                </span><p style="display:none"><a accesskey="2" href="#content">Skip Header</a></p><div id="masthead-container"><div class="masthead" id="local-masthead"><div id="local-title"><a id="logo" accesskey="1" href="/"><img alt="" src="/images/logos/Logo_medium.png" title="UniProt home"/></a></div><div class="namespace-uniprot" id="local-search"><form method="get" action="/uniprot" id="search-form"><div id="namespace-background"><div class="searchBoxIndicator" style="display:none" id="searchBoxIndicator1"> </div><div onclick="location.href=&apos;/help/text-search&apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator2"> </div><div onclick="location.href=&apos;/help/advanced_search &apos;;" class="searchBoxIndicator" style="display:none" id="searchBoxIndicator3"> </div><a class="namespace-select" id="select-namespace" onclick="return false;" href=""><span class="caret_white" id="selected-namespace">UniProtKB</span></a><ul style="display:none" class="select-namespace-options"><a href="#" class="closeBox" id="closeNamespaceOptions">x</a><li><ul><li class="fixedHeight_namespaces"><h3 class="namespace_uniprot"><a class="namespace-option uniprot" href="#" id="uniprot">UniProtKB</a></h3><p>Protein knowledgebase</p></li><li class="fixedHeight_namespaces"><h3 class="namespace_uniparc"><a class="namespace-option uniparc" href="#" id="uniparc">UniParc</a></h3><p>Sequence archive</p></li><li class="fixedHeight_namespaces"><h3 class="

Wow, what is all that noise? You are seeing the HTML content of the webpage! That’s the markup language web developers use to write webpages.

There are libraries to process HTML and extract the actual content (like BeautifulSoup; more below), but we will not need them yet. Fortunately, UniProt provides alternative representations of the data.

[Figure: UniProt formats]

Some formats are more convenient for programmatic use. If you click on Text you will see something different in your browser: just plain text! Also, notice how the URL is now different.

Just adding the .txt extension was enough to change the format. This is a nice feature UniProt provides. It mimics a file system, but it’s actually changing the representation of the returned content. Elegant! And more importantly, easier to use programmatically! Check it:

[10]:
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.txt")
r.raise_for_status()
print(r.text[:1000])
ID   SPIKE_SARS2             Reviewed;        1273 AA.
AC   P0DTC2;
DT   22-APR-2020, integrated into UniProtKB/Swiss-Prot.
DT   22-APR-2020, sequence version 1.
DT   02-JUN-2021, entry version 8.
DE   RecName: Full=Spike glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE            Short=S glycoprotein {ECO:0000255|HAMAP-Rule:MF_04099};
DE   AltName: Full=E2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   AltName: Full=Peplomer protein {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S1 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S2 {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Contains:
DE     RecName: Full=Spike protein S2' {ECO:0000255|HAMAP-Rule:MF_04099};
DE   Flags: Precursor;
GN   Name=S {ECO:0000255|HAMAP-Rule:MF_04099}; ORFNames=2;
OS   Severe acute respiratory syndrome coronavirus 2 (2019-nCoV) (SARS-CoV-2).
OC   Viruses; Riboviria; Orthornavirae; Pisuviricota; Pisoniviricetes;
OC   Nidovirales; Cornidovirineae;

This is exactly what we see in our browser! Plain text is nice for these things. However, the sequence is all the way at the end of the file. To retrieve it, you need to get creative and analyze the two-letter tags that begin each line. See how the sequence block begins with SQ and finishes with //:

SQ   SEQUENCE   1273 AA;  141178 MW;  B17BE6D9F1C4EA34 CRC64;
     MFVFLVLLPL VSSQCVNLTT RTQLPPAYTN SFTRGVYYPD KVFRSSVLHS TQDLFLPFFS
     NVTWFHAIHV SGTNGTKRFD NPVLPFNDGV YFASTEKSNI IRGWIFGTTL DSKTQSLLIV
     NNATNVVIKV CEFQFCNDPF LGVYYHKNNK SWMESEFRVY SSANNCTFEY VSQPFLMDLE
     GKQGNFKNLR EFVFKNIDGY FKIYSKHTPI NLVRDLPQGF SALEPLVDLP IGINITRFQT
     LLALHRSYLT PGDSSSGWTA GAAAYYVGYL QPRTFLLKYN ENGTITDAVD CALDPLSETK
     CTLKSFTVEK GIYQTSNFRV QPTESIVRFP NITNLCPFGE VFNATRFASV YAWNRKRISN
     CVADYSVLYN SASFSTFKCY GVSPTKLNDL CFTNVYADSF VIRGDEVRQI APGQTGKIAD
     YNYKLPDDFT GCVIAWNSNN LDSKVGGNYN YLYRLFRKSN LKPFERDIST EIYQAGSTPC
     NGVEGFNCYF PLQSYGFQPT NGVGYQPYRV VVLSFELLHA PATVCGPKKS TNLVKNKCVN
     FNFNGLTGTG VLTESNKKFL PFQQFGRDIA DTTDAVRDPQ TLEILDITPC SFGGVSVITP
     GTNTSNQVAV LYQDVNCTEV PVAIHADQLT PTWRVYSTGS NVFQTRAGCL IGAEHVNNSY
     ECDIPIGAGI CASYQTQTNS PRRARSVASQ SIIAYTMSLG AENSVAYSNN SIAIPTNFTI
     SVTTEILPVS MTKTSVDCTM YICGDSTECS NLLLQYGSFC TQLNRALTGI AVEQDKNTQE
     VFAQVKQIYK TPPIKDFGGF NFSQILPDPS KPSKRSFIED LLFNKVTLAD AGFIKQYGDC
     LGDIAARDLI CAQKFNGLTV LPPLLTDEMI AQYTSALLAG TITSGWTFGA GAALQIPFAM
     QMAYRFNGIG VTQNVLYENQ KLIANQFNSA IGKIQDSLSS TASALGKLQD VVNQNAQALN
     TLVKQLSSNF GAISSVLNDI LSRLDKVEAE VQIDRLITGR LQSLQTYVTQ QLIRAAEIRA
     SANLAATKMS ECVLGQSKRV DFCGKGYHLM SFPQSAPHGV VFLHVTYVPA QEKNFTTAPA
     ICHDGKAHFP REGVFVSNGT HWFVTQRNFY EPQIITTDNT FVSGNCDVVI GIVNNTVYDP
     LQPELDSFKE ELDKYFKNHT SPDVDLGDIS GINASVVNIQ KEIDRLNEVA KNLNESLIDL
     QELGKYEQYI KWPWYIWLGF IAGLIAIVMV TIMLCCMTSC CSCLKGCCSC GSCCKFDEDD
     SEPVLKGVKL HYT
//

Hence, you could do something like this:

[11]:
sequence_block = False
lines = []
for line in r.text.splitlines():
    if line.startswith("SQ"):
        sequence_block = True
    elif line.startswith("//"):
        sequence_block = False

    if sequence_block:
        line = line.strip()  # delete spaces and newlines at the beginning and end of the line
        line = line.replace(" ", "")  # delete spaces in the middle of the line
        lines.append(line)
sequence = "".join(lines[1:])  # the first line is the metadata header
print(f"This is your sequence: {sequence}")

# NBVAL_CHECK_OUTPUT
This is your sequence: MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT

Ta-da! We got it! It required some processing, but it works… However, you should always wonder if there’s an easier way. Given that UniProt had a nice way of providing the text representation, how come they don’t offer a URL that only returns the sequence for a given UniProt ID? Well, they do! Just change .txt for .fasta: https://www.uniprot.org/uniprot/P0DTC2.fasta

[12]:
r = requests.get("https://www.uniprot.org/uniprot/P0DTC2.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT
>sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFS
NVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIV
NNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLE
GKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQT
LLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETK
CTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISN
CVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPC
NGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVN
FNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITP
GTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSY
ECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTI
SVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQE
VFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAM
QMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALN
TLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRA
SANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPA
ICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDP
LQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDL
QELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT

This is returned in FASTA, a common syntax in bioinformatics. You could use established libraries like BioPython to parse it too!

[13]:
from Bio import SeqIO
from tempfile import NamedTemporaryFile
import os

# Write response into a temporary text file
with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
    tmp.write(r.text)

# Create the BioPython object for sequence data:
sequence = SeqIO.read(tmp.name, format="fasta")

# Delete temporary file now that we have read it
os.remove(tmp.name)

print(sequence.description)
print(sequence.seq)

# NBVAL_CHECK_OUTPUT
sp|P0DTC2|SPIKE_SARS2 Spike glycoprotein OS=Severe acute respiratory syndrome coronavirus 2 OX=2697049 GN=S PE=1 SV=1
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYFPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT
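For a single record, the FASTA layout is simple enough to parse by hand, which helps to see what a dedicated library does for you. The following is only a sketch: parse_single_fasta is a hypothetical helper, it handles exactly one record, and it performs no validation of the residues. The toy input is written by hand and reuses the first residues of the spike sequence shown above.

```python
def parse_single_fasta(text):
    """Split a one-record FASTA string into (description, sequence)."""
    lines = text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Not a FASTA record: the header must start with '>'")
    description = lines[0][1:]  # drop the leading '>'
    # The sequence is wrapped over several lines; join them back together
    sequence = "".join(line.strip() for line in lines[1:])
    return description, sequence


# Toy record written by hand (hypothetical header, truncated sequence)
toy_fasta = ">toy|EXAMPLE spike fragment\nMFVFLVLLPL\nVSSQCVNLTT\n"
description, sequence = parse_single_fasta(toy_fasta)
print(description)
print(sequence)
```

For anything beyond a quick check, an established parser remains the safer choice, since it also copes with multi-record files and malformed input.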

All these ways to access different representations or sections of the data contained in UniProt constitute a URL-based API (Application Programming Interface). The foundational principle is that the URL contains all the parameters needed to ask the server for a specific type of content. Yes, you read that correctly: parameters. If you think about it, a URL specifies two parts: the machine you are connecting to and the page in that machine you want to access. When the page part is missing, the server assumes you are asking for index.html or equivalent.
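The two parts of a URL described above can be pulled apart with the standard library. This short sketch splits the FASTA URL used earlier into the machine and the page, and then reads the identifier and the format off the page name. Treating the file name stem and suffix as "arguments" is just an illustration of the idea, not something UniProt documents in these terms.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

parts = urlsplit("https://www.uniprot.org/uniprot/P0DTC2.fasta")
page = PurePosixPath(parts.path)

print(parts.netloc)            # the machine you are connecting to
print(parts.path)              # the page on that machine
print(page.stem, page.suffix)  # the "arguments": identifier and format
```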

Let’s compare it to a command-line interface:

@ # this is your browser
@ uniprot.org/uniprot/P0DTC2.fasta
$ # this is your terminal
$ uniprot --id=P0DTC2 --format=FASTA

Each part of the URL can be considered a positional argument! So, if you want the sequence of a different protein, just input its UniProt ID in the URL, done! For example, P00519 is the ID for the ABL1 kinase.

[14]:
r = requests.get("https://www.uniprot.org/uniprot/P00519.fasta")
r.raise_for_status()
print(r.text)

# NBVAL_CHECK_OUTPUT
>sp|P00519|ABL1_HUMAN Tyrosine-protein kinase ABL1 OS=Homo sapiens OX=9606 GN=ABL1 PE=1 SV=4
MLEICLKLVGCKSKKGLSSSSSCYLEEALQRPVASDFEPQGLSEAARWNSKENLLAGPSE
NDPNLFVALYDFVASGDNTLSITKGEKLRVLGYNHNGEWCEAQTKNGQGWVPSNYITPVN
SLEKHSWYHGPVSRNAAEYLLSSGINGSFLVRESESSPGQRSISLRYEGRVYHYRINTAS
DGKLYVSSESRFNTLAELVHHHSTVADGLITTLHYPAPKRNKPTVYGVSPNYDKWEMERT
DITMKHKLGGGQYGEVYEGVWKKYSLTVAVKTLKEDTMEVEEFLKEAAVMKEIKHPNLVQ
LLGVCTREPPFYIITEFMTYGNLLDYLRECNRQEVNAVVLLYMATQISSAMEYLEKKNFI
HRDLAARNCLVGENHLVKVADFGLSRLMTGDTYTAHAGAKFPIKWTAPESLAYNKFSIKS
DVWAFGVLLWEIATYGMSPYPGIDLSQVYELLEKDYRMERPEGCPEKVYELMRACWQWNP
SDRPSFAEIHQAFETMFQESSISDEVEKELGKQGVRGAVSTLLQAPELPTKTRTSRRAAE
HRDTTDVPEMPHSKGQGESDPLDHEPAVSPLLPRKERGPPEGGLNEDERLLPKDKKTNLF
SALIKKKKKTAPTPPKRSSSFREMDGQPERRGAGEEEGRDISNGALAFTPLDTADPAKSP
KPSNGAGVPNGALRESGGSGFRSPHLWKKSSTLTSSRLATGEEEGGGSSSKRFLRSCSAS
CVPHGAKDTEWRSVTLPRDLQSTGRQFDSSTFGGHKSEKPALPRKRAGENRSDQVTRGTV
TPPPRLVKKNEEAADEVFKDIMESSPGSSPPNLTPKPLRRQVTVAPASGLPHKEEAGKGS
ALGTPAAAEPVTPTSKAGSGAPGGTSKGPAEESRVRRHKHSSESPGRDKGKLSRLKPAPP
PPPAASAGKAGGKPSQSPSQEAAGEAVLGAKTKATSLVDAVNSDAAKPSQPGEGLKKPVL
PATPKPQSAKPSGTPISPAPVPSTLPSASSALAGDQPSSTAFIPLISTRVSLRKTRQPPE
RIASGAITKGVVLDSTEALCLAISRNSEQMASHSAVLEAGKNLYTFCVSYVDSIQQMRNK
FAFREAINKLENNLRELQICPATAGSGPAATQDFSKLLSSVKEISDIVQR

What if we parameterize the URL with an f-string and provide a function to make it super Pythonic? Even better, what if we provide the Bio.SeqIO parsing functionality too?

[15]:
def sequence_for_uniprot_id(uniprot_id):
    """
    Returns the FASTA sequence of a given UniProt ID using
    the UniProt URL-based API

    Parameters
    ----------
    uniprot_id : str

    Returns
    -------
    Bio.SeqRecord.SeqRecord
    """
    #                                                  ⬇ this is the key part!
    r = requests.get(f"https://www.uniprot.org/uniprot/{uniprot_id}.fasta")
    r.raise_for_status()

    with NamedTemporaryFile(suffix=".fasta", mode="w", delete=False) as tmp:
        tmp.write(r.text)

    sequence = SeqIO.read(tmp.name, format="fasta")
    os.remove(tmp.name)

    return sequence

Now you can use it for any UniProt ID. This is for the Src kinase:

[16]:
sequence = sequence_for_uniprot_id("P12931")
print(sequence)

# NBVAL_CHECK_OUTPUT
ID: sp|P12931|SRC_HUMAN
Name: sp|P12931|SRC_HUMAN
Description: sp|P12931|SRC_HUMAN Proto-oncogene tyrosine-protein kinase Src OS=Homo sapiens OX=9606 GN=SRC PE=1 SV=3
Number of features: 0
Seq('MGSNKSKPKDASQRRRSLEPAENVHGAGGGAFPASQTPSKPASADGHRGPSAAF...ENL', SingleLetterAlphabet())

Congratulations! You have used your first online API in Python and adapted it to a workflow!

Programmatic interfaces

What UniProt does with their URLs is one way of providing access to their database, i.e., through specific URL schemes. However, if each web service had to come up with its own scheme, developers would need to figure out which scheme each website is using, and then implement, adapt or customize their scripts on a case-by-case basis. Fortunately, there are some standardized ways of providing programmatic access to online resources. Some of them include:

  • HTTP-based RESTful APIs (wiki)

  • GraphQL

  • SOAP

  • gRPC

In this talktorial, we will focus on the first one, REST.

HTTP-based RESTful APIs

This type of programmatic access defines a specific entry point for clients (scripts, libraries, programs) that require programmatic access, something like api.webservice.com. This is usually different from the website itself (webservice.com). They can be versioned, so the provider can update the scheme without disrupting existing implementations (api.webservice.com/v1 will still work even when api.webservice.com/v2 has been deployed).

This kind of API is usually accompanied by well-written documentation explaining all the available actions in the platform. For example, look at the KLIFS API documentation. KLIFS is a database of kinase targets and small compound inhibitors. You can see how every argument and option is documented, along with usage examples.

If you want to list all the kinase groups available in KLIFS, you need to access this URL:

https://klifs.net/api/kinase_groups

Result:

[
  "AGC",
  "CAMK",
  "CK1",
  "CMGC",
  "Other",
  "STE",
  "TK",
  "TKL"
]

This response happens to be JSON-formatted! This is easily parsed into a Python object using the json library. The best news is that you don’t even need that. Using requests, the following operation can be done in three lines thanks to the .json() method:

[17]:
import requests

response = requests.get("https://klifs.net/api/kinase_groups")
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT
[17]:
['AGC', 'CAMK', 'CK1', 'CMGC', 'Other', 'STE', 'TK', 'TKL']

That’s a Python list!

[18]:
result[0]
[18]:
'AGC'
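
For comparison, this is roughly what `.json()` saves you from doing by hand with the standard library `json` module. The `raw_text` string below is copied from the response shown above rather than downloaded, so the sketch runs offline.

```python
import json

# Raw response body, as text, copied from the KLIFS result shown above
raw_text = '["AGC", "CAMK", "CK1", "CMGC", "Other", "STE", "TK", "TKL"]'

groups = json.loads(raw_text)  # JSON array -> Python list
print(type(groups).__name__, len(groups))
print(groups[0])
```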

Let’s see if we can get all the kinase families contained in a specific group. Reading the documentation, looks like we need this kind of URL:

https://klifs.net/api/kinase_families?kinase_group={{ NAME }}

What follows after the ? symbol is the query. It’s formatted with a key-value syntax like this: key=value. Multiple parameters can be expressed with &:

https://api.webservice.com/some/endpoint?parameter1=value1&parameter2=value2
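
The anatomy of such a URL can be inspected with `urllib.parse` from the standard library. A minimal sketch on the placeholder URL above:

```python
from urllib.parse import urlsplit, parse_qs

url = "https://api.webservice.com/some/endpoint?parameter1=value1&parameter2=value2"

parts = urlsplit(url)
print(parts.netloc)  # host: api.webservice.com
print(parts.path)    # endpoint: /some/endpoint
print(parts.query)   # everything after the "?"

# parse_qs maps each key to a list, because a key may appear more than once
query = parse_qs(parts.query)
print(query)
```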

Let’s see the returned object for the tyrosine kinase (TK) group: kinase_group=TK

[19]:
response = requests.get("https://klifs.net/api/kinase_families?kinase_group=TK")
response.raise_for_status()
result = response.json()
result
[19]:
['ALK',
 'Abl',
 'Ack',
 'Alk',
 'Axl',
 'CCK4',
 'Csk',
 'DDR',
 'EGFR',
 'Eph',
 'FAK',
 'FGFR',
 'Fer',
 'InsR',
 'JakA',
 'JakB',
 'Lmr',
 'Met',
 'Musk',
 'PDGFR',
 'Ret',
 'Ror',
 'Ryk',
 'Sev',
 'Src',
 'Syk',
 'TK-Unique',
 'Tec',
 'Tie',
 'Trk',
 'VEGFR']

Since passing parameters to the URL is a very common task, requests provides a more convenient way. This will save you from building the URLs manually or URL-encoding (percent-escaping) the values yourself. The key idea is to pass the key-value pairs as a dictionary. The previous query can be (and should be, if you ask us) done like this:

[20]:
response = requests.get("https://klifs.net/api/kinase_families", params={"kinase_group": "TK"})
# You can see how requests formatted the URL for you
print("Queried", response.url)
response.raise_for_status()
result = response.json()
result
Queried https://klifs.net/api/kinase_families?kinase_group=TK
[20]:
['ALK',
 'Abl',
 'Ack',
 'Alk',
 'Axl',
 'CCK4',
 'Csk',
 'DDR',
 'EGFR',
 'Eph',
 'FAK',
 'FGFR',
 'Fer',
 'InsR',
 'JakA',
 'JakB',
 'Lmr',
 'Met',
 'Musk',
 'PDGFR',
 'Ret',
 'Ror',
 'Ryk',
 'Sev',
 'Src',
 'Syk',
 'TK-Unique',
 'Tec',
 'Tie',
 'Trk',
 'VEGFR']
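
To illustrate what escaping the values means, the sketch below encodes two parameters with `urllib.parse.urlencode` from the standard library. This only approximates what `requests` does with `params`. The comma-separated value uses the `TKL,STE` form that KLIFS accepts for several groups, while the `note` key is made up to show how a space and an ampersand are handled.

```python
from urllib.parse import urlencode, parse_qs

params = {
    "kinase_group": "TKL,STE",   # comma-separated list of groups
    "note": "binds ATP & Mg",    # hypothetical key, not a KLIFS parameter
}

query = urlencode(params)
print(query)

# Decoding recovers the original values, so nothing was lost
decoded = {key: values[0] for key, values in parse_qs(query).items()}
print(decoded == params)
```

Without encoding, the `&` inside the second value would be read as the start of a new parameter.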

Sometimes the returned JSON object is not a list, but a dict. Or a combination of dictionaries and lists. Maybe even nested! You can still access them using the Python tools you already know.

For example, the kinase_information endpoint requires a numeric ID, and will return a lot of information on a single kinase:

[21]:
response = requests.get("https://klifs.net/api/kinase_information", params={"kinase_ID": 22})
response.raise_for_status()
result = response.json()
result

# NBVAL_CHECK_OUTPUT
[21]:
[{'kinase_ID': 22,
  'name': 'MASTL',
  'HGNC': 'MASTL',
  'family': 'MAST',
  'group': 'AGC',
  'kinase_class': 'MASTL',
  'species': 'Human',
  'full_name': 'microtubule associated serine/threonine kinase like',
  'uniprot': 'Q96GX5',
  'iuphar': 0,
  'pocket': 'KPISRGAFGKVYLYAVKVVQVQAERDALALSKPFIVHLYYSYLVMEYLIGGDVKSLLHIYLHRHGIIHRDLKPDNMLILTDFGLS'}]

If you want to know the UniProt ID for this kinase, you will need to access the first (and only) element in the returned list, and ask for the value of the uniprot key:

[22]:
result[0]["uniprot"]
[22]:
'Q96GX5'
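
When digging into nested responses in a script, it may help to guard against empty lists or missing keys. The helper `first_value` below is a hypothetical convenience and is not part of KLIFS or `requests`; the record is a shortened copy of the response shown above.

```python
def first_value(records, key, default=None):
    """Return records[0][key], or default if the list is empty or the key is absent."""
    if not records:
        return default
    return records[0].get(key, default)


# Shortened copy of the kinase_information response for kinase_ID=22
records = [{"kinase_ID": 22, "name": "MASTL", "group": "AGC", "uniprot": "Q96GX5"}]

print(first_value(records, "uniprot"))
print(first_value(records, "pdb", default="n/a"))  # key not in the record
print(first_value([], "uniprot"))                  # empty response
```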

Turns out we can use this to get the full sequence of the protein (and not just the pocket sequence) using our UniProt function from before!

[23]:
mastl = sequence_for_uniprot_id(result[0]["uniprot"])
print(mastl.seq)

# NBVAL_CHECK_OUTPUT
MDPTAGSKKEPGGGAATEEGVNRIAVPKPPSIEEFSIVKPISRGAFGKVYLGQKGGKLYAVKVVKKADMINKNMTHQVQAERDALALSKSPFIVHLYYSLQSANNVYLVMEYLIGGDVKSLLHIYGYFDEEMAVKYISEVALALDYLHRHGIIHRDLKPDNMLISNEGHIKLTDFGLSKVTLNRDINMMDILTTPSMAKPRQDYSRTPGQVLSLISSLGFNTPIAEKNQDPANILSACLSETSQLSQGLVCPMSVDQKDTTPYSSKLLKSCLETVASNPGMPVKCLTSNLLQSRKRLATSSASSQSHTFISSVESECHSSPKWEKDCQESDEALGPTMMSWNAVEKLCAKSANAIETKGFNKKDLELALSPIHNSSALPTTGRSCVNLAKKCFSGEVSWEAVELDVNNINMDTDTSQLGFHQSNQWAVDSGGISEEHLGKRSLKRNFELVDSSPCKKIIQNKKTCVEYKHNEMTNCYTNQNTGLTVEVQDLKLSVHKSQQNDCANKENIVNSFTDKQQTPEKLPIPMIAKNLMCELDEDCEKNSKRDYLSSSFLCSDDDRASKNISMNSDSSFPGISIMESPLESQPLDSDRSIKESSFEESNIEDPLIVTPDCQEKTSPKGVENPAVQESNQKMLGPPLEVLKTLASKRNAVAFRSFNSHINASNNSEPSRMNMTSLDAMDISCAYSGSYPMAITPTQKRRSCMPHQQTPNQIKSGTPYRTPKSVRRGVAPVDDGRILGTPDYLAPELLLGRAHGPAVDWWALGVCLFEFLTGIPPFNDETPQQVFQNILKRDIPWPEGEEKLSDNAQSAVEILLTIDDTKRAGMKELKRHPLFSDVDWENLQHQTMPFIPQPDDETDTSYFEARNTAQHLTVSGFSL

We are using two webservices together, awesome!

Generating a client for any API

Did you find that convenient? Well, we are not done yet! You might have noticed that all the endpoints in the KLIFS API have a similar pattern. You specify the name of the endpoint (kinase_groups, kinase_families, kinase_information, …), pass some (optional) parameters if needed, and then get a JSON-formatted response. Is there a way you can avoid having to format the URLs yourself? The answer is… yes!

The scheme of a REST API can be expressed programmatically in a document called a Swagger/OpenAPI definition. From it, a Python client can be generated dynamically for any REST API that publishes such a definition. This is the one for KLIFS.
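
To make the idea concrete, the sketch below walks over a tiny, hand-written definition in the Swagger 2.0 layout (`paths`, then HTTP method, then `operationId`) and lists the operations a generated client could expose. The `MINI_SPEC` dictionary is a hypothetical fragment written for this example. It is not the actual KLIFS definition, and its paths and operation names are assumptions.

```python
MINI_SPEC = {
    "swagger": "2.0",
    "basePath": "/api",
    "paths": {
        "/kinase_names": {
            "get": {"operationId": "get_kinase_names", "tags": ["Information"]}
        },
        "/kinase_information": {
            "get": {"operationId": "get_kinase_information", "tags": ["Information"]}
        },
    },
}


def list_operations(spec):
    """Return (tag, operationId, HTTP method, full path) for every operation."""
    operations = []
    for path, methods in spec["paths"].items():
        for method, details in methods.items():
            tag = details.get("tags", ["default"])[0]
            full_path = spec.get("basePath", "") + path
            operations.append((tag, details["operationId"], method.upper(), full_path))
    return sorted(operations)


for tag, name, method, path in list_operations(MINI_SPEC):
    print(f"client.{tag}.{name}()  ->  {method} {path}")
```

A real client generator additionally reads parameter and response types from the definition, which is what makes validation and auto-generated docstrings possible.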

Of course, there are libraries for doing that in Python, like bravado.

[24]:
from bravado.client import SwaggerClient

KLIFS_SWAGGER = "https://klifs.net/swagger/swagger.json"
client = SwaggerClient.from_url(KLIFS_SWAGGER, config={"validate_responses": False})
client
[24]:
SwaggerClient(https://klifs.net/api)

Then, you can have fun inspecting the client object for all the API actions as methods.

Tip: Type client. and press Tab to inspect the client in this notebook.

[25]:
client.Information.get_kinase_names?
Signature:      client.Information.get_kinase_names(**op_kwargs)
Type:           CallableOperation
String form:    <bravado.client.CallableOperation object at 0x7f6ffebd6ca0>
File:           ~/.local/miniconda/envs/toc/lib/python3.9/site-packages/bravado/client.py
Docstring:
[GET] Kinase names

The Kinase names endpoint returns a list of all available kinases in KLIFS according using the HGNC gene symbols. When a kinase group or kinase family is specified only those kinase names that are within that kinase group or kinase family are returned. When both a group and a family are specified, only the family is used to process the request.


:param kinase_group: Optional: Name (or multiple names separated by a comma) of the kinase group for which the kinase families are requested (e.g. TKL,STE). (optional)
:type kinase_group: string
:param kinase_family: Optional: Name (or multiple names separated by a comma) of the kinase family for which the kinase names are requested (e.g. AUR,WEE). (optional)
:type kinase_family: string
:param species: Optional: Species for which the kinase names are requested (e.g. HUMAN OR MOUSE). (optional)
:type species: string
:returns: 200: An array of IDs and kinase names
:rtype: array:#/definitions/IDlist
:returns: default: Unexpected error
:rtype: #/definitions/Error
Call docstring:
Invoke the actual HTTP request and return a future.

:rtype: :class:`bravado.http_future.HTTPFuture`

bravado is auto-generating classes and functions that mirror the API we were using before! How cool is that? The same query can now be done without requests.

[26]:
client.Information.get_kinase_information(kinase_ID=[22])
[26]:
<bravado.http_future.HttpFuture at 0x7f6ffeb36670>

Note that bravado does not return the response right away. It creates a promise that it will do so when you ask for it. This allows it to be usable in asynchronous programming, but for our purposes, it means that you need to call it with .result().
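
The same deferred-result pattern exists in the standard library. The sketch below uses `concurrent.futures`, not bravado, and a made-up `slow_lookup` function that stands in for an HTTP request:

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_lookup(kinase_id):
    """Stand-in for a network call; waits briefly and returns a fake record."""
    time.sleep(0.1)
    return {"kinase_ID": kinase_id}


with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(slow_lookup, 22)  # returns immediately with a Future
    print(type(future).__name__)
    record = future.result()               # blocks until the value is ready
    print(record)
```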

[27]:
results = client.Information.get_kinase_information(kinase_ID=[22]).result()
result = results[0]
result
[27]:
KinaseInformation(HGNC='MASTL', family='MAST', full_name='microtubule associated serine/threonine kinase like', group='AGC', iuphar=0, kinase_ID=22, kinase_class='MASTL', name='MASTL', pocket='KPISRGAFGKVYLYAVKVVQVQAERDALALSKPFIVHLYYSYLVMEYLIGGDVKSLLHIYLHRHGIIHRDLKPDNMLILTDFGLS', species='Human', uniprot='Q96GX5')
[28]:
result.uniprot

# NBVAL_CHECK_OUTPUT
[28]:
'Q96GX5'

bravado also builds result objects for you, so you don’t have to use the result["property"] syntax, but the result.property one. Some more convenience for the end user ;)
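
As a rough analogy, a dictionary can be turned into an object with attribute access using `types.SimpleNamespace` from the standard library. This is not how bravado works internally (it builds its own model classes from the API definition); the record is a shortened copy of the response above.

```python
from types import SimpleNamespace

record = {"kinase_ID": 22, "name": "MASTL", "uniprot": "Q96GX5"}

kinase = SimpleNamespace(**record)
print(record["uniprot"])  # dictionary syntax
print(kinase.uniprot)     # attribute syntax, same value
```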

Document parsing

Sometimes the web service will not provide a standardized API that produces machine-readable documents. Instead, you will have to use the regular webpage and parse through the HTML code to obtain the information you need. This is called (web) scraping, which usually involves finding the right HTML tags and IDs that contain the valuable data (ignoring things such as the sidebars, top menus, footers, ads, etc).

In scraping, you basically do two things:

  1. Access the webpage with requests and obtain the HTML contents.

  2. Parse the HTML string with BeautifulSoup or requests-html.
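
Step 2 can be tried offline on a small inline snippet. The sketch below deliberately swaps BeautifulSoup for the standard library's `html.parser` so that it has no dependencies. The HTML string and the `TableCollector` class are invented for this example.

```python
from html.parser import HTMLParser

HTML = """
<table id="amino_acids">
  <tr><th>Code</th><th>Name</th><th>pI</th></tr>
  <tr><td>A</td><td>Ala</td><td>6.01</td></tr>
  <tr><td>C</td><td>Cys</td><td>5.05</td></tr>
</table>
"""


class TableCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data.strip()


parser = TableCollector()
parser.feed(HTML)
rows = [row for row in parser.rows if row]  # drop the header row (no <td>)
print(rows)
```

Libraries such as BeautifulSoup hide this event-driven bookkeeping behind search methods like `find` and `find_all`, which is why they are usually preferred for real pages.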

Let’s parse the proteinogenic amino acids table in this Wikipedia article:

[29]:
import requests
from bs4 import BeautifulSoup
import numpy as np
import pandas as pd

r = requests.get("https://en.wikipedia.org/wiki/Proteinogenic_amino_acid")
r.raise_for_status()

# To guess the correct steps here, you will have to inspect the HTML code by hand
# Tip: use right-click + inspect content in any webpage to land in the HTML definition ;)
html = BeautifulSoup(r.text)
header = html.find("span", id="General_chemical_properties")
table = header.find_all_next()[4]
table_body = table.find("tbody")

data = []
for row in table_body.find_all("tr"):
    cells = row.find_all("td")
    if cells:
        data.append([])
    for cell in cells:
        cell_content = cell.text.strip()
        try:  # convert to float if possible
            cell_content = float(cell_content)
        except ValueError:
            pass
        data[-1].append(cell_content)

# Empty fields are denoted with "?" which casts respective columns to object types
# (here mix of strings and floats) but we want float64, therefore replace "?" with NaN values
pd.DataFrame.from_records(data).replace("?", np.nan)

# NBVAL_CHECK_OUTPUT
[29]:
0 1 2 3 4 5
0 A Ala 89.09404 6.01 2.35 9.87
1 C Cys 121.15404 5.05 1.92 10.70
2 D Asp 133.10384 2.85 1.99 9.90
3 E Glu 147.13074 3.15 2.10 9.47
4 F Phe 165.19184 5.49 2.20 9.31
5 G Gly 75.06714 6.06 2.35 9.78
6 H His 155.15634 7.60 1.80 9.33
7 I Ile 131.17464 6.05 2.32 9.76
8 K Lys 146.18934 9.60 2.16 9.06
9 L Leu 131.17464 6.01 2.33 9.74
10 M Met 149.20784 5.74 2.13 9.28
11 N Asn 132.11904 5.41 2.14 8.72
12 O Pyl 255.31000 NaN NaN NaN
13 P Pro 115.13194 6.30 1.95 10.64
14 Q Gln 146.14594 5.65 2.17 9.13
15 R Arg 174.20274 10.76 1.82 8.99
16 S Ser 105.09344 5.68 2.19 9.21
17 T Thr 119.12034 5.60 2.09 9.10
18 U Sec 168.05300 5.47 1.91 10.00
19 V Val 117.14784 6.00 2.39 9.74
20 W Trp 204.22844 5.89 2.46 9.41
21 Y Tyr 181.19124 5.64 2.20 9.21

If you want to get an image, you need to find img tags and retrieve the src property.

[30]:
from IPython.display import Image, display

display(Image(f'https:{html.find("img")["src"]}'))
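
The `https:` prefix in the cell above is presumably needed because the `src` value starts with `//`, i.e. it is a protocol-relative URL. A more general way to resolve `src` values is `urllib.parse.urljoin`, sketched below. The image paths are placeholders, not actual Wikipedia files.

```python
from urllib.parse import urljoin

page_url = "https://en.wikipedia.org/wiki/Proteinogenic_amino_acid"

# Three common shapes of an <img src="..."> value (placeholder paths)
sources = [
    "//upload.example.org/images/a.png",  # protocol-relative
    "/static/images/b.png",               # relative to the site root
    "https://cdn.example.org/c.png",      # already absolute
]

resolved = [urljoin(page_url, src) for src in sources]
for url in resolved:
    print(url)
```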

Browser remote control

Some years ago, the trend was to build servers that dynamically generate HTML documents with some JavaScript here and there (such as Wikipedia). In other words, the HTML is built on the server and sent to the client (your browser).

More recently, the trend points towards full applications built entirely with JavaScript frameworks. This means that the HTML content is dynamically generated on the client. Traditional parsing will not work in this case: you will only download the placeholder HTML code that hosts the JavaScript framework, not the data you see in the browser. To work around this, the HTML must be rendered with a client-side JavaScript engine.

We won’t cover this in the current notebook, but you can check the following projects if you are interested:


Discussion

In this talktorial, you have seen several ways to programmatically access online web services from a Python interpreter. With these techniques, you will be able to build automated pipelines inside Jupyter Notebooks. In the end, querying a database or downloading a file involves the same kind of tooling.

Unfortunately, there is too much material to cover about web APIs in a single lesson. For example, how do you send or upload contents from Python? Can you submit forms? If you are interested in knowing more, the requests documentation should be your go-to resource. Some interesting parts include:

Quiz

  • Use the KLIFS API (with or without bravado, up to you) to find all kinases that can bind staurosporine (ligand code STU).

  • How can you find the correct HTML tags and identifiers to scrape a specific part of a website? Can it be automated?

  • Would you rather use programmatic APIs or manually crafted scrapers?