Using Database Modules in Inputs #179

Open
ervau opened this issue Dec 28, 2021 · 15 comments
Assignees
Labels
help wanted User needs help question User requires information

Comments

@ervau
Collaborator

ervau commented Dec 28, 2021

Hi, I can not reach some of the modules in the "inputs" module. The ones I can use are shown below. Is there another way to use the other ones?

[Two screenshots of the modules available under pypath.inputs; images not preserved]

Also, I couldn't find a module to retrieve EC numbers for proteins. Is there a way to do this?

OS: MacOS
Python version: 3.8
Version or commit hash v0.13.13

@ervau ervau added help wanted User needs help question User requires information labels Dec 28, 2021
@deeenes
Member

deeenes commented Dec 28, 2021

Hi Erva,

That's weird, though here you are looking at it through an IDE or some other GUI tool. The modules are definitely there, and you could just try what happens if you try to import one which is clearly not on your list:

from pypath.inputs import icellnet

My guess is that your IDE generates the list just by getting the keys from the __dir__ of any given object. And in the inputs module we import almost no submodules by default, we import those only on demand. For example, the readline autocompletion (if turned on) in the Python shell is able to list all the submodules, but at the cost of importing all of them.

Overall, I think it's not a problem, just a limitation of your IDE, and you can just go ahead and import the modules you need.
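
A minimal sketch of the on-demand import behaviour described above, shown on the standard library's json package so that it runs without pypath. The helper name list_submodules is hypothetical, and the commented call for pypath.inputs assumes pypath is installed.

```python
import importlib
import pkgutil


def list_submodules(package_name):
    """List the submodules of a package from disk, without importing them."""
    package = importlib.import_module(package_name)
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


# json ships a `tool` submodule that a plain `import json` does not load,
# so a tool that only inspects already imported names may not list it.
submodules = list_submodules('json')
print(submodules)

# With pypath installed, the same call should work for the inputs package:
# list_submodules('pypath.inputs')
```

Any name in the returned list can then be imported explicitly, exactly like the icellnet example above.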

EC numbers: wait a bit, I'm gonna take a look.

Best,

Denes

@deeenes deeenes self-assigned this Dec 28, 2021
@ervau
Collaborator Author

ervau commented Dec 29, 2021

The modules are definitely there, and you could just try what happens if you try to import one which is clearly not on your list:

from pypath.inputs import icellnet

This fixed my problem. It was probably about IDE as you mentioned. Thanks!

@deeenes
Member

deeenes commented Dec 31, 2021

Great! EC numbers: it was not possible, but I've just added it: 61a205b

Then you can do something like:

from pypath.utils import mapping
mapping.map_name('P49841', 'uniprot', 'ec')
# {'2.7.11.26', '2.7.11.1'}

Starting from a gene symbol:

mapping.get_mapper().chain_map('GSK3B', 'genesymbol', 'uniprot', 'ec')
# {'2.7.11.26', '2.7.11.1'}

An alternative, suitable only if working with a small number of proteins:

from pypath.utils import uniprot
gsk3b = uniprot.UniprotProtein('P49841')
gsk3b.ec
# {'2.7.11.26', '2.7.11.1'}

Another alternative, a dict of ECs for the whole proteome:

from pypath.inputs import uniprot as uniprot_input
ec = uniprot_input.uniprot_data('ec')
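
If the opposite direction is needed (which proteins carry a given EC number), a mapping of this kind can be inverted in plain Python. This is only a sketch: it assumes a dict from UniProt ACs to sets of EC numbers, shaped like the map_name results above. The exact value type returned by uniprot_data('ec') is not shown in this thread, so it may need converting first. The function name proteins_by_ec is hypothetical.

```python
from collections import defaultdict


def proteins_by_ec(ec_by_protein):
    """Invert {uniprot_ac: {ec, ...}} into {ec: {uniprot_ac, ...}}."""
    inverted = defaultdict(set)
    for uniprot_ac, ec_numbers in ec_by_protein.items():
        for ec in ec_numbers:
            inverted[ec].add(uniprot_ac)
    return dict(inverted)


# sample taken from the GSK3B example above
sample = {'P49841': {'2.7.11.26', '2.7.11.1'}}
by_ec = proteins_by_ec(sample)
print(by_ec)
```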

@ervau
Collaborator Author

ervau commented Jan 5, 2022

Thanks for your help. Another suggestion; can we add evidence codes to the dictionary that go.go_annotations_goa() creates? like a small adjustment below:

The current version of the related part:

for line in c.result:
    if not line or line[0] == '!':
        continue

    line = line.strip().split('\t')
    annot[line[8]][line[1]].add(line[4])

We can add evidence codes like:

for line in c.result:
    if not line or line[0] == '!':
        continue

    line = line.strip().split('\t')
    annot[line[8]][line[1]].add((line[4], line[6]))

We can also add this as an optional parameter. What do you think?
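
The optional-parameter variant could look roughly like the sketch below. It is self-contained and does not touch pypath: parse_gaf and its evidence_codes argument are hypothetical names, and the sample line is a made-up record that only follows the GAF column order used above (column 1: protein ID, 4: GO ID, 6: evidence code, 8: aspect).

```python
from collections import defaultdict


def parse_gaf(lines, evidence_codes=False):
    """Collect GO annotations as {aspect: {protein: {term or (term, code)}}}."""
    annot = defaultdict(lambda: defaultdict(set))
    for line in lines:
        # skip empty lines and GAF header/comment lines
        if not line or line[0] == '!':
            continue
        fields = line.rstrip('\n').split('\t')
        value = (fields[4], fields[6]) if evidence_codes else fields[4]
        annot[fields[8]][fields[1]].add(value)
    return annot


# made-up record, for illustration only
SAMPLE = [
    '!gaf-version: 2.2',
    'UniProtKB\tP00001\tGENE1\t\tGO:0000001\tPMID:0\tIDA\t\tP',
]
plain = parse_gaf(SAMPLE)
with_codes = parse_gaf(SAMPLE, evidence_codes=True)
```

Keeping the default at False would leave the return value unchanged for existing callers.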

@deeenes
Member

deeenes commented Jan 5, 2022

Hi Erva,

That's a great idea, do you want to open a pull request? I see this function is used at 4 places in pypath, we should just make sure those work fine with the new return values.

@elifcevrim
Collaborator

Hello Denes,

I have some questions about KEGG database in pypath.

  1. When I check the KEGG Pathway Maps for human in the KEGG database, I get 345 different pathways. However, there are only 188 pathways in pypath. Is this because pypath applies some filters, or because it keeps only protein entities?

Here are the modules that I use.

import pypath.core.annot

a = pypath.core.annot.get_db()
kegg = a.annots['KEGG']
kegg.make_df()
kegg_df = kegg.df
kegg_df
len(kegg_df['value'].unique())

from pypath.inputs import kegg
import pandas as pd

kegg_int = kegg.kegg_interactions()
kegg_int = pd.DataFrame(kegg_int)
len(kegg_int.iloc[:,3].unique())

  2. When I use the kegg_dbget function with the example queries from https://www.genome.jp/kegg/kegg3.html, I can't get results for some of them:
  • KEGG pathway map: map00010
  • Functional ortholog: K04527
  • Gene / protein: hsa:3643, vg:155971, vp:155971-1, ag:CAA76703
  • Enzyme: ec:2.7.10.1

What could be the reason for this? Or is there any way to reach those data?

from pypath.inputs import kegg
kegg_dict = kegg.kegg_dbget('map00010')  #https://www.genome.jp/entry/map00010
kegg_dict
**Traceback**
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_4520/2197661993.py in <module>
----> 1 kegg_dict = kegg.kegg_dbget('map00010')
      2 kegg_dict

~\anaconda3\envs\pypath\lib\site-packages\pypath\inputs\kegg.py in kegg_dbget(entry)
    750 
    751             collecting_ref = True
--> 752             last_ref['PMID'] = re.findall(r'\d+', td.text)[-1]
    753             continue
    754 

IndexError: list index out of range

kegg_dict = kegg.kegg_dbget('hsa:3643') #https://www.genome.jp/entry/hsa:3643
kegg_dict
{'Type': '3643\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0CDS\xa0\xa0\xa0\xa0\xa0\xa0\xa0T01001\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0',
 'Entry': '3643CDST01001\n'}

@deeenes
Member

deeenes commented Jan 14, 2022

Hi Elif,

Hello Denes,

I have some questions about KEGG database in pypath.

1. When I check the KEGG Pathway Maps for human in the KEGG database, I get 345 different pathways. However, there are only 188 pathways in pypath. Is this because pypath applies some filters, or because it keeps only protein entities?

Here are the modules that I use.

Just a minor note, I think the line below builds the annotation database every time it's called, which takes very long. You can either load a previously built database, or load only the KEGG annotations (requiring much less memory!):

from pypath import omnipath
a = omnipath.db.get_db('annotations')

from pypath.core import annot
kegg = annot.KeggPathways()

a = pypath.core.annot.get_db()
kegg = a.annots['KEGG']
kegg.make_df()
kegg_df = kegg.df
kegg_df
len(kegg_df['value'].unique())

from pypath.inputs import kegg
import pandas as pd

kegg_int = kegg.kegg_interactions()
kegg_int = pd.DataFrame(kegg_int)
len(kegg_int.iloc[:,3].unique())

For me both of these result in 203 (not surprising, as the data comes from the same place). In general, we process unambiguous protein-protein interactions from KEGG Pathways where we can translate both partners to UniProt. If a pathway doesn't contain any such interaction, it won't be in this dataset. I can imagine metabolic pathways are often like this, or pathways with virus proteins, etc. We can look into it more specifically if you give an example of an interaction or pathway that is missing.
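
To illustrate why whole pathways can disappear, here is a toy version of the filtering rule described above. It is not the pypath implementation: the function name, the pathway names and all identifiers are invented for the example.

```python
def pathways_with_ppi(pathways, to_uniprot):
    """Keep pathways having at least one interaction where both partners translate."""
    kept = {}
    for pathway, interactions in pathways.items():
        translated = [
            (to_uniprot[a], to_uniprot[b])
            for a, b in interactions
            if a in to_uniprot and b in to_uniprot
        ]
        if translated:
            kept[pathway] = translated
    return kept


# invented toy data: one pathway with a protein-protein pair, one without
TOY_PATHWAYS = {
    'signaling_toy': [('geneA', 'geneB'), ('geneA', 'compoundX')],
    'metabolic_toy': [('compoundX', 'compoundY')],
}
TOY_ID_TABLE = {'geneA': 'P00001', 'geneB': 'P00002'}

kept = pathways_with_ppi(TOY_PATHWAYS, TOY_ID_TABLE)
print(kept)
```

The purely metabolic toy pathway drops out because none of its partners can be translated, which is the same kind of loss that would make the pathway count smaller than the one on the KEGG website.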

2. When I use the kegg_dbget function with the example queries from https://www.genome.jp/kegg/kegg3.html, I can't get results for some of them:


* KEGG pathway map: map00010

* Functional ortholog: K04527

* Gene / protein: hsa:3643, vg:155971, vp:155971-1, ag:CAA76703

* Enzyme: ec:2.7.10.1

What could be the reason for this? Or is there any way to reach those data?

from pypath.inputs import kegg
kegg_dict = kegg.kegg_dbget('map00010')  #https://www.genome.jp/entry/map00010
kegg_dict
**Traceback**
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_4520/2197661993.py in <module>
----> 1 kegg_dict = kegg.kegg_dbget('map00010')
      2 kegg_dict

~\anaconda3\envs\pypath\lib\site-packages\pypath\inputs\kegg.py in kegg_dbget(entry)
    750 
    751             collecting_ref = True
--> 752             last_ref['PMID'] = re.findall(r'\d+', td.text)[-1]
    753             continue
    754 

IndexError: list index out of range

kegg_dict = kegg.kegg_dbget('hsa:3643') #https://www.genome.jp/entry/hsa:3643
kegg_dict
{'Type': '3643\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0CDS\xa0\xa0\xa0\xa0\xa0\xa0\xa0T01001\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0',
 'Entry': '3643CDST01001\n'}

I think I made this function primarily for pathway pages, like this: https://www.kegg.jp/dbget-bin/www_bget?hsa04350, although I know there are many other types of entities and it would be nice to support them. Actually, this is just a small experiment; no one has ever used it:

def kegg_dbget(entry):

It processes the HTML page, and these nested tables are sometimes tricky to process consistently. I imagine you could easily extend this function, or even develop it into a class, but it all depends on the purpose: would it be useful for you? What do you want to use it for?

Ok, finally I found the reason why I wrote this function: processing KEGG-MEDICUS, this was the only way I could resolve some identifiers. This is the only use of it at the moment.
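
Two small helpers that might make the parsing more tolerant of the entry types reported above. This is a sketch, not a patch for kegg.py: both function names are hypothetical. The first avoids the IndexError when a reference cell contains no digits, the second collapses the runs of non-breaking spaces seen in the hsa:3643 result.

```python
import re


def last_number(text):
    """Return the last run of digits in text, or None if there is none."""
    numbers = re.findall(r'\d+', text)
    return numbers[-1] if numbers else None


def squeeze_whitespace(text):
    """Collapse any whitespace runs, including non-breaking spaces, into one space."""
    return ' '.join(text.split())


pmid = last_number('PMID:19825828')
missing = last_number('no reference number here')
entry_type = squeeze_whitespace('3643\xa0\xa0\xa0CDS\xa0\xa0T01001\xa0\xa0')
```

str.split() without arguments treats the non-breaking space as whitespace, so no explicit replacement of '\xa0' is needed.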

Best,

Denes

@elifcevrim
Collaborator

Hi Denes,

For example, I can't access N-Glycan biosynthesis (hsa00510) from the kegg.kegg_pathways() or kegg.kegg_interactions() functions. By the way, I don't know why I get 188 pathways rather than 203. I asked Erva to check it; she also gets 188 pathways.

We need to access gene/protein - pathway relations regardless of PPIs. Is there a way to reach the pathways without first looking at PPIs?

Besides, one piece of information that we want to get from the KEGG database is the gene - disease relation. We may access it from kegg.kegg_dbget() by searching for a gene ID, and we may also get a list of all genes with their attributes in the KEGG database. We will check if we can implement it in kegg.py.

@deeenes
Member

deeenes commented Jan 20, 2022

Hi Elif,

Hi Denes,

For example, I can't access N-Glycan biosynthesis (hsa00510) from the kegg.kegg_pathways() or kegg.kegg_interactions() functions. By the way, I don't know why I get 188 pathways rather than 203. I asked Erva to check it; she also gets 188 pathways.

That's really mysterious, if you send me your 188 pathways I can take a look.

We need to access gene/protein - pathway relations regardless of PPIs. Is there a way to reach the pathways without first looking at PPIs?

Not at the moment. But we iterate through all the pathways and just drop the ones which don't have any PPI that we can process. So it would be fairly straightforward to collect all the annotations, even not only for proteins, but for metabolites, for example. Do you want to try it?
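
As a starting point, collecting pathway membership independently of PPIs could look like the sketch below. It assumes tab-separated gene/pathway pairs with a "path:" prefix, which is the layout the KEGG REST link operation is believed to return; this should be checked against the KEGG documentation. The function name is hypothetical and the gene IDs in the sample are placeholders, not real KEGG content.

```python
from collections import defaultdict


def pathway_members(link_text):
    """Group entity IDs by pathway from lines of 'entity<TAB>path:pathway'."""
    members = defaultdict(set)
    for line in link_text.splitlines():
        if not line.strip():
            continue
        entity, pathway = line.split('\t')
        # drop the assumed 'path:' prefix, if present
        if pathway.startswith('path:'):
            pathway = pathway[len('path:'):]
        members[pathway].add(entity)
    return dict(members)


# placeholder pairs, for illustration only
SAMPLE = (
    'hsa:0001\tpath:hsa00510\n'
    'hsa:0002\tpath:hsa00510\n'
    'hsa:0001\tpath:hsa04350\n'
)
members = pathway_members(SAMPLE)
```

The same grouping would work for metabolites or other entities, as long as they come as two-column pairs.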

Besides, one piece of information that we want to get from the KEGG database is the gene - disease relation. We may access it from kegg.kegg_dbget() by searching for a gene ID, and we may also get a list of all genes with their attributes in the KEGG database. We will check if we can implement it in kegg.py.

Great, let me know if you need any help from me. If we extend in this direction, maybe better to make it a class instead of a single function.

Best,

Denes

@deeenes
Member

deeenes commented Jan 20, 2022

About diseases in KEGG: have you checked KEGG-MEDICUS? You can find it, with its URL, in pypath.

@elifcevrim
Collaborator

Sorry for the late reply. The pathway file (188 pathways) is attached.
kegg_path.csv

@ervau
Collaborator Author

ervau commented Apr 2, 2022

Hi Denes,
Which of the methods in go.py do you recommend for accessing GO annotations with their GO IDs, GO term names and reference IDs? Should we add references to go_annotations_solr, since go_annotations_quickgo is slow?

@deeenes
Member

deeenes commented Apr 4, 2022

Hi Erva,

In pypath we have pypath.inputs.go and pypath.utils.go. The former retrieves GO data from different sources, the latter provides classes for working with GO. For most use cases, utils is more suitable. I recently made a little tutorial about the latter, which could be converted into a nice notebook:

from pypath.utils import go

# first time it takes long due to downloads:
goa = go.GOAnnotation()

goa.select_by_term('GO:0034727')
# {'O95352', 'Q8TDY2', 'O75143'}

goa.ontology.terms_to_names(['GO:0034727'])
# [('GO:0034727', 'piecemeal microautophagy of the nucleus')]

goa.select_by_name('piecemeal microautophagy of the nucleus')
# {'O95352', 'Q8TDY2', 'O75143'}

goa.select('piecemeal microautophagy of the nucleus OR late nucleophagy')
# {'O95352', 'O75143', 'Q8TDY2', 'Q674R7', 'Q7Z3C6'}

goa.select('autophagy')
# {'A0A2U3TZJ3', 'A0PJW8', 'A1A4Y4', ...}

goa.ontology.names_to_terms(['autophagy'])
# [('GO:0006914', 'autophagy')]

goa.get_annots('O95352')
# {'GO:0000045', 'GO:0000407', 'GO:0000422', ...}

goa.ontology.terms_to_names(goa.get_annots('O95352'))
# [('GO:0032446', 'protein modification by small protein conjugation'),
#  ('GO:1903204', 'negative regulation of oxidative stress-induced ...'),
#  ...]

# the more specific term 'neutrophil degranulation'
# is in the annotations of O95352
'GO:0043312' in goa.get_annots('O95352')
# True
# the more generic parent term 'myeloid leukocyte activation' is not:
'GO:0002274' in goa.get_annots('O95352')
# False
# but if we let pypath traverse the graph from bottom to top, it will
# find it:
'GO:0002274' in goa.get_annots_ancestors('O95352')
# True

# among these terms, GO:0043312 is the more specific:
goa.ontology.lowest({'GO:0043312', 'GO:0002274'})
# {'GO:0043312'}
# though it isn't a leaf term, it has children:
goa.ontology.is_leaf('GO:0043312')
# False
# which are those children?
goa.ontology.get_all_descendants('GO:0043312')
# {'GO:0043313', 'GO:0043315', 'GO:0043312', 'GO:0043314'}
# the most specific among these children:
goa.ontology.lowest({'GO:0043313', 'GO:0043315', 'GO:0043312', 'GO:0043314'})
# {'GO:0043315', 'GO:0043314'}
# this one is actually a leaf node:
goa.ontology.is_leaf('GO:0043315')
# True

# similarly, we can look up the ancestors:
goa.ontology.get_all_ancestors('GO:0043312')
# {'GO:0001775', 'GO:0002252', 'GO:0002263', ...}
# the 'myeloid leukocyte activation' is among the ancestors:
'GO:0002274' in goa.ontology.get_all_ancestors('GO:0043312')
# True
# the top level among the ancestors should be a root term:
goa.ontology.highest(goa.ontology.get_all_ancestors('GO:0043312'))
# {'GO:0008150'}
# indeed, this is the root of the BP aspect:
go.ROOT_NODES['biological_process']
# 'GO:0008150'
goa.ontology.terms_to_names({'GO:0008150'})
# [('GO:0008150', 'biological_process')]

# in the ontology we have various relations. we can decide which ones to use,
# by default these are used:
goa.ontology.all_relations
# {'part_of', 'occurs_in', 'regulates', 'positively_regulates',
#  'negatively_regulates', 'is_a'}

# Some more complex query: AND, OR, NOT and parentheses can be used, terms and
# their names can be mixed, as well as ontology aspects:
q = """
signaling receptor activator activity AND
(extracellular region OR
cell surface OR
external side of plasma membrane OR
intrinsic component of plasma membrane)
"""
goa.select(q)
# {'A0A096LPE2', 'A0A0A6YY99', 'A0A0B4J2E2', ...}

I think above you see how to access annotations of proteins with ACs and term names in a variety of ways.
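
For readers who want to see what the bottom-to-top traversal amounts to, here is a toy version in plain Python. It is not how pypath implements it: all_ancestors is a hypothetical helper, and the parent table is a deliberately simplified graph that skips the intermediate terms of the real ontology.

```python
def all_ancestors(term, parents):
    """Collect every term reachable by following parent links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# simplified toy edges (child -> parents); the real graph has more levels
TOY_PARENTS = {
    'GO:0043312': {'GO:0002274'},
    'GO:0002274': {'GO:0008150'},
}

direct_annotations = {'GO:0043312'}
with_ancestors = set(direct_annotations)
for go_term in direct_annotations:
    with_ancestors |= all_ancestors(go_term, TOY_PARENTS)
```

The more generic term is absent from the direct annotations but present once the ancestors are added, mirroring the get_annots versus get_annots_ancestors results above.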

About the inputs: here you can see which functions actually work in this module and how long it takes to run them: https://status.omnipathdb.org/inputs/latest/#go-get_go_desc. I remember I had a lot of trouble downloading up-to-date GO data with reasonable performance. Goose and solr were great interfaces, but they have been abandoned. QuickGO is up-to-date, but slow. So I ended up with 3-4 implementations of the same thing. To make it less confusing, I added synonyms which point to the preferred functions, even if we sometimes change the implementation:

# synonym for the default method

As I see, utils.go uses QuickGO to import the ontology tree:

def _load_terms(self):

and GOA for annotations:

annot = go_input.go_annotations_goa(organism = organism)

This was my conclusion at that time (maybe 2 years ago?), that this is the best solution. As always, the download might take some minutes, but we even dump these objects to pickles, so loading them later is fast.
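
The pickle caching mentioned above follows a common pattern, sketched here in a generic form. This is not pypath's own cache code: load_or_build and the demo builder are hypothetical names, and the dict stands in for an expensive download.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Load an object from a pickle if it exists, otherwise build and store it."""
    if os.path.exists(cache_path):
        with open(cache_path, 'rb') as fp:
            return pickle.load(fp)
    obj = build()
    with open(cache_path, 'wb') as fp:
        pickle.dump(obj, fp)
    return obj


build_calls = []


def build_demo():
    # stands in for a slow download and processing step
    build_calls.append(1)
    return {'GO:0008150': 'biological_process'}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'go_demo.pickle')
    first = load_or_build(path, build_demo)   # builds and writes the pickle
    second = load_or_build(path, build_demo)  # reads the pickle, no rebuild
```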

Best,

Denes

@ervau
Collaborator Author

ervau commented Apr 8, 2022

Thank you so much for your detailed explanation.
I have another question. Currently I'm trying to retrieve some fields from UniProt, using the uniprot_data function of pypath.inputs.uniprot. One of the attributes I want to retrieve is the 'Variants' track of the Protein Feature Viewer in UniProt entries. As far as I understand, the UniProt website REST API does not provide this track, so I can't retrieve it using uniprot_data.
Is there a method that I haven't noticed to obtain this information with pypath? If not, we can add another function that retrieves large scale data source annotations (variants, proteomics and antigen tracks of the feature viewer) from https://www.ebi.ac.uk/proteins/api. Also I'm wondering if the coding style of the new functions we add should be similar to the one you use when adjusting my latest commit.

One last thing, do you think it would be useful if we add a function that gathers all the fields for a protein/all proteins of one organism? For example, this function can retrieve multiple fields using uniprot_data, preprocess them and return a dictionary with field names as keys and the information related to this field as values.

@deeenes
Member

deeenes commented Apr 17, 2022

Hi Erva,

Apologies for my late answer. That's a good question, and variant information is interesting for us too; it would be fantastic to retrieve it in an efficient way from UniProt. The fields that inputs.uniprot.uniprot_data is able to access are the fields that you can add to your UniProt search results, which are listed here: https://www.uniprot.org/help/uniprotkb_column_names. We see a number of sequence-related fields, for example feature(NATURAL_VARIANT) and feature(MUTAGENESIS). I've just taken a quick look at these fields:

from pypath.inputs import uniprot

nv = uniprot.uniprot_data('feature(NATURAL_VARIANT)')
mu = uniprot.uniprot_data('feature(MUTAGENESIS)')

nv['O43734']

'VARIANT 19;  /note="D -> N (in PSORS13; there is a reducing binding of this variant to TRAF6; dbSNP:rs33980500)";  /evidence="ECO:0000269|PubMed:20953186, ECO:0000269|PubMed:20953188";  /id="VAR_047349"; VARIANT 83;  /note="R -> W (in dbSNP:rs13190932)";  /id="VAR_031227"; VARIANT 332;  /note="H -> Q (in dbSNP:rs1043730)";  /evidence="ECO:0000269|PubMed:10962024, ECO:0000269|PubMed:14702039, ECO:0000269|PubMed:15489334, ECO:0000269|PubMed:17974005, ECO:0000269|Ref.7";  /id="VAR_024307"; VARIANT 536;  /note="T -> I (in CANDF8; abolishes homotypic interactions with the SEFIR domain of IL17RA, IL17RB and IL17RC; does not affect homodimerization; does not affect SEFIR-independent interactions with other proteins; dbSNP:rs397518485)";  /evidence="ECO:0000269|PubMed:24120361";  /id="VAR_070904"'

mu['O43734']

'MUTAGEN 303;  /note="L->G: Loss of E3 ubiquitin ligase activity.";  /evidence="ECO:0000269|PubMed:19825828"; MUTAGEN 318;  /note="P->G: Decreases E3 ubiquitin ligase activity.";  /evidence="ECO:0000269|PubMed:19825828"; MUTAGEN 319;  /note="V->R: Loss of E3 ubiquitin ligase activity.";  /evidence="ECO:0000269|PubMed:19825828"; MUTAGEN 324;  /note="L->R: Decreases E3 ubiquitin ligase activity.";  /evidence="ECO:0000269|PubMed:19825828"'

The 4 natural variants in this example are a small subset of all variants: https://www.uniprot.org/uniprot/O43734/protvista. These 4 are the ones with a Feature ID and with literature curated disease association; I think that's the reason why only these are included in the query result. The 4 further variants from mutagenesis are not shown in the feature viewer, but these too are literature curated. If you switch from Feature viewer to Feature table, you can see the same 8 variants listed: https://www.uniprot.org/uniprot/O43734#showFeaturesTable. If you toggle the "UniProt reviewed" option in the Feature viewer, you see the 4 curated natural variants. If you also add the "ClinVar reviewed" ones, those are much more numerous, and I couldn't find them in any UniProt query field. Apart from these, there are also the variants automatically extracted from large scale data; these are supposed to be the least important or least known ones.
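
The raw feature strings shown above can be split into records with a couple of regular expressions. The sketch below is not part of pypath: parse_features is a hypothetical helper, it only handles features with a single position (no ranges), and it assumes that the words VARIANT or MUTAGEN followed by a number and a semicolon never occur inside a note. The sample is the beginning of the O43734 string above.

```python
import re

# a feature starts with its type and position, and runs until the next one
FEATURE = re.compile(
    r'(VARIANT|MUTAGEN) (\d+);(.*?)(?=(?:VARIANT|MUTAGEN) \d+;|$)',
    re.S,
)
# qualifiers look like /name="value"; values may contain semicolons
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_features(raw):
    """Split a UniProt feature field into a list of dicts."""
    records = []
    for kind, position, body in FEATURE.findall(raw):
        record = {'type': kind, 'position': int(position)}
        record.update(QUALIFIER.findall(body))
        records.append(record)
    return records


SAMPLE = (
    'VARIANT 19;  /note="D -> N (in PSORS13; there is a reducing binding of '
    'this variant to TRAF6; dbSNP:rs33980500)";  '
    '/evidence="ECO:0000269|PubMed:20953186, ECO:0000269|PubMed:20953188";  '
    '/id="VAR_047349"; VARIANT 83;  /note="R -> W (in dbSNP:rs13190932)";  '
    '/id="VAR_031227"'
)
records = parse_features(SAMPLE)
```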

In summary, you are completely right, the full variant data is available only through the Proteins API. For the same example protein as above:

curl -s -X GET --header 'Accept:application/json' 'https://www.ebi.ac.uk/proteins/api/variation/O43734' | python -m json.tool | less

For the whole proteome of an organism (taxid=9606), the result is paginated (size=100):

curl -s -X GET --header 'Accept:application/json' 'https://www.ebi.ac.uk/proteins/api/variation?offset=0&size=100&taxid=9606' | python -m json.tool | less
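
The paginated query can be walked through with a simple loop over offsets. The sketch below keeps the network call out of the picture: fetch_page is a placeholder for whatever performs the HTTP request, the offset, size and taxid parameters are the ones from the URL above, and stopping at the first short page is an assumption about how the end of the data shows up.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ebi.ac.uk/proteins/api/variation'


def variation_url(offset, size, taxid):
    """Build the URL of one page of the variation query."""
    query = urlencode({'offset': offset, 'size': size, 'taxid': taxid})
    return BASE_URL + '?' + query


def iter_records(fetch_page, size=100):
    """Yield records page by page until a page comes back shorter than size."""
    offset = 0
    while True:
        page = fetch_page(offset, size)
        yield from page
        if len(page) < size:
            break
        offset += size


# a fake data source standing in for the web service
fake_records = list(range(250))


def fake_fetch(offset, size):
    return fake_records[offset:offset + size]


collected = list(iter_records(fake_fetch))
first_url = variation_url(0, 100, 9606)
```

In real use, fetch_page would download variation_url(offset, size, taxid) and return the decoded JSON list.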

To do this in pypath, we have the module inputs.ebi, which I've just updated to make it a truly generic client for these JSON-based web services. It relies on this function and the glom module for extracting the relevant fields from JSON.

On top of this generic client, we can implement specific ones, and I created one for the Proteins/variation query. The variation data in UniProt is enormous: retrieving the complete data for human takes hours and might fill many GBs of memory. I set sourceType to ['uniprot', 'mixed'] by default, omitting the data that comes only from large scale studies. This way we get ~242k variants in 20 min runtime, using ~1.3GB disk space for cache.

from pypath.inputs import proteins

v = proteins.variants()
sum(len(x.features) for x in v)
# 241561

# one record looks like this:
VariationRecord(
    uniprot='P00533',
    features=[
        {'type': 'VARIANT', 'begin': 98, 'end': 98, 'consequence': 'missense', 'wild_residue': 'R', 'mutated_residue': 'Q', 'somatic': False, 'evidence': 'mixed'},
        {'type': 'VARIANT', 'begin': 266, 'end': 266, 'consequence': 'missense', 'wild_residue': 'P', 'mutated_residue': 'R', 'somatic': False, 'evidence': 'mixed'},
        {'type': 'VARIANT', 'begin': 428, 'end': 428, 'consequence': 'missense', 'wild_residue': 'G', 'mutated_residue': 'D', 'somatic': False, 'evidence': 'mixed'},
        ...
    ]
)

Above you see some fields processed, more fields can be specified by the fields and feature_fields arguments.
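
Records of this shape are easy to filter in plain Python. In the sketch below, the namedtuple is only a stand-in built to mimic the record printed above (the real class lives in pypath), and features_in_range is a hypothetical helper.

```python
from collections import namedtuple

# stand-in for the record type shown above, not the pypath class itself
VariationRecord = namedtuple('VariationRecord', ['uniprot', 'features'])

record = VariationRecord(
    uniprot='P00533',
    features=[
        {'type': 'VARIANT', 'begin': 98, 'end': 98, 'consequence': 'missense',
         'wild_residue': 'R', 'mutated_residue': 'Q'},
        {'type': 'VARIANT', 'begin': 266, 'end': 266, 'consequence': 'missense',
         'wild_residue': 'P', 'mutated_residue': 'R'},
        {'type': 'VARIANT', 'begin': 428, 'end': 428, 'consequence': 'missense',
         'wild_residue': 'G', 'mutated_residue': 'D'},
    ],
)


def features_in_range(record, start, end):
    """Return the features overlapping the residue range start..end (inclusive)."""
    return [
        feature
        for feature in record.features
        if feature['begin'] <= end and feature['end'] >= start
    ]


hits = features_in_range(record, 200, 300)
```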

Also I'm wondering if the coding style of the new functions we add should be similar to the one you use when adjusting my latest commit.

Yes :)
We even have this little guide, which I have to update as we recently dropped all Python 2 support, and started to use type hinting.

One last thing, do you think it would be useful if we add a function that gathers all the fields for a protein/all proteins of one organism? For example, this function can retrieve multiple fields using uniprot_data, preprocess them and return a dictionary with field names as keys and the information related to this field as values.

Yes, would be definitely useful, and I think we have already a number of things for various purposes. For one (or few) protein(s) for example:

from pypath.utils import uniprot
uniprot.info(['P00533', 'O43734'])
=====> [2 proteins] <=====
╒═══════╤════════╤══════════════╤══════════╤══════════╤══════════════════════════════════╤══════════════════════════════════════════════╤════════════════════════════════════════════════╤════════════════════════════════════════════════╕
│   No. │ ac     │ genesymbol   │   length │   weight │ full_name                        │ function_or_genecards                        │ keywords                                       │ subcellular_location                           │
╞═══════╪════════╪══════════════╪══════════╪══════════╪══════════════════════════════════╪══════════════════════════════════════════════╪════════════════════════════════════════════════╪════════════════════════════════════════════════╡
│     1 │ P00533 │ EGFR         │     1210 │   134277 │ Epidermal growth factor receptor │ Receptor tyrosine kinase binding ligands of  │ 3D-structure, Alternative splicing, ATP-       │ Cell membrane; Single-pass type I membrane     │
│       │        │              │          │          │                                  │ the EGF family and activating several        │ binding, Cell membrane, Developmental protein, │ protein. Endoplasmic reticulum membrane;       │
│       │        │              │          │          │                                  │ signaling cascades to convert extracellular  │ Direct protein sequencing, Disease mutation,   │ Single-pass type I membrane protein. Golgi     │
│       │        │              │          │          │                                  │ cues into appropriate cellular responses     │ Disulfide bond, Endoplasmic reticulum,         │ apparatus membrane; Single-pass type I         │
│       │        │              │          │          │                                  │ (PubMed:2790960, PubMed:10805725,            │ Endosome, Glycoprotein, Golgi apparatus, Host  │ membrane protein. Nucleus membrane; Single-    │
│       │        │              │          │          │                                  │ PubMed:27153536). Known ligands include EGF, │ cell receptor for virus entry, Host-virus      │ pass type I membrane protein. Endosome.        │
│       │        │              │          │          │                                  │ TGFA/TGF-alpha, AREG, epigen/EPGN,           │ interaction, Isopeptide bond, Kinase,          │ Endosome membrane. Nucleus. Note=In response   │
│       │        │              │          │          │                                  │ BTC/betacellulin, epiregulin/EREG and        │ Lipoprotein, Membrane, Methylation,            │ to EGF, translocated from the cell membrane to │
│       │        │              │          │          │                                  │ HBEGF/heparin- binding EGF (PubMed:2790960,  │ Nucleotide-binding, Nucleus, Palmitate,        │ the nucleus via Golgi and ER                   │
│       │        │              │          │          │                                  │ PubMed:7679104, PubMed:8144591,              │ Phosphoprotein, Polymorphism, Proto-oncogene,  │ (PubMed:20674546). Endocytosed upon activation │
│       │        │              │          │          │                                  │ PubMed:9419975, PubMed:15611079,             │ Receptor, Reference proteome, Repeat,          │ by ligand (PubMed:2790960, PubMed:17182860,    │
│       │        │              │          │          │                                  │ PubMed:12297049, PubMed:27153536,            │ Secreted, Signal, Transferase, [...]           │ PubMed:27153536). Colocalized [...]            │
│       │        │              │          │          │                                  │ PubMed:20837704). Ligand binding [...]       │                                                │                                                │
├───────┼────────┼──────────────┼──────────┼──────────┼──────────────────────────────────┼──────────────────────────────────────────────┼────────────────────────────────────────────────┼────────────────────────────────────────────────┤
│     2 │ O43734 │ TRAF3IP2     │      574 │    64666 │ Adapter protein CIKS             │ Could be involved in the activation of both  │ Alternative splicing, Disease mutation,        │ None                                           │
│       │        │              │          │          │                                  │ NF-kappa-B via a NF-kappa-B inhibitor kinase │ Polymorphism, Reference proteome               │                                                │
│       │        │              │          │          │                                  │ (IKK)-dependent mechanism and stress-        │                                                │                                                │
│       │        │              │          │          │                                  │ activated protein kinase (SAPK)/JNK.         │                                                │                                                │
╘═══════╧════════╧══════════════╧══════════╧══════════╧══════════════════════════════════╧══════════════════════════════════════════════╧════════════════════════════════════════════════╧════════════════════════════════════════════════╛

This is based on retrieving UniProt datasheets for individual proteins:

egfr = uniprot.UniprotProtein('P00533')
egfr
# <UniProt datasheet P00533 (EGFR)>

# the datasheets have all fields from UniProt, some we expose as attributes, the rest is stored in its raw form:
egfr.disease
# "Lung cancer (LNCR) [MIM:211980]: A common malignancy affecting tissues of the lung. The most common form of lung cancer is..."

When retrieving data for all proteins, we typically need only a few particular fields; the full data for one organism would be huge. In the inputs.uniprot module the most important function is uniprot_data, as I mentioned in the beginning. One issue with UniProt data is that it's often not very clean: it has labels, IDs in square brackets, etc. (I mean things like /evidence="ECO:0000269|PubMed:19825828"; ). Hence in the inputs.uniprot module you see functions which clean certain fields (keywords, tissues, topology, families).
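
As an illustration of this kind of cleaning, the evidence qualifiers can be removed with one regular expression. This is a sketch rather than one of the existing cleaning functions in inputs.uniprot; strip_evidence is a hypothetical name, and the sample is the beginning of the mutagenesis string for O43734 shown earlier.

```python
import re

# matches an /evidence="..." qualifier with its leading spaces and trailing semicolon
EVIDENCE = re.compile(r'\s*/evidence="[^"]*";?')


def strip_evidence(raw):
    """Remove all /evidence="..." qualifiers from a UniProt feature string."""
    return EVIDENCE.sub('', raw)


RAW = (
    'MUTAGEN 303;  /note="L->G: Loss of E3 ubiquitin ligase activity.";  '
    '/evidence="ECO:0000269|PubMed:19825828"; '
    'MUTAGEN 318;  /note="P->G: Decreases E3 ubiquitin ligase activity.";  '
    '/evidence="ECO:0000269|PubMed:19825828"'
)
cleaned = strip_evidence(RAW)
print(cleaned)
```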

Further UniProt data is available by the uploadlists API, supported by utils.mapping in pypath (not sure if field names must be included here).

Overall, the Proteins API could provide efficient access to proteome-wide data which is not accessible by other UniProt APIs, and we have the fundamentals to use it. We can also look around for other variant resources. And we could discuss a broader strategy for dealing with variant data; we could have a call some time.

Best,

Denes
