Using Database Modules in Inputs #179
Comments
Hi Erva, That's weird, though here you are looking at it through an IDE or some other GUI tool. The modules are definitely there, and you could just try what happens if you try to import one which is clearly not on your list:
My guess is that your IDE generates the list just by getting the keys from the Overall, I think it's not a problem, just a limitation of your IDE, and you can just go ahead and import the modules you need. EC numbers: wait a bit, I'm gonna take a look. Best, Denes |
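A quick way to check whether a submodule exists regardless of what an IDE lists is to ask Python itself. The sketch below uses the standard library package `xml` as a stand-in so it runs anywhere; the helper name `list_submodules` is made up for illustration. Assuming `pypath.inputs` is a regular package on disk, passing it instead of `xml` should work the same way.

```python
import importlib
import pkgutil
import xml  # stand-in for a package such as pypath.inputs


def list_submodules(package):
    """Return the names of all submodules found on disk for a package."""
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


# Submodules discovered on disk, independent of any IDE completion list.
available = list_submodules(xml)
print(available)

# A submodule can be imported by name even if a completion list omits it.
dom = importlib.import_module('xml.dom')
print(dom.__name__)
```

If the import succeeds, the module is there and the missing entry is only a limitation of the tool showing the list.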
This fixed my problem. It was probably about IDE as you mentioned. Thanks! |
Great! EC numbers: it was not possible, but I've just added it: 61a205b Then you can do something like:
Starting from a gene symbol:
An alternative, suitable only if working with a small number of proteins:
Another alternative, a dict of ECs for the whole proteome:
|
Thanks for your help. Another suggestion: can we add evidence codes to the dictionary that go.go_annotations_goa() creates, with a small adjustment like the one below? The current version of the relevant part:
We can add evidence codes like:
We can also add this as an optional parameter. What do you think? |
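To make the optional-parameter idea concrete, here is a minimal standalone sketch. It is not the pypath implementation: the function name `collect_go_annotations`, the `evidence_codes` flag and the output layout (aspect, then UniProt AC, then a set of terms) are assumptions for illustration. The column positions follow the GAF 2.x format, and the two sample rows are made up.

```python
import collections

# Zero-based column positions in a GAF 2.x line.
COL_UNIPROT = 1
COL_GO_ID = 4
COL_EVIDENCE = 6
COL_ASPECT = 8


def collect_go_annotations(lines, evidence_codes=False):
    """Group GO terms by aspect and protein.

    With evidence_codes=True each value is a (GO ID, evidence code) tuple,
    otherwise just the GO ID, so existing callers keep the old behaviour.
    """
    result = {aspect: collections.defaultdict(set) for aspect in 'CFP'}
    for line in lines:
        # Skip blank lines and GAF header comments.
        if not line.strip() or line.startswith('!'):
            continue
        fields = line.rstrip('\n').split('\t')
        if len(fields) <= COL_ASPECT or fields[COL_ASPECT] not in result:
            continue
        term = fields[COL_GO_ID]
        value = (term, fields[COL_EVIDENCE]) if evidence_codes else term
        result[fields[COL_ASPECT]][fields[COL_UNIPROT]].add(value)
    return result


# Illustrative rows only, not real GOA records.
SAMPLE = [
    '!gaf-version: 2.2',
    'UniProtKB\tP04637\tTP53\tenables\tGO:0003677\tREF:example\tIDA\t\tF',
    'UniProtKB\tP04637\tTP53\tlocated_in\tGO:0005634\tREF:example\tIEA\t\tC',
]

plain = collect_go_annotations(SAMPLE)
with_evidence = collect_go_annotations(SAMPLE, evidence_codes=True)
print(plain['F']['P04637'])
print(with_evidence['F']['P04637'])
```

Keeping the flag off by default means the places that already use the function would not need to change.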
Hi Erva, That's a great idea, do you want to open a pull request? I see this function is used at 4 places in |
Hello Denes, I have some questions about KEGG database in pypath.
Here are the modules that I use.
a = pypath.core.annot.get_db()
kegg = a.annots['KEGG']
kegg.make_df()
kegg_df = kegg.df
kegg_df
len(kegg_df['value'].unique())

from pypath.inputs import kegg
import pandas as pd
kegg_int = kegg.kegg_interactions()
kegg_int = pd.DataFrame(kegg_int)
len(kegg_int.iloc[:,3].unique())
What could be the reason for this? Or is there any way to reach those data?
from pypath.inputs import kegg
kegg_dict = kegg.kegg_dbget('map00010') # https://www.genome.jp/entry/map00010
kegg_dict
**Traceback**
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_4520/2197661993.py in <module>
----> 1 kegg_dict = kegg.kegg_dbget('map00010')
2 kegg_dict
~\anaconda3\envs\pypath\lib\site-packages\pypath\inputs\kegg.py in kegg_dbget(entry)
750
751 collecting_ref = True
--> 752 last_ref['PMID'] = re.findall(r'\d+', td.text)[-1]
753 continue
754
IndexError: list index out of range

kegg_dict = kegg.kegg_dbget('hsa:3643') # https://www.genome.jp/entry/hsa:3643
kegg_dict {'Type': '3643\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0CDS\xa0\xa0\xa0\xa0\xa0\xa0\xa0T01001\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0',
'Entry': '3643CDST01001\n'} |
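Both symptoms above look like parsing edge cases: the traceback comes from indexing `[-1]` on an empty `re.findall` result, and the second output contains runs of non-breaking spaces (`\xa0`). A minimal sketch of how such cells could be handled defensively follows. The helper names are hypothetical and this is not the actual code in `kegg.py`.

```python
import re


def last_number(text):
    """Return the last run of digits in text, or None if there is none."""
    numbers = re.findall(r'\d+', text)
    return numbers[-1] if numbers else None


def split_nbsp_fields(text):
    """Split a cell on any whitespace, including non-breaking spaces."""
    # str.split() without arguments treats '\xa0' as whitespace.
    return text.split()


print(last_number('PMID:12345678'))
print(last_number('Reference'))
print(split_nbsp_fields('3643\xa0\xa0\xa0CDS\xa0\xa0T01001\xa0'))
```

Returning `None` instead of raising lets the caller decide whether a reference row without a PMID should be skipped or kept.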
Hi Elif,
Just a minor note: I think the line below builds the annotation database every time it's called, which takes very long. You can either load a previously built database, or load only the KEGG annotations (requiring much less memory!):
from pypath import omnipath
a = omnipath.db.get_db('annotations')
from pypath.core import annot
kegg = annot.KeggPathways()
For me both of these result in 203 (not surprising, as the data comes from the same place). In general, we process unambiguous protein-protein interactions from KEGG Pathways where we can translate both partners to UniProt. If a pathway doesn't contain any such interaction, it won't be in this dataset. I can imagine metabolic pathways are often like this, or pathways with virus proteins, etc. We can look into it more specifically if you give an example of an interaction or pathway that is missing.
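The filtering rule described above can be illustrated with toy data. This is only a sketch of the logic, not the pypath code: the function name, the identifiers and the pathway names are all made up.

```python
def pathways_with_translatable_ppi(interactions, id_to_uniprot):
    """Return pathways having at least one interaction where both
    partners can be translated to UniProt."""
    kept = set()
    for source, target, pathway in interactions:
        if source in id_to_uniprot and target in id_to_uniprot:
            kept.add(pathway)
    return kept


# Hypothetical ID mapping: only the two genes have a UniProt AC.
ID_TO_UNIPROT = {'geneA': 'UNIPROT_A', 'geneB': 'UNIPROT_B'}

# Hypothetical relations as (source, target, pathway) tuples.
INTERACTIONS = [
    ('geneA', 'geneB', 'signaling example'),
    ('geneA', 'compoundX', 'metabolic example'),
    ('compoundX', 'compoundY', 'metabolic example'),
]

kept = pathways_with_translatable_ppi(INTERACTIONS, ID_TO_UNIPROT)
print(kept)
```

The metabolic pathway disappears because none of its relations connects two translatable proteins, which is the same reason a pathway can be missing from the interaction-based dataset.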
I think I made this function primarily for pathway pages, like this: https://www.kegg.jp/dbget-bin/www_bget?hsa04350, although I know there are many other types of entities and it would be nice to support them. Actually this is just a small experiment, no one ever used it: pypath/src/pypath/inputs/kegg.py Line 692 in a5e0ca2
Ok, finally I found the reason why I wrote this function: processing KEGG-MEDICUS, this was the only way I could resolve some identifiers. This is the only use of it at the moment. Best, Denes |
Hi Denes, For example, I can't access N-Glycan biosynthesis (hsa00510) from the kegg.kegg_pathways() or kegg.kegg_interactions() functions. By the way, I don't know why I get 188 pathways rather than 203. I asked Erva to check it, and she also gets 188 pathways. We need to access gene/protein to pathway relations regardless of PPIs. Is there a way to reach the pathways without first looking at PPIs? Besides, one piece of information that we want to get from the KEGG database is the gene to disease relation. We may be able to access it from kegg.kegg_dbget() by searching for a gene ID, and we may also get a list of all genes with their attributes in the KEGG database. We will check if we can implement it in kegg.py. |
Hi Elif,
That's really mysterious, if you send me your 188 pathways I can take a look.
Not at the moment. But we iterate through all the pathways and just drop the ones which don't have any PPI that we can process. So it would be fairly straightforward to collect all the annotations, not only for proteins but also, for example, for metabolites. Do you want to try it?
Great, let me know if you need any help from me. If we extend in this direction, maybe better to make it a class instead of a single function. Best, Denes |
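As a starting point for the class idea, here is a minimal sketch of a container that records pathway membership for any kind of entity, independently of PPIs. The class name, its methods and the member identifiers are hypothetical; only the two pathway IDs come from this thread, and their members here are placeholders.

```python
import collections


class PathwayMembership:
    """Keeps pathway -> members and member -> pathways lookups in sync."""

    def __init__(self, records=()):
        self.members = collections.defaultdict(set)
        self.pathways = collections.defaultdict(set)
        for pathway, entity in records:
            self.add(pathway, entity)

    def add(self, pathway, entity):
        """Register one entity (gene, protein, metabolite) in a pathway."""
        self.members[pathway].add(entity)
        self.pathways[entity].add(pathway)

    def pathways_of(self, entity):
        """Return a copy of the pathway set; empty for unknown entities."""
        return set(self.pathways.get(entity, ()))


# Placeholder members, not real KEGG content.
membership = PathwayMembership([
    ('hsa00510', 'geneA'),
    ('hsa00510', 'compoundX'),
    ('hsa04350', 'geneA'),
    ('hsa04350', 'geneB'),
])
print(membership.pathways_of('geneA'))
print(membership.members['hsa00510'])
```

A class like this could later grow methods for diseases or other KEGG entity types without changing how the basic lookups are used.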
About diseases in KEGG: have you checked KEGG-MEDICUS? You can find it, with its URL, in pypath. |
Sorry for the late reply. The pathway file (188 pathways) is attached. |
Hi Denes, |
Hi Erva, In
I think above you see how to access annotations of proteins with ACs and term names in a variety of ways. About the Line 98 in c907cbb
utils.go uses QuickGO to import the ontology tree: Line 131 in c907cbb
Line 573 in c907cbb
Best, Denes |
Thank you so much for your detailed explanation. One last thing, do you think it would be useful if we add a function that gathers all the fields for a protein/all proteins of one organism? For example, this function can retrieve multiple fields using |
Hi Erva, Apologies for my late answer. That's a good question, and variant information is interesting for us too, it would be fantastic to retrieve it in an efficient way from UniProt. The fields that
The 4 natural variants in this example are a small subset of all variants: https://www.uniprot.org/uniprot/O43734/protvista These 4 are the ones with a Feature ID and a literature curated disease association; I think that's the reason why only these are included in the query result. The 4 further variants from mutagenesis are not shown in the feature viewer, but these too are literature curated. If you switch from Feature viewer to Feature table, you can see the same 8 variants listed: https://www.uniprot.org/uniprot/O43734#showFeaturesTable If you toggle the "UniProt reviewed" option in the Feature viewer, you see the 4 curated natural variants. If you also add the "ClinVar reviewed" ones, those are much more numerous, and I couldn't find them in any UniProt query field. Apart from these, there are also the variants automatically extracted from large scale data; these are supposed to be the least important or least known ones. In summary, you are completely right: the full variant data is available only through the Proteins API. For the same example protein as above:
For the whole proteome of an organism (
To do this in pypath, we have the module On top of this generic client, we can implement specific ones, and I created one for the Proteins/variation query. The variation data in UniProt is enormous, retrieving the complete data for human takes hours and might fill many GBs of memory. I set
Above you see some fields processed, more fields can be specified by the
Yes :)
Yes, would be definitely useful, and I think we have already a number of things for various purposes. For one (or few) protein(s) for example:
This is based on retrieving UniProt datasheets for individual proteins:
When retrieving data for all proteins, we typically need only a few particular fields; the full data for one organism would be huge. In the Further UniProt data is available by the Overall, the Proteins API could provide efficient access to proteome-wide data which is otherwise not accessible through other UniProt APIs, and we have the fundamentals to use it. We can also look around for other variant resources. And we could discuss a broader strategy for dealing with variant data; we could have a call some time. Best, Denes |
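On the memory concern with proteome-wide variant data: one general way to keep memory use low is to stream records page by page and aggregate on the fly. The sketch below is generic and makes several assumptions: that the remote service supports offset and size style paging, that `fetch_page` is a stand-in for the real HTTP call, and that the records (here entirely fake) carry a `source` field. It is not the pypath client.

```python
import collections


def iter_records(fetch_page, page_size=100):
    """Yield records page by page so only one page is held in memory."""
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        yield from page
        # A short (or empty) page means there is nothing left to fetch.
        if len(page) < page_size:
            return
        offset += page_size


# Fake data source standing in for a remote API: 250 made-up records.
FAKE_RECORDS = [
    {'id': i, 'source': 'curated' if i % 10 == 0 else 'large_scale'}
    for i in range(250)
]
requested_offsets = []


def fake_fetch_page(offset, size):
    """Hypothetical page fetcher; a real one would issue an HTTP request."""
    requested_offsets.append(offset)
    return FAKE_RECORDS[offset:offset + size]


# Aggregate while streaming instead of building one huge list.
counts = collections.Counter(
    record['source'] for record in iter_records(fake_fetch_page)
)
print(counts)
print(requested_offsets)
```

The same generator could feed a filter that keeps only curated variants, so the large-scale ones never need to be stored.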
Hi, I cannot reach some of the modules in the "inputs" module. The ones I can use are shown below. Is there another way to use the other ones?
Also, I couldn't find a module to retrieve EC numbers for proteins. Is there a way to do this?
OS: macOS
Python version: 3.8
Version or commit hash: v0.13.13