Providing access to protein annotations and general note #43

Open
jorainer opened this issue May 4, 2023 · 2 comments
Labels
enhancement New feature or request

Comments


jorainer commented May 4, 2023

Description of feature

Dear developers! Great work you're doing!

Just to introduce myself: I'm the developer of the ensembldb package and am maintaining and adding new EnsDb databases to Bioconductor's AnnotationHub for each new Ensembl release.

As a general note: the EnsDbs also provide protein annotations (amino acid sequences, (functional) protein domains and some mappings to Uniprot identifiers). These are not general GenomicFeatures features but are specific to the ensembldb package, which also allows mapping positions directly from the protein to the transcript to the genome (and vice versa).
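To illustrate what mapping between protein, transcript and genome positions involves, here is a minimal, self-contained sketch of the coordinate arithmetic. It is not ensembldb code: the function names `protein_to_cds` and `cds_to_genome` and the toy exon coordinates are made up for illustration, and it assumes a forward-strand transcript with 1-based inclusive coordinates.

```python
from typing import List, Tuple


def protein_to_cds(aa_pos: int) -> Tuple[int, int]:
    """Map a 1-based amino acid position to the 1-based, inclusive
    nucleotide range of its codon within the coding sequence (CDS)."""
    if aa_pos < 1:
        raise ValueError("amino acid positions are 1-based")
    start = 3 * (aa_pos - 1) + 1
    return start, start + 2


def cds_to_genome(cds_pos: int, cds_exons: List[Tuple[int, int]]) -> int:
    """Map a 1-based CDS position to a genomic coordinate.

    cds_exons holds the coding parts of the exons in transcript order as
    (start, end) genomic coordinates. Forward strand only (an assumption
    of this sketch).
    """
    if cds_pos < 1:
        raise ValueError("CDS positions are 1-based")
    remaining = cds_pos
    for start, end in cds_exons:
        length = end - start + 1
        if remaining <= length:
            return start + remaining - 1
        remaining -= length  # position lies in a later exon
    raise ValueError("position lies beyond the end of the CDS")


# Hypothetical transcript: two coding exon parts, 6 + 9 = 15 nt (5 codons).
toy_cds_exons = [(100, 105), (200, 208)]

codon_start, codon_end = protein_to_cds(3)  # third amino acid
genomic = [cds_to_genome(p, toy_cds_exons)
           for p in range(codon_start, codon_end + 1)]
print(codon_start, codon_end, genomic)
```

A real implementation would also have to handle reverse-strand transcripts, codons split across exon boundaries and incomplete CDS annotations.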

Also, please don't hesitate to ask if something about the EnsDb database layouts is unclear or if you have feature requests.

@jorainer jorainer added the enhancement New feature or request label May 4, 2023
ivirshup (Member) commented May 5, 2023

Thanks! Also thanks for making the package and huge thanks for making this great resource available in such a reasonable format!

> As a general note: the EnsDbs would also provide protein annotations (amino acid sequences, (functional) protein domains and some mapping to Uniprot identifiers).

Yup, we've actually already got some of this working right now (using some stuff in #31)

```python
import genomic_features as gf

ensdb = gf.ensembl.annotation("Hsapiens", "108")
ensdb.genes(
    cols=["gene_id", "gene_name", "tx_id", "uniprot_id"],
    filter=(gf.filters.CanonicalFilter() & gf.filters.GeneBioTypeFilter("protein_coding"))
).head()
```

```
           gene_id gene_name            tx_id     uniprot_id    gene_biotype  tx_is_canonical
0  ENSG00000000003    TSPAN6  ENST00000373020     O43657.176  protein_coding                1
1  ENSG00000000005      TNMD  ENST00000373031     Q9H2S6.148  protein_coding                1
2  ENSG00000000419      DPM1  ENST00000371588  A0A0S2Z4Y5.30  protein_coding                1
3  ENSG00000000419      DPM1  ENST00000371588     O60762.199  protein_coding                1
4  ENSG00000000457     SCYL3  ENST00000367771     Q8IZE3.171  protein_coding                1
```

I have to confess that I'm not terribly familiar with common uses or access patterns for this protein information, so any tips would be appreciated!


> Also, please don't hesitate to ask if something about the EnsDb database layouts is unclear or if you have feature requests.

Honestly, so far it has been super self-explanatory and easy to figure out.

I would be interested in knowing if you had any plans for schema updates, or anything like that we should be aware of.

One "feature" request

There was one thing that came up during our testing that I'd like to request. In #16, we saw that some Ensembl versions were missing from AnnotationHub and were instead bundled with the Bioconductor annotation packages.

Would it be possible/easy to upload these versions to AnnotationHub?

jorainer (Author) commented May 8, 2023

> I have to confess that I'm not terribly familiar with common uses or access patterns for this protein information, so any tips would be appreciated!

As long as Ensembl IDs are used (ENSG..., ENST..., ENSP...), everything is pretty simple. There is (AFAIK) one Ensembl protein ID (ENSP...) assigned to one transcript ID (ENST...), so it's a straightforward 1:1 mapping. With Uniprot it's trickier, because there is no 1:1 mapping between Ensembl proteins and Uniprot: one Ensembl protein can be annotated to none, one or multiple Uniprot IDs. So, if possible, try to do the joins based on Ensembl IDs.
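A small sketch of why this matters for joins, using an in-memory SQLite database. The two-table layout is a simplified toy and not the actual EnsDb schema. The protein IDs `ENSP_A`/`ENSP_B` and the transcript ID `ENST_B` are made-up placeholders. Only the DPM1 transcript ID and its two Uniprot IDs are taken from the output earlier in this thread.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Toy layout (an assumption for illustration, not the real EnsDb schema).
con.executescript("""
CREATE TABLE protein (protein_id TEXT PRIMARY KEY, tx_id TEXT UNIQUE);
CREATE TABLE uniprot (protein_id TEXT, uniprot_id TEXT);
""")
con.executemany(
    "INSERT INTO protein VALUES (?, ?)",
    [
        ("ENSP_A", "ENST00000371588"),  # placeholder protein ID
        ("ENSP_B", "ENST_B"),           # placeholder IDs, no Uniprot entry
    ],
)
con.executemany(
    "INSERT INTO uniprot VALUES (?, ?)",
    [
        ("ENSP_A", "A0A0S2Z4Y5.30"),
        ("ENSP_A", "O60762.199"),
    ],
)

# Transcript <-> protein is 1:1, so this yields one row per transcript.
tx_protein = con.execute(
    "SELECT tx_id, protein_id FROM protein ORDER BY tx_id"
).fetchall()

# Adding Uniprot IDs duplicates rows (two IDs for ENSP_A) and leaves
# NULLs (None in Python) for proteins without any Uniprot annotation.
with_uniprot = con.execute("""
    SELECT p.tx_id, p.protein_id, u.uniprot_id
    FROM protein AS p
    LEFT JOIN uniprot AS u ON u.protein_id = p.protein_id
    ORDER BY p.tx_id, u.uniprot_id
""").fetchall()

print(len(tx_protein), len(with_uniprot))
for row in with_uniprot:
    print(row)
```

Keeping the Ensembl IDs as the join keys and attaching Uniprot IDs only as a final step avoids silently multiplying rows in intermediate results.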

> I would be interested in knowing if you had any plans for schema updates, or anything like that we should be aware of.

No schema changes are planned. In the past I also tried to keep the main schema the same and only add, e.g., new columns to individual tables.
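If columns are only ever added, a client can stay compatible across database versions by checking which columns a table has before querying. A minimal sketch with SQLite's `PRAGMA table_info`; the helper names are hypothetical, and the "old" and "new" table layouts below are toy assumptions, not a statement about which EnsDb release added which column.

```python
import sqlite3
from typing import List


def table_columns(con: sqlite3.Connection, table: str) -> List[str]:
    """Return the column names of a table.

    PRAGMA table_info yields rows of
    (cid, name, type, notnull, dflt_value, pk).
    """
    return [row[1] for row in con.execute(f'PRAGMA table_info("{table}")')]


def available_columns(con: sqlite3.Connection, table: str,
                      wanted: List[str]) -> List[str]:
    """Keep only the requested columns that exist in this database."""
    present = set(table_columns(con, table))
    return [col for col in wanted if col in present]


# Two toy databases: an "old" layout and a "new" one with an extra column.
old_db = sqlite3.connect(":memory:")
old_db.execute("CREATE TABLE tx (tx_id TEXT)")

new_db = sqlite3.connect(":memory:")
new_db.execute("CREATE TABLE tx (tx_id TEXT, tx_is_canonical INTEGER)")

wanted = ["tx_id", "tx_is_canonical"]
old_cols = available_columns(old_db, "tx", wanted)
new_cols = available_columns(new_db, "tx", wanted)
print(old_cols, new_cols)
```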

Regarding the missing Ensembl versions: it would be possible to create EnsDb databases for these, but I would do that only for some specific releases, to avoid adding too many databases to AnnotationHub that might never be used...
