Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

Open
kotliary opened this issue Jun 14, 2022 · 4 comments

Comments

@kotliary
Copy link

Below is an example and the output. The input sequence was obtains by overlap with CDS segments of the gene.
I looked at the segment in the Genome Browser and found the stop codon.

Can genomeToProtein ignore stop codon?

packageVersion("ensembldb")
#> [1] '2.14.1'
packageVersion("EnsDb.Hsapiens.v75")
#> [1] '2.99.0'

library(ensembldb)
library(EnsDb.Hsapiens.v75)

g = GRanges(seqnames = "3", ranges = IRanges(start = 178952140, end = 178952152), strand = "+")
genomeToProtein(g, EnsDb.Hsapiens.v75)
#> Warning: Provided coordinates for 'ENST00000263967' are not within the coding
#> region
#> IRangesList object of length 1:
#> [[1]]
#> IRanges object with 1 range and 9 metadata columns:
#>           start       end     width |           tx_id    cds_ok           tx_id
#>       <integer> <integer> <integer> |     <character> <logical>     <character>
#>   [1]        -1        -1         1 | ENST00000263967      <NA> ENST00000263967
#>               exon_id exon_rank seq_start   seq_end    seq_name  seq_strand
#>           <character> <integer> <integer> <integer> <character> <character>
#>   [1] ENSE00001139987        21 178952140 178952152           3           +

end(g) = end(g) - 3 # removing the stop codon (last 3 bases)
genomeToProtein(g, EnsDb.Hsapiens.v75)
#> IRangesList object of length 1:
#> [[1]]
#> IRanges object with 1 range and 9 metadata columns:
#>                       start       end     width |           tx_id    cds_ok
#>                   <integer> <integer> <integer> |     <character> <logical>
#>   ENSP00000263967      1065      1068         4 | ENST00000263967      TRUE
#>                             tx_id         exon_id exon_rank seq_start   seq_end
#>                       <character>     <character> <integer> <integer> <integer>
#>   ENSP00000263967 ENST00000263967 ENSE00001139987        21 178952140 178952149
#>                      seq_name  seq_strand
#>                   <character> <character>
#>   ENSP00000263967           3           +

Created on 2022-06-14 by the reprex package (v2.0.0)

@jorainer
Copy link
Owner

AFAIK stop codons don't get translated. Thus I ignore the stop codon as any (RNA) position within the stop codon can not be mapped to an amino acid in the protein sequence (such positions would basically be outside the protein sequence).

Does this make sense? Would you suggest an alternative behaviour?

@kotliary
Copy link
Author

Thanks for the response. You are correct that stop codon is not translated. As my example shows, if the input range contains stop codon, it's not excluded from the query and the whole range doesn't map to protein. Actually, if a range has even a single base outside AA coding region, it doesn't map to protein.

The big question - is it possible to do automatically a partial mapping of a genomic range to protein ignoring non-coding regions? Ideally, if I provide a large range covering multiple exons, I'd like to receive a list of ranges with protein coordinates (probably grouped by transcripts). This would be really cool.

@jorainer
Copy link
Owner

the main use case for the genomeToProtein was more to map small regions or even single positions to protein coordinates. What you are proposing might be pretty difficult (especially the partial mapping).

If you would provide an example with input and expected output I could start thinking about how to implement - but I'm currently busy with other topics, so that might be more a long term solution.

Alternatively, you could also implement that code part (I would maybe suggest as a new, separate function) and contribute it to ensembldb (I'm always open for collaborations and have no problem adding contributors to the package).

@kotliary
Copy link
Author

I am working on a script for partial mapping on a gene-by-gene basis. I'll think about how to wrap it into a function. Thanks for the suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants