genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

kotliary · 2022-06-14T20:22:18Z

Below is an example and the output. The input sequence was obtains by overlap with CDS segments of the gene.
I looked at the segment in the Genome Browser and found the stop codon.

Can genomeToProtein ignore stop codon?

packageVersion("ensembldb")
#> [1] '2.14.1'
packageVersion("EnsDb.Hsapiens.v75")
#> [1] '2.99.0'

library(ensembldb)
library(EnsDb.Hsapiens.v75)

g = GRanges(seqnames = "3", ranges = IRanges(start = 178952140, end = 178952152), strand = "+")
genomeToProtein(g, EnsDb.Hsapiens.v75)
#> Warning: Provided coordinates for 'ENST00000263967' are not within the coding
#> region
#> IRangesList object of length 1:
#> [[1]]
#> IRanges object with 1 range and 9 metadata columns:
#>           start       end     width |           tx_id    cds_ok           tx_id
#>       <integer> <integer> <integer> |     <character> <logical>     <character>
#>   [1]        -1        -1         1 | ENST00000263967      <NA> ENST00000263967
#>               exon_id exon_rank seq_start   seq_end    seq_name  seq_strand
#>           <character> <integer> <integer> <integer> <character> <character>
#>   [1] ENSE00001139987        21 178952140 178952152           3           +

end(g) = end(g) - 3 # removing the stop codon (last 3 bases)
genomeToProtein(g, EnsDb.Hsapiens.v75)
#> IRangesList object of length 1:
#> [[1]]
#> IRanges object with 1 range and 9 metadata columns:
#>                       start       end     width |           tx_id    cds_ok
#>                   <integer> <integer> <integer> |     <character> <logical>
#>   ENSP00000263967      1065      1068         4 | ENST00000263967      TRUE
#>                             tx_id         exon_id exon_rank seq_start   seq_end
#>                       <character>     <character> <integer> <integer> <integer>
#>   ENSP00000263967 ENST00000263967 ENSE00001139987        21 178952140 178952149
#>                      seq_name  seq_strand
#>                   <character> <character>
#>   ENSP00000263967           3           +

^{Created on 2022-06-14 by the reprex package (v2.0.0)}

jorainer · 2022-06-16T09:33:07Z

AFAIK stop codons don't get translated. Thus I ignore the stop codon as any (RNA) position within the stop codon can not be mapped to an amino acid in the protein sequence (such positions would basically be outside the protein sequence).

Does this make sense? Would you suggest an alternative behaviour?

kotliary · 2022-06-16T14:58:47Z

Thanks for the response. You are correct that stop codon is not translated. As my example shows, if the input range contains stop codon, it's not excluded from the query and the whole range doesn't map to protein. Actually, if a range has even a single base outside AA coding region, it doesn't map to protein.

The big question - is it possible to do automatically a partial mapping of a genomic range to protein ignoring non-coding regions? Ideally, if I provide a large range covering multiple exons, I'd like to receive a list of ranges with protein coordinates (probably grouped by transcripts). This would be really cool.

jorainer · 2022-06-17T06:51:50Z

the main use case for the genomeToProtein was more to map small regions or even single positions to protein coordinates. What you are proposing might be pretty difficult (especially the partial mapping).

If you would provide an example with input and expected output I could start thinking about how to implement - but I'm currently busy with other topics, so that might be more a long term solution.

Alternatively, you could also implement that code part (I would maybe suggest as a new, separate function) and contribute it to ensembldb (I'm always open for collaborations and have no problem adding contributors to the package).

kotliary · 2022-06-17T14:42:10Z

I am working on a script for partial mapping on a gene-by-gene basis. I'll think about how to wrap it into a function. Thanks for the suggestion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

kotliary commented Jun 14, 2022

jorainer commented Jun 16, 2022

kotliary commented Jun 16, 2022

jorainer commented Jun 17, 2022

kotliary commented Jun 17, 2022

genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

genomeToProtein doesn't match protein if input CDS range contains a stop codon #137

Comments

kotliary commented Jun 14, 2022

jorainer commented Jun 16, 2022

kotliary commented Jun 16, 2022

jorainer commented Jun 17, 2022

kotliary commented Jun 17, 2022