CDS phase (frame offset for eg ribo slippage) not taken into account in amino acid translation #732

Open
davmlaw opened this issue Mar 13, 2024 · 0 comments
Labels
bug Something isn't working data provider schema change

davmlaw commented Mar 13, 2024

The GFF format has a "phase" column on CDS features (values 0, 1, or 2) that alters the reading frame of exons and therefore the translation to amino acids.

The UTA/DataProvider transcript annotation format does not currently contain this information, so I believe it will need to be added, and the HGVS code then modified to take it into account when converting to p. (similar to how alignment gaps are handled between g. and c.)

Example annotation

from ref_GRCh37.p10_top_level.gff3 (phase is the "1" after the "+"):

NC_000007.13	RefSeq  CDS 	94292646    	94293825    	.   	+   	1   	ID=cds13063;Name=NP_001165908.1;Parent=rna16954;Note=isoform 3 is encoded by transcript variant 2;Dbxref=GeneID:23089,Genbank:NP_001165908.1,HGNC:14005,MIM:609810;exception=ribosomal slippage;gbkey=CDS;product=retrotransposon-derived protein PEG10 isoform 3;protein_id=NP_001165908.1
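To make the column layout concrete, here is a minimal sketch of pulling the phase out of a line like the one above. It assumes the standard nine tab-separated GFF3 columns; `parse_gff3_line` and `GFF3_COLUMNS` are hypothetical names for illustration only, not part of hgvs, UTA, or cdot, and the attributes column is abbreviated.

```python
# Hypothetical helper for illustration; not an hgvs/UTA/cdot API.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 8 is "." when no phase applies (non-CDS features).
    record["phase"] = None if record["phase"] == "." else int(record["phase"])
    return record


# The PEG10 CDS line from above, with the attributes column shortened.
example_line = "\t".join([
    "NC_000007.13", "RefSeq", "CDS", "94292646", "94293825",
    ".", "+", "1",
    "ID=cds13063;Name=NP_001165908.1;Parent=rna16954;exception=ribosomal slippage",
])

record = parse_gff3_line(example_line)
print(record["type"], record["strand"], record["phase"])
```

For this line the phase comes back as the integer 1, which is the value the transcript annotation format currently has nowhere to store.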

Column 8: "phase"

For features of type "CDS", the phase indicates where the next codon begins relative to the 5' end (where the 5' end of the CDS is relative to the strand of the CDS feature) of the current CDS feature. For clarification the 5' end for CDS features on the plus strand is the feature's start and the 5' end for CDS features on the minus strand is the feature's end. The phase is one of the integers 0, 1, or 2, indicating the number of bases forward from the start of the current CDS feature the next codon begins. A phase of "0" indicates that a codon begins on the first nucleotide of the CDS feature (i.e. 0 bases forward), a phase of "1" indicates that the codon begins at the second nucleotide of this CDS feature and a phase of "2" indicates that the codon begins at the third nucleotide of this region. Note that ‘Phase’ in the context of a GFF3 CDS feature should not be confused with the similar concept of frame that is also a common concept in bioinformatics. Frame is generally calculated as a value for a given base relative to the start of the complete open reading frame (ORF) or the codon (e.g. modulo 3) while CDS phase describes the start of the next codon relative to a given CDS feature.

The phase is REQUIRED for all CDS features.
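As a rough illustration of the definition quoted above (not how hgvs translates today, and not a proposed patch), the sketch below skips `phase` bases at the 5' end of a single CDS segment before reading codons. `translate_cds_segment` is a hypothetical name, the input is assumed to already be in 5'-to-3' coding orientation (so minus-strand features would need reverse-complementing first), and the toy sequence is made up.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds_segment(seq, phase):
    """Translate one CDS segment, starting `phase` bases in from its 5' end.

    Hypothetical helper: `seq` must be uppercase DNA in coding orientation.
    Trailing bases that do not fill a whole codon are ignored.
    """
    if phase not in (0, 1, 2):
        raise ValueError("GFF3 CDS phase must be 0, 1, or 2")
    in_frame = seq[phase:]
    usable = len(in_frame) - len(in_frame) % 3
    return "".join(
        CODON_TABLE[in_frame[i:i + 3]] for i in range(0, usable, 3)
    )


toy_segment = "ATGGCCAAG"  # made-up sequence, not PEG10
for phase in (0, 1, 2):
    print(phase, translate_cds_segment(toy_segment, phase))
```

The same nucleotides give a different peptide for each phase, which is why ignoring the phase column produces the wrong p. result for transcripts such as the ribosomal-slippage example above.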

This was originally raised by holtgrewe on the cdot project.

@davmlaw davmlaw added bug Something isn't working data provider schema change labels Mar 13, 2024