Storing definitions for custom tags used in SAM file #710

Open
cmdcolin opened this issue Mar 22, 2023 · 1 comment

Comments

@cmdcolin

cmdcolin commented Mar 22, 2023

I was wondering whether there is a way, or a specification, for SAM headers to describe the custom tags a file uses, for example the lowercase and X/Y/Z-prefixed tags. My angle is simply showing users at a glance what the various fields mean in a genome browser, but I can imagine it being useful in other circumstances.

VCF kind of has this with e.g. "1.4.4 Individual format field format", which allows a file to self-describe the custom fields in its FORMAT column.

It could make it easier for a human to understand a data file at a glance. Possible caveats:

  • some fields require lengthy descriptions to begin to explain them
  • if it is free text, it may not be very 'semantic' or machine-parseable. A bit of a tangent, but VCF's CSQ is an example: what I think should be a machine-readable description is stored in a 'human-readable' field, so to meaningfully parse the CSQ field a program has to split the header's CSQ description on the text after "Format:".
Examples of ANN and CSQ headers:
##INFO=<ID=ANN,Number=1,Type=String,Description="Functional annotations:'Allele|Annotation|Annotation_Impact|Gene_Name|Gene_ID|Feature_Type|Feature_ID|Transcript_BioType|Rank|HGVS.c|HGVS.p|cDNA.pos / cDNA.length|CDS.pos / CDS.length|AA.pos / AA.length|Distance|ERRORS / WARNINGS / INFO'">

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|SIFT|PolyPhen|AF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
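The "split the description after Format:" step mentioned above could look roughly like the following. This is only a sketch, not how any particular tool does it: the function name `subfield_names` is made up for illustration, the CSQ header is shortened from the one above, the record value is invented, and the single-quote handling for ANN is an assumption based purely on the ANN header shown above.

```python
import re


def subfield_names(info_header_line):
    """Return the pipe-separated sub-field names embedded in an INFO Description.

    Hypothetical helper: handles the two layouts seen in the headers above,
    'Format: A|B|C' (CSQ) and a list wrapped in single quotes (ANN).
    """
    match = re.search(r'Description="(.*)"', info_header_line)
    if match is None:
        return []
    description = match.group(1)
    if "Format:" in description:
        # CSQ style: everything after "Format:" is the field list.
        spec = description.split("Format:", 1)[1]
    elif "'" in description:
        # ANN style: the field list sits between the first and last single quote.
        spec = description.split("'", 1)[1].rsplit("'", 1)[0]
    else:
        return []
    return [name.strip() for name in spec.split("|")]


# Shortened CSQ header, for illustration only.
csq_header = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL">'
)
names = subfield_names(csq_header)

# A made-up CSQ value, decoded by pairing it with the names from the header.
decoded = dict(zip(names, "T|missense_variant|MODERATE|BRCA2".split("|")))
print(decoded["Consequence"])
```

It only works because the program knows the free-text conventions of each annotation tool in advance, which is exactly the caveat.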

@jkbonfield

I like this idea, but sadly it doesn't currently exist.

It'd need to be in @CO (comment) header lines to avoid breaking existing parsers that validate the headers, at least until that mythical time when we develop SAM 2.0. That's not ideal, but we are where we are.

I guess we could carve out a namespace within @CO for additional commentary, e.g.:

@CO	@TAG	ID:X0	TY:i	DS:Number of best hits

You're perfectly at liberty to start doing this already, although it'd obviously need buy-in from the genome browsers. I'm not sure we'd want to add something formal to the specification unless we see active buy-in from multiple implementations.
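For anyone wanting to experiment, a reader for the `@CO	@TAG` lines sketched above might look something like this. It is not part of any specification: the function name `parse_tag_definitions` is hypothetical, and the ID/TY/DS keys and tab-separated layout are assumed from the single example line in this comment.

```python
def parse_tag_definitions(header_text):
    """Collect tag definitions from '@CO<TAB>@TAG<TAB>...' header lines.

    Assumes the informal convention suggested in this thread; ordinary
    @CO comments and all other header lines are ignored.
    """
    definitions = {}
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[:2] != ["@CO", "@TAG"]:
            continue
        # Split each KEY:VALUE on the first colon only, since a DS
        # description may itself contain colons.
        record = dict(f.split(":", 1) for f in fields[2:] if ":" in f)
        if "ID" in record:
            definitions[record["ID"]] = record
    return definitions


# Illustrative header text; only the @CO @TAG line is picked up.
header = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@CO\t@TAG\tID:X0\tTY:i\tDS:Number of best hits\n"
    "@CO\tan ordinary free-text comment\n"
)
tags = parse_tag_definitions(header)
print(tags["X0"]["DS"])
```

Parsers that don't know about the convention still see a valid @CO comment line, which is the point of putting it there.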
