Storing definitions for custom tags used in SAM file #710

cmdcolin · 2023-03-22T16:25:00Z

I was wondering if there was a way or specification for SAM headers to describe what custom tags they are using, for example the lower case and X/Y/Z prefixed tags. My angle on this is just showing users at a glance what various fields mean in a genome browser, but I can imagine it being useful in other circumstances.

VCF kind of has this, e.g. "1.4.4 Individual format field format", which allows a file to self-describe the custom fields in its FORMAT column.

It could possibly make it easier for a human to understand a data file at a glance. Possible caveats:

- some fields require lengthy descriptions to begin to explain them
- if it is free text, it may not be very 'semantic' or 'machine-parseable'. Bit of a tangent, but in the example of VCF, CSQ is one of these things where what I think should be a machine-readable description is stored in this 'human-readable' field, e.g. to meaningfully parse the CSQ field, a program needs to take the text after "Format:" in the VCF header description of CSQ and split it on "|"

Examples of ANN and CSQ header lines:

##INFO=<ID=ANN,Number=1,Type=String,Description="Functional annotations:'Allele|Annotation|Annotation_Impact|Gene_Name|Gene_ID|Feature_Type|Feature_ID|Transcript_BioType|Rank|HGVS.c|HGVS.p|cDNA.pos / cDNA.length|CDS.pos / CDS.length|AA.pos / AA.length|Distance|ERRORS / WARNINGS / INFO'">

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|SIFT|PolyPhen|AF|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS">
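To make the CSQ caveat concrete, here is a minimal Python sketch of the kind of parsing a program has to do today: pull the Description string out of the header line, keep the text after "Format:", and split it on "|". The function names `csq_field_names` and `parse_csq_value` are hypothetical, and the sketch assumes that multiple annotations in one CSQ value are comma-separated and that sub-fields are pipe-separated, as in the header above.

```python
import re


def csq_field_names(header_line):
    """Return the sub-field names listed after 'Format:' in a CSQ INFO header line."""
    match = re.search(r'Description="([^"]*)"', header_line)
    if match is None:
        raise ValueError("no Description attribute found in header line")
    description = match.group(1)
    # Everything after the literal marker "Format:" is the pipe-separated field list.
    _, marker, field_list = description.partition("Format:")
    if not marker:
        raise ValueError("Description does not contain 'Format:'")
    return [name.strip() for name in field_list.split("|")]


def parse_csq_value(value, names):
    """Turn one CSQ INFO value into a list of dicts, one per annotation.

    Assumption: annotations are comma-separated, sub-fields pipe-separated.
    """
    return [dict(zip(names, entry.split("|"))) for entry in value.split(",")]


if __name__ == "__main__":
    header = (
        '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence '
        'annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT">'
    )
    names = csq_field_names(header)
    print(names)
    print(parse_csq_value("A|missense_variant|MODERATE", names))
```

The fragile part is that the field list lives inside free text, so any change to the wording around "Format:" breaks the parser.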


jkbonfield · 2023-03-22T16:40:18Z

I like this idea, but sadly currently it doesn't exist.

It'd need to be in @CO (comment) header lines to avoid breaking existing parsers that validate the headers, at least until that mythical time we develop SAM 2.0. That's not ideal, but we are where we are.

I guess we could carve out a namespace within @CO for additional commentary. E.g.:

@CO	@TAG	ID:X0	TY:i	DS:Number of best hits

You're perfectly at liberty to start doing this already, although it'd obviously need buy-in from the genome browsers. I'm not sure we'd want to add something formal to the specification unless we see active buy-in from multiple implementations.
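For anyone who wants to experiment, a reader for this convention could look roughly like the Python sketch below. It is not part of the SAM specification: the `@CO`/`@TAG` layout and the `ID`, `TY` and `DS` keys are taken from the example line above, the fields are assumed to be tab-separated, and the function name `parse_tag_definitions` is hypothetical.

```python
def parse_tag_definitions(header_text):
    """Collect tag definitions from '@CO<TAB>@TAG<TAB>KEY:value...' header lines.

    Assumption: this follows the informal convention sketched in this thread,
    not anything defined by the SAM specification.
    """
    definitions = {}
    for line in header_text.splitlines():
        fields = line.split("\t")
        # Only @CO lines whose first comment field is the literal marker @TAG.
        if len(fields) < 3 or fields[0] != "@CO" or fields[1] != "@TAG":
            continue
        # Split each KEY:value pair on the first colon only, so that
        # descriptions may themselves contain colons.
        record = dict(f.split(":", 1) for f in fields[2:] if ":" in f)
        if "ID" in record:
            definitions[record["ID"]] = record
    return definitions


if __name__ == "__main__":
    header = (
        "@HD\tVN:1.6\n"
        "@CO\t@TAG\tID:X0\tTY:i\tDS:Number of best hits\n"
        "@CO\tan ordinary free-text comment\n"
    )
    for tag, record in parse_tag_definitions(header).items():
        print(tag, record.get("TY"), record.get("DS"))
```

Ordinary @CO comments are skipped, so files that do not use the convention simply yield an empty dictionary.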

cmdcolin mentioned this issue Mar 22, 2023

Alignments tags description in feature details on mouseover GMOD/jbrowse-components#3604

Merged

jkbonfield added sam sam-tags labels Mar 27, 2023
