Investigate how best to incorporate CSQ Annotations #124

danielecook · 2020-02-24T23:50:50Z

CSQ annotations are 'haplotype-aware' in the sense that they can incorporate multiple variants when determining the predicted effect. For example, imagine you observe a SNP that gives rise to a CAG --> TAG change. This looks like a premature stop codon if you only consider one variant at a time. But it's also possible there is an adjacent SNP in the same codon (an A --> C change at the second base, which on its own would give CAG --> CCG, i.e. Gln --> Pro). Together the two variants make this an MNP (multi-nucleotide polymorphism) with a CAG --> TCG change, which results in only a Gln --> Ser substitution (still potentially bad, but not as bad as a stop codon).

CSQ annotations, in contrast to something like SnpEff, are able to pick up this critical difference.
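To make the difference concrete, here is a minimal sketch that translates the codon with each SNP applied alone and then with both applied together. The codon offsets and the tiny codon table are assumptions chosen to fit the CAG example; this is not how bcftools csq is implemented, just an illustration of why per-variant and per-haplotype predictions can disagree.

```python
# Codon table restricted to the codons that occur in this example.
CODON_TO_AA = {"CAG": "Gln", "TAG": "Stop", "CCG": "Pro", "TCG": "Ser"}


def apply_snps(codon, snps):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


ref = "CAG"
snp_a = (0, "T")  # C>T at the first codon position (assumed offset)
snp_b = (1, "C")  # A>C at the second codon position (assumed offset)

cases = [
    ("SNP A alone", [snp_a]),
    ("SNP B alone", [snp_b]),
    ("A+B on one haplotype", [snp_a, snp_b]),
]
for label, snps in cases:
    alt = apply_snps(ref, snps)
    print(f"{label}: {ref} -> {alt} ({CODON_TO_AA[ref]} -> {CODON_TO_AA[alt]})")
```

A per-variant annotator only sees the first two cases; a haplotype-aware one reports the third.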

There are a couple of issues with CSQ annotations.

1.) The PD1074 genome isn't quite ready for them. They require a good-quality GFF file like the one you see here. In fact, the GFFs on Ensembl are the only ones I was previously able to get working with the bcftools csq command. The easiest option I see here will probably be to lift those over to PD1074, unless we can get the WormBase one to work.

2.) While the annotations are haplotype-aware, they can be expressed in any order: the full annotation sits on one of the contributing records and the others point to it. See this example from the bcftools manual:

    # Two separate VCF records at positions 2:122106101 and 2:122106102
    # change the same codon. This UV-induced C>T dinucleotide mutation
    # has been annotated fully at the position 2:122106101 with
    #   - consequence type
    #   - gene name
    #   - ensembl transcript ID
    #   - coding strand (+ fwd, - rev)
    #   - amino acid position (in the coding strand orientation)
    #   - list of corresponding VCF variants
    # The annotation at the second position gives the position of the full
    # annotation
    BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
    BCSQ=@122106101

    # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
    BCSQ=@46115084
    BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG

    # Stop gained variant
    BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T

    # The consequence type of a variant downstream from a stop are prefixed with *
    BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C

Note that in the first example the pointer (@122106101) references an upstream variant, whereas in the second example the pointer (@46115084) references a downstream one.
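A rough sketch of how the pointer notation could be resolved when loading annotations, so that either record leads to the full consequence. It assumes the six-field layout shown in the manual excerpt above (output from a given bcftools version may carry additional fields), ignores chromosome, and keys records by position only. The `Bcsq` class, `parse_bcsq`, and `resolve` are hypothetical names, and placing the second example's pointer at 46115072 is inferred from its variant list.

```python
from dataclasses import dataclass


@dataclass
class Bcsq:
    consequence: str
    gene: str
    transcript: str
    strand: str
    aa_change: str
    variants: list
    downstream_of_stop: bool = False


def parse_bcsq(value):
    """Parse one BCSQ value; return an int position for '@pos' pointers."""
    if value.startswith("@"):
        return int(value[1:])
    fields = value.split("|")
    if len(fields) != 6:
        raise ValueError(f"unexpected BCSQ layout: {value!r}")
    consequence, gene, transcript, strand, aa_change, variants = fields
    # A leading '*' marks a consequence downstream of a stop.
    downstream = consequence.startswith("*")
    return Bcsq(consequence.lstrip("*"), gene, transcript, strand,
                aa_change, variants.split("+"), downstream)


def resolve(records, pos):
    """Return the full annotation for pos, following one '@pos' pointer."""
    parsed = parse_bcsq(records[pos])
    if isinstance(parsed, int):
        parsed = parse_bcsq(records[parsed])
    return parsed


# Raw BCSQ values keyed by position, taken from the examples above.
records = {
    122106101: "missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A",
    122106102: "@122106101",  # points upstream
    46115072: "@46115084",    # points downstream
    46115084: "inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG",
}

for pos in sorted(records):
    full = resolve(records, pos)
    print(pos, full.consequence, full.gene, full.variants)
```

Because the pointer can point either way, the lookup has to be done against the whole set of records rather than in a single forward pass that only remembers earlier positions.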

There are actually two issues here:

How do we represent these types of variants? The genome browser currently lists every variant and its predicted consequences. My thinking is we should probably link the @1234... notation to the variant of interest and color it as a 'reference' row. Maybe when you highlight the actual variant we can somehow trigger both to light up.

Related issues

How do these get incorporated into cegwas2?
Investigate options for CSQ annotations wi-gatk#14

Notes

Because the genotypes we are looking at are clonal, we can infer the phase of genotypes (hint: they are already phased!)
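For reference, a sketch of how a bcftools csq invocation might be assembled with an explicit phase mode. The file names are placeholders, and `build_csq_command` is a hypothetical helper. The `--phase a` setting (take genotypes as they are and build haplotypes regardless of phase) is one option that fits the clonal assumption, but which mode is appropriate here is still to be decided.

```python
import shlex


def build_csq_command(fasta, gff, vcf_in, vcf_out, phase="a"):
    """Build the argv list for a bcftools csq run (nothing is executed here)."""
    return [
        "bcftools", "csq",
        "--fasta-ref", fasta,   # reference genome the GFF coordinates refer to
        "--gff-annot", gff,     # GFF3 annotation
        "--phase", phase,       # how unphased heterozygous genotypes are handled
        "--output-type", "z",   # compressed VCF
        "--output", vcf_out,
        vcf_in,
    ]


# Placeholder paths; substitute the real reference, annotation, and VCF.
cmd = build_csq_command("reference.fa", "annotation.gff3.gz",
                        "input.vcf.gz", "output.csq.vcf.gz")
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once the inputs exist.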

danielecook mentioned this issue Feb 24, 2020

Investigate options for CSQ annotations AndersenLab/wi-gatk#14

Open

danielecook assigned danrlu Feb 25, 2020
