Investigate how best to incorporate CSQ Annotations #124

Open
2 tasks
danielecook opened this issue Feb 24, 2020 · 0 comments

danielecook commented Feb 24, 2020

CSQ annotations are 'haplotype-aware' in the sense that they can incorporate multiple variants when determining the predicted effect. For example, imagine you observe a SNP that gives rise to a CAG --> TAG change. This looks like a premature stop codon if you only consider one variant at a time. But it's also possible there is an adjacent SNP on the same haplotype (say A --> C at the second base), so the codon actually changes CAG --> TCG - which would make this an MNP (multi-nucleotide polymorphism), but more importantly only results in a Gln --> Ser change (still bad, but not as bad as a stop codon).

CSQ annotations, in contrast to something like SnpEff, are able to pick up this critical difference.
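
To make the difference concrete, here is a minimal Python sketch of annotating each SNP on its own versus annotating both on one haplotype. The codon table is deliberately tiny, and the base offsets and the `apply_snps` helper are assumptions made for illustration; this is not how bcftools implements it.

```python
# Only the codons needed for this example (standard genetic code).
CODON_TABLE = {"CAG": "Gln", "TAG": "Stop", "CCG": "Pro", "TCG": "Ser"}


def apply_snps(codon, snps):
    """Return the codon after applying each (offset, alt_base) substitution."""
    bases = list(codon)
    for offset, alt in snps:
        bases[offset] = alt
    return "".join(bases)


ref = "CAG"
snp_a = (0, "T")  # hypothetical C>T at the first base of the codon
snp_b = (1, "C")  # hypothetical A>C at the second base of the codon

# One variant at a time: what a non-haplotype-aware annotator sees.
single_a = apply_snps(ref, [snp_a])
single_b = apply_snps(ref, [snp_b])

# Both variants on the same haplotype: what a haplotype-aware annotator sees.
combined = apply_snps(ref, [snp_a, snp_b])

for label, codon in [("SNP A alone", single_a),
                     ("SNP B alone", single_b),
                     ("A + B together", combined)]:
    print(f"{label}: {ref} -> {codon} ({CODON_TABLE[ref]} -> {CODON_TABLE[codon]})")
```

Taken alone, the first SNP is called as a stop gain; taken together with its neighbour, the same site is a missense change.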


There are a couple of issues with CSQ annotations.

1.) The PD1074 genome isn't quite ready for them. They require a good quality GFF file like the one you see here. In fact, the GFFs on Ensembl are the only ones I was previously able to get to work with the bcftools csq command. The easiest option I see here will probably be to lift those over to PD1074, unless we can get the WormBase one to work.

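2.) While the annotations are haplotype-aware, the full annotation for a multi-variant consequence can sit on any one of the records involved, with the others only pointing to it. See this example from the bcftools manual: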

    # Two separate VCF records at positions 2:122106101 and 2:122106102
    # change the same codon. This UV-induced C>T dinucleotide mutation
    # has been annotated fully at the position 2:122106101 with
    #   - consequence type
    #   - gene name
    #   - ensembl transcript ID
    #   - coding strand (+ fwd, - rev)
    #   - amino acid position (in the coding strand orientation)
    #   - list of corresponding VCF variants
    # The annotation at the second position gives the position of the full
    # annotation
    BCSQ=missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A
    BCSQ=@122106101

    # A frame-restoring combination of two frameshift insertions C>CG and T>TGG
    BCSQ=@46115084
    BCSQ=inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG

    # Stop gained variant
    BCSQ=stop_gained|C2orf83|ENST00000264387|-|141W>141*|228476140C>T

    # The consequence type of a variant downstream from a stop are prefixed with *
    BCSQ=*missense|PER3|ENST00000361923|+|1028M>1028T|7890117T>C

Note that in the first example the @ reference points to an upstream record (the full annotation sits at the earlier position), whereas in the second example it points to a downstream one.
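
Since the @ reference can point in either direction, anything that displays these annotations has to resolve it by position rather than by record order. Below is a rough Python sketch of that lookup, assuming the six-field layout shown in the manual excerpt above and a single BCSQ entry per record. The `Bcsq`, `parse_bcsq` and `resolve` names and the `records` dictionary are hypothetical, and other bcftools versions may emit additional fields.

```python
from typing import NamedTuple


class Bcsq(NamedTuple):
    consequence: str
    gene: str
    transcript: str
    strand: str
    aa_change: str
    variants: str


def parse_bcsq(value):
    """Parse one BCSQ entry: ('ref', position) for '@pos', else ('full', Bcsq)."""
    if value.startswith("@"):
        return ("ref", int(value[1:]))
    fields = value.split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}: {value!r}")
    return ("full", Bcsq(*fields))


def resolve(records, pos):
    """Return the full annotation for pos, following an '@pos' pointer if present."""
    kind, payload = parse_bcsq(records[pos])
    if kind == "ref":
        kind, payload = parse_bcsq(records[payload])
        if kind != "full":
            raise ValueError(f"record {pos} does not resolve to a full annotation")
    return payload


# Position -> BCSQ value, taken from the manual excerpt above.
records = {
    122106101: "missense|CLASP1|ENST00000545861|-|1174P>1174L|122106101G>A+122106102G>A",
    122106102: "@122106101",  # points upstream
    46115072: "@46115084",    # points downstream
    46115084: "inframe_insertion|COPZ2|ENST00000006101|-|18AGRGP>18AQAGGP|46115072C>CG+46115084T>TGG",
}

print(resolve(records, 122106102).gene)
print(resolve(records, 46115072).consequence)
```

The same lookup could back the 'reference' row idea: the row for a pointer record would link to whichever position `resolve` follows.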

There are actually two issues here:

How do we represent these types of variants? The genome browser currently lists every variant and predicted consequences. My thinking is we should probably link the @1234... notation to the variant of interest and color it as a 'reference' row. Maybe when you highlight the actual variant we can somehow trigger both to light up.

Related issues

Notes

  • Because the genotypes we are looking at are clonal, we can infer the phase of genotypes (hint: they are already phased!)