Structural variant vcf annotation #97

EugeneEA · 2022-02-16T13:07:12Z

Hi, I've come across the problem that oc does not annotate SV VCFs. Are there plans to support SVs in the future, or maybe there is a workaround?
The common line format:
chr1 964964 20 N <DEL> 137.6 . SVTYPE=DEL;SVLEN=-366;END=965330;STRANDS=+-:10;IMPRECISE;CIPOS=-30 ...etc

Best, Eugene

cariaso · 2022-02-16T14:12:49Z

What sort of annotation would you hope to see? Names of spanned genes? +?

EugeneEA · 2022-02-16T14:19:44Z

Yes, basically annotation with a genomic feature (gene, exon/intron, UTR, etc.) and the possible consequence on gene expression (e.g. a frameshift in an exon, or an exon deletion/duplication/inversion, whatever).

VEP handles SV VCFs, so its annotation can be taken as an example.

cariaso · 2022-02-16T16:49:21Z

https://github.com/Illumina/Nirvana might also be relevant?

EugeneEA · 2022-02-17T07:28:05Z

Probably, I have not tried it

rkimoakbioinformatics · 2022-02-17T07:49:24Z

@EugeneEA Hi, yes, there is a plan to add support for SV, CNV, etc. in the future. It may be in this repo or a fork.

EugeneEA · 2022-02-25T13:23:37Z

@rkimoakbioinformatics Thanks for the answer, but as far as I understand, this is not planned for the near future, only for later development?

rkimoakbioinformatics · 2022-02-26T14:39:20Z

@EugeneEA I would like to start a discussion on it. Can you let me know what kind of output columns you would need? Something like the following?

+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chrom | start     | end       | ref | alt   | all_mappings                                                                |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chr10 | 121593023 | 121603287 | N   | <DEL> | {"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]], |
|       |           |           |     |       |  "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}   |
| chr5  | 95849345  | 95853945  | N   | <DUP> | {"GENE3": [["P00003", "", "copy_number_gain", "ENST000009482.1", ""]]}      |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+

For imprecise structural variants, would you still want to see predicted protein sequence change?

EugeneEA · 2022-03-09T06:39:09Z

@rkimoakbioinformatics sorry for the long delay. Yes, that would definitely be sufficient for starters. The tricky part is probably the field "transcript_ablation" etc. Maybe an additional column should be added here, for example listing the deleted exons?

rkimoakbioinformatics · 2022-03-10T07:34:33Z

@EugeneEA Thanks. Below is a sketch. The current format of all_mappings is difficult to parse, sort, and filter. Thus, I suggest putting each transcript on a separate line, something like:

+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chrom | start    | end      | strand | ref     | alt      | imprecise | gene  | sequence ontology                           | transcript     | exon |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038273 | 2,3  |
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038284 | 2,3  |
| chr1  | 2784394  | 2984393  | +      | 20000nt | -        | imprecise | GENE3 | deletion,transcript_ablation                | ENST0000061234 |      |
| chr1  | 394823   | 399822   | +      | 5000nt  | 10000nt  | imprecise | GENE5 | duplication,partially_duplicated_transcript | ENST0000047283 | 4    |
| chr1  | 38584598 | 38683853 | +      | 99256nt | 198512nt | imprecise | GENE6 | duplication,transcript_amplification        | ENST0000038482 |      |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
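As a rough illustration of the "one transcript per line" idea, the nested all_mappings value could be flattened along these lines. This is only a sketch: `flatten_all_mappings` is a hypothetical helper, not an OpenCRAVAT function, and it assumes that position 2 of each entry holds the sequence ontology and position 3 the transcript, as in the example earlier in this thread. The meaning of the other positions is not assumed.

```python
import json


def flatten_all_mappings(variant, all_mappings_json):
    """Return one flat row per (gene, transcript) pair.

    `variant` is a dict of per-variant columns (chrom, start, end, ...).
    `all_mappings_json` is a JSON string shaped like the example above.
    """
    rows = []
    for gene, entries in json.loads(all_mappings_json).items():
        for entry in entries:
            # Assumption: index 2 = sequence ontology, index 3 = transcript.
            rows.append(
                {
                    **variant,
                    "gene": gene,
                    "sequence_ontology": entry[2],
                    "transcript": entry[3],
                }
            )
    return rows


variant = {"chrom": "chr10", "start": 121593023, "end": 121603287,
           "ref": "N", "alt": "<DEL>"}
all_mappings = (
    '{"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]], '
    '"GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}'
)
rows = flatten_all_mappings(variant, all_mappings)
for row in rows:
    print(row["gene"], row["transcript"], row["sequence_ontology"])
```

Flat rows like these can be sorted and filtered with ordinary table tools, which is the motivation for the layout above.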

A VCF format specification document has a few structural variant examples:

#CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001
1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682    . T <DEL>   6 PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62  GT:GQ 0/1:12
2 14477084  . C <DEL:ME:ALU>  12  PASS  IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ 0/1:12
3 9425916   . C <INS:ME:L1> 23  PASS  IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100  . A <DUP>   14  PASS  IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  . T <DUP:TANDEM>  11  PASS  IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10  GT:GQ:CN:CNQ  ./.:0:5:8.3

Turning this into something like:

+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chrom | start    | end      | strand | ref                                                                   | alt     | imprecise | gene   | sequence ontology              |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chr1  | 2827693  | 2827693  | +      | CGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA | -       |           | GENE7  | deletion                       |
| chr2  | 321682   | 321887   | +      | 206nt                                                                 | -       | imprecise | GENE8  | deletion                       |
| chr2  | 14477084 | 14477085 | +      | -                                                                     | 297nt   | imprecise | GENE9  | insertion,Alu_insertion        |
| chr3  | 9425916  | 9425917  | +      | -                                                                     | 6027nt  | imprecise | GENE10 | insertion,LINE1_insertion      |
| chr3  | 12665101 | 12686200 | +      | 21100nt                                                               | 42200nt | imprecise | GENE11 | duplication                    |
| chr4  | 18665128 | 18665204 | +      | 77nt                                                                  | 154nt   | imprecise | GENE12 | duplication,tandem_duplication |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+

Of course, INFO fields should be parsed and recorded in other columns.
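A minimal sketch of how the INFO fields might be pulled out into separate columns, using only the Python standard library. The function names and the output column names are made up for illustration. It assumes a single ALT allele per record and splits on any whitespace, which is a simplification (real VCF is tab-delimited).

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags without '=' map to True."""
    fields = {}
    for item in info.split(";"):
        if not item:
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields


def _ints(value):
    """Parse a comma-separated list of integers, e.g. '-56,20'."""
    return [int(x) for x in value.split(",")]


def sv_columns(vcf_line):
    """Turn one VCF data line into a dict of SV-related columns."""
    chrom, pos, _id, ref, alt, _qual, _filter, info = vcf_line.split()[:8]
    f = parse_info(info)
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alt": alt,
        "svtype": f.get("SVTYPE"),
        "end": int(f["END"]) if "END" in f else None,
        # Assumption: one ALT allele, so only the first SVLEN value is used.
        "svlen": _ints(f["SVLEN"])[0] if "SVLEN" in f else None,
        "imprecise": "IMPRECISE" in f,
        "cipos": _ints(f["CIPOS"]) if "CIPOS" in f else None,
        "ciend": _ints(f["CIEND"]) if "CIEND" in f else None,
    }


line = ("2 321682 . T <DEL> 6 PASS "
        "IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62 "
        "GT:GQ 0/1:12")
row = sv_columns(line)
print(row)
```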

Would something like the above work for your purposes? Any feedback/suggestion would be appreciated.

EugeneEA · 2022-03-15T14:08:25Z

@rkimoakbioinformatics thanks a lot for the replies! OK, that looks a bit too verbose. Can we select the major transcript as we do for SNPs?

Sequence ontology is an extremely useful field, but its aggregation in VEP (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) simplifies filtering quite a bit; maybe it is worth implementing (also for SNPs).

Would these variants be annotated if they are present in some of the annotators (ClinVar, for example)? (I know that these are basically indels, but it still might be useful.)

rkimoakbioinformatics · 2022-03-22T13:32:18Z

@EugeneEA Yes, if the variants are in ClinVar or any other OpenCRAVAT annotation source, they will be annotated. I am not sure yet how imprecise variants are treated in annotation data sources, but that will be the spirit.

As far as I know, VEP outputs sequence ontologies for each transcript on separate lines in its native output format, or on the same line delimited in the VCF format. I am not aware of aggregation by VEP - does it aggregate? If you mean, by aggregation, something like showing all sequence ontologies from all the variants for a transcript together, that has been planned but we haven't gotten to work on it yet.

By selecting a major transcript, do you mean OpenCRAVAT's current style of showing the mutation consequence on a representative transcript (either a MANE one or a custom choice for a gene) in one column, and the consequences on all the other transcripts where the variant falls in another column?

EugeneEA · 2022-03-28T06:56:31Z

@rkimoakbioinformatics
By "VEP aggregation" I meant that they summarise sequence ontology from 30+ fields into 4 (high, moderate, low, modifier) and provide it as an additional info column. It is a useful feature for medical people. Even though it is trivial to implement, it might be worth including in the default output.
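To make the idea concrete, here is a minimal sketch of such an aggregation: each sequence ontology term maps to one of the four impact levels, and the most severe level across a variant's terms is reported. The mapping below is only a small illustrative subset of VEP's consequence table, and treating unknown terms as MODIFIER is an assumption of this sketch, not VEP behaviour.

```python
# Severity order, least to most severe.
IMPACT_RANK = {"MODIFIER": 0, "LOW": 1, "MODERATE": 2, "HIGH": 3}

# Illustrative subset only; the full list is in the VEP documentation.
SO_TO_IMPACT = {
    "transcript_ablation": "HIGH",
    "transcript_amplification": "HIGH",
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}


def summarize_impact(so_terms, default="MODIFIER"):
    """Return the most severe impact level among the given SO terms."""
    # Assumption: terms missing from the table fall back to `default`.
    impacts = [SO_TO_IMPACT.get(term, default) for term in so_terms]
    return max(impacts, key=IMPACT_RANK.__getitem__, default=default)


print(summarize_impact(["intron_variant", "transcript_ablation"]))
print(summarize_impact(["synonymous_variant", "missense_variant"]))
```

A single impact column like this can then be used directly as a filter.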

Yes, that is exactly what I meant: the consequence on either a MANE or custom transcript, and the rest goes to another column.
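For illustration, the split between a representative transcript and "the rest" could look roughly like this. It is a sketch, not OpenCRAVAT code: `split_primary` and the `preferred_transcripts` set (standing in for a MANE list or a custom per-gene choice) are hypothetical, and falling back to the first transcript seen when no preferred one is present is an arbitrary choice.

```python
def split_primary(rows, preferred_transcripts):
    """Pick one representative row per gene; return (primary, others).

    `rows` are per-transcript dicts with "gene" and "transcript" keys.
    `preferred_transcripts` is a set of transcript IDs to favour.
    """
    primary, others = {}, {}
    for row in rows:
        gene = row["gene"]
        others.setdefault(gene, [])
        if gene not in primary:
            # First transcript seen is the provisional representative.
            primary[gene] = row
        elif (row["transcript"] in preferred_transcripts
              and primary[gene]["transcript"] not in preferred_transcripts):
            # A preferred transcript displaces a non-preferred one.
            others[gene].append(primary[gene])
            primary[gene] = row
        else:
            others[gene].append(row)
    return primary, others


rows = [
    {"gene": "GENE2", "transcript": "ENST0000038273", "so": "exon_loss_variant"},
    {"gene": "GENE2", "transcript": "ENST0000038284", "so": "exon_loss_variant"},
    {"gene": "GENE3", "transcript": "ENST0000061234", "so": "transcript_ablation"},
]
# Hypothetical preferred set; in practice this would come from MANE or user config.
primary, others = split_primary(rows, {"ENST0000038284"})
print(primary["GENE2"]["transcript"], [r["transcript"] for r in others["GENE2"]])
```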

RachelKarchin · 2024-02-23T19:51:45Z

Hi EugeneEA,

Just to catch you up, Rick Kim is no longer on the OpenCRAVAT team, but we are actively developing structural variant mapping and annotations. We'd appreciate if you might share other possible features that would interest you in addition to your comments in early 2022.

EugeneEA · 2024-02-26T06:44:24Z

Hi! Nothing beyond what was mentioned earlier so far, but in general, it would be super helpful if your SV support followed the same framework as the usual SNP/indel modules in terms of the possibility of adding custom annotators.

For example, we are analyzing a lot of samples with some SV detection tools, and currently I have to annotate each new sample with the SV frequency from an internal database using VEP + custom scripts. I'd love to switch to oc for both tasks. Therefore, for me a "VEP aggregation" column (or a set of columns which I can use as secondary input for such an annotator) and the possibility to add a custom annotator are a must.
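The kind of lookup I have in mind is sketched below, independent of any oc API. Matching SVs by reciprocal overlap with a 50% threshold is just one common convention that I am assuming here. The database records and frequencies are made-up placeholders, and all function names are hypothetical.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of overlap relative to the longer of two closed intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start) + 1
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start + 1), overlap / (b_end - b_start + 1))


def annotate_frequency(sv, database, min_overlap=0.5):
    """Return the highest frequency among matching database SVs, or None."""
    matches = [
        rec["freq"]
        for rec in database
        if rec["chrom"] == sv["chrom"]
        and rec["svtype"] == sv["svtype"]
        and reciprocal_overlap(sv["start"], sv["end"],
                               rec["start"], rec["end"]) >= min_overlap
    ]
    return max(matches, default=None)


# Placeholder internal database; values are invented for the example.
database = [
    {"chrom": "chr1", "start": 964900, "end": 965400, "svtype": "DEL", "freq": 0.12},
    {"chrom": "chr1", "start": 100, "end": 200, "svtype": "DEL", "freq": 0.50},
]
# Coordinates taken from the example line in the first comment.
sv = {"chrom": "chr1", "start": 964964, "end": 965330, "svtype": "DEL"}
print(annotate_frequency(sv, database))
```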

Best, Eugene

jasminebro assigned gsr9999 Feb 14, 2024
