Structural variant vcf annotation #97

Open
EugeneEA opened this issue Feb 16, 2022 · 14 comments

@EugeneEA

Hi, I've come across the problem that oc does not annotate SV VCFs. Are there plans to support SVs in the future, or is there a workaround?
The common line format:
chr1 964964 20 N <DEL> 137.6 . SVTYPE=DEL;SVLEN=-366;END=965330;STRANDS=+-:10;IMPRECISE;CIPOS=-30 ...etc
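
For reference, a minimal Python sketch of how a line like this could be decoded into the coordinates an SV annotator needs. The function name `parse_sv_line` and the output keys are hypothetical, and the input stops where the example in this thread is cut off.

```python
def parse_sv_line(line):
    """Split one whitespace-delimited VCF data line and decode its INFO column."""
    chrom, pos, _vid, ref, alt, _qual, _filter, info = line.split()[:8]
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Flags such as IMPRECISE carry no value, so store True for them.
        fields[key] = value if sep else True
    return {
        "chrom": chrom,
        "start": int(pos),
        # Fall back to POS when no END tag is present.
        "end": int(fields.get("END", pos)),
        "ref": ref,
        "alt": alt,
        "svtype": fields.get("SVTYPE"),
        "svlen": int(fields["SVLEN"]) if "SVLEN" in fields else None,
        "imprecise": "IMPRECISE" in fields,
    }


record = parse_sv_line(
    "chr1 964964 20 N <DEL> 137.6 . "
    "SVTYPE=DEL;SVLEN=-366;END=965330;STRANDS=+-:10;IMPRECISE;CIPOS=-30"
)
print(record)
```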

Best, Eugene

@cariaso

cariaso commented Feb 16, 2022 via email

@EugeneEA
Author

Yes, basically annotation with a genomic feature (gene, exon/intron, UTR, etc.) and the possible consequence for gene expression (e.g. whether a frameshift occurs in an exon, or an exon is deleted/duplicated/inverted, whatever).

VEP handles SV VCFs, so its annotation can be taken as an example.

@cariaso

cariaso commented Feb 16, 2022 via email

@EugeneEA
Author

Probably, I have not tried it

@rkimoakbioinformatics
Contributor

@EugeneEA Hi, yes, there is a plan to add support for SVs, CNVs, etc. in the future. It can be in this repo or a fork.

@EugeneEA
Author

@rkimoakbioinformatics Thanks for the answer, but as far as I understand, this is not for the near future but a plan for later development?

@rkimoakbioinformatics
Contributor

@EugeneEA I would like to start a discussion on it. Can you let me know what kind of output columns you would need? Something like the following?

+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chrom | start     | end       | ref | alt   | all_mappings                                                                |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+
| chr10 | 121593023 | 121603287 | N   | <DEL> | {"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]], |
|       |           |           |     |       |  "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}   |
| chr5  | 95849345  | 95853945  | N   | <DUP> | {"GENE3": [["P00003", "", "copy_number_gain", "ENST000009482.1", ""]]}      |
+-------+-----------+-----------+-----+-------+-----------------------------------------------------------------------------+

For imprecise structural variants, would you still want to see predicted protein sequence change?

@EugeneEA
Author

EugeneEA commented Mar 9, 2022

@rkimoakbioinformatics sorry for the long delay. Yes, that would definitely be sufficient for starters. The tricky part is probably the field "transcript_ablation" etc. Maybe an additional column should be added here, for example listing the deleted exons?

@rkimoakbioinformatics
Contributor

@EugeneEA Thanks. Below is a sketch. The current format of all_mappings is difficult to parse, sort, and filter. Thus, I suggest putting each transcript on a separate line, something like:

+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chrom | start    | end      | strand | ref     | alt      | imprecise | gene  | sequence ontology                           | transcript     | exon |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038273 | 2,3  |
| chr1  | 10394823 | 10404834 | +      | 10012nt | -        | imprecise | GENE2 | deletion,exon_loss_variant                  | ENST0000038284 | 2,3  |
| chr1  | 2784394  | 2984393  | +      | 20000nt | -        | imprecise | GENE3 | deletion,transcript_ablation                | ENST0000061234 |      |
| chr1  | 394823   | 399822   | +      | 5000nt  | 10000nt  | imprecise | GENE5 | duplication,partially_duplicated_transcript | ENST0000047283 | 4    |
| chr1  | 38584598 | 38683853 | +      | 99256nt | 198512nt | imprecise | GENE6 | duplication,transcript_amplification        | ENST0000038482 |      |
+-------+----------+----------+--------+---------+----------+-----------+-------+---------------------------------------------+----------------+------+
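
As an illustration only, the one-row-per-transcript layout could be derived from an all_mappings JSON string roughly as follows. This is a sketch, not OpenCRAVAT code: `flatten_mappings` is a hypothetical name, and the positions of the sequence ontology term (index 2) and transcript ID (index 3) inside each entry are inferred from the all_mappings example earlier in this thread.

```python
import json


def flatten_mappings(variant):
    """Return one row per (gene, transcript) pair found in all_mappings."""
    rows = []
    mappings = json.loads(variant["all_mappings"])
    for gene, entries in mappings.items():
        for entry in entries:
            # Assumed layout: [protein, ?, sequence ontology, transcript, ?]
            row = {k: v for k, v in variant.items() if k != "all_mappings"}
            row["gene"] = gene
            row["sequence_ontology"] = entry[2]
            row["transcript"] = entry[3]
            rows.append(row)
    return rows


variant = {
    "chrom": "chr10",
    "start": 121593023,
    "end": 121603287,
    "ref": "N",
    "alt": "<DEL>",
    "all_mappings": (
        '{"GENE1": [["P00001", "", "transcript_ablation", "ENST00000346997.6", ""]],'
        ' "GENE2": [["P00002", "", "transcript_ablation", "ENST000009385.1", ""]]}'
    ),
}
rows = flatten_mappings(variant)
for row in rows:
    print(row["gene"], row["transcript"], row["sequence_ontology"])
```

Rows produced this way can be sorted and filtered per transcript with ordinary table tools, which is the motivation given above.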

A VCF format specification document has a few structural variant examples:

#CHROM  POS   ID  REF ALT   QUAL  FILTER  INFO  FORMAT  NA00001
1 2827693   . CCGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA  C . PASS  SVTYPE=DEL;END=2827680;BKPTID=Pindel_LCS_D1099159;HOMLEN=1;HOMSEQ=C;SVLEN=-66 GT:GQ 1/1:13.9
2 321682    . T <DEL>   6 PASS    IMPRECISE;SVTYPE=DEL;END=321887;SVLEN=-105;CIPOS=-56,20;CIEND=-10,62  GT:GQ 0/1:12
2 14477084  . C <DEL:ME:ALU>  12  PASS  IMPRECISE;SVTYPE=DEL;END=14477381;SVLEN=-297;MEINFO=AluYa5,5,307,+;CIPOS=-22,18;CIEND=-12,32  GT:GQ 0/1:12
3 9425916   . C <INS:ME:L1> 23  PASS  IMPRECISE;SVTYPE=INS;END=9425916;SVLEN=6027;CIPOS=-16,22;MIINFO=L1HS,1,6025,- GT:GQ 1/1:15
3 12665100  . A <DUP>   14  PASS  IMPRECISE;SVTYPE=DUP;END=12686200;SVLEN=21100;CIPOS=-500,500;CIEND=-500,500   GT:GQ:CN:CNQ  ./.:0:3:16.2
4 18665128  . T <DUP:TANDEM>  11  PASS  IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10  GT:GQ:CN:CNQ  ./.:0:5:8.3

Turning this into something like:

+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chrom | start    | end      | strand | ref                                                                   | alt     | imprecise | gene   | sequence ontology              |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+
| chr1  | 2827693  | 2827693  | +      | CGTGGATGCGGGGACCCGCATCCCCTCTCCCTTCACAGCTGAGTGACCCACATCCCCTCTCCCCTCGCA | -       |           | GENE7  | deletion                       |
| chr2  | 321682   | 321887   | +      | 206nt                                                                 | -       | imprecise | GENE8  | deletion                       |
| chr2  | 14477084 | 14477085 | +      | -                                                                     | 297nt   | imprecise | GENE9  | insertion,Alu_insertion        |
| chr3  | 9425916  | 9425917  | +      | -                                                                     | 6027nt  | imprecise | GENE10 | insertion,LINE1_insertion      |
| chr3  | 12665101 | 12686200 | +      | 21100nt                                                               | 42200nt | imprecise | GENE11 | duplication                    |
| chr4  | 18665128 | 18665204 | +      | 77nt                                                                  | 154nt   | imprecise | GENE12 | duplication,tandem_duplication |
+-------+----------+----------+--------+-----------------------------------------------------------------------+---------+-----------+--------+--------------------------------+

Of course, INFO fields should be parsed and recorded in other columns.
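
A rough sketch of what recording INFO fields in their own columns might look like, together with a lookup from symbolic ALT alleles to the sequence ontology terms used in the table. The mapping `ALT_TO_SO`, the function `info_columns`, and the column names are all assumptions made for illustration; only alleles whose terms appear in this thread are listed.

```python
# Assumed mapping from symbolic ALT alleles to sequence ontology terms.
ALT_TO_SO = {
    "<DEL>": ["deletion"],
    "<DUP>": ["duplication"],
    "<DUP:TANDEM>": ["duplication", "tandem_duplication"],
    "<INS:ME:L1>": ["insertion", "LINE1_insertion"],
}


def info_columns(alt, info):
    """Turn a symbolic ALT allele and an INFO string into flat output columns."""
    fields = dict(
        item.split("=", 1) if "=" in item else (item, "")
        for item in info.split(";")
    )
    columns = {
        # Unlisted alleles get an empty term list rather than a guess.
        "sequence_ontology": ",".join(ALT_TO_SO.get(alt, [])),
        "imprecise": "imprecise" if "IMPRECISE" in fields else "",
    }
    # Confidence intervals are "low,high" offsets around POS and END.
    for key in ("CIPOS", "CIEND"):
        if key in fields:
            low, high = (int(x) for x in fields[key].split(","))
            columns[key.lower() + "_low"] = low
            columns[key.lower() + "_high"] = high
    return columns


cols = info_columns(
    "<DUP:TANDEM>",
    "IMPRECISE;SVTYPE=DUP;END=18665204;SVLEN=76;CIPOS=-10,10;CIEND=-10,10",
)
print(cols)
```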

Would something like the above work for your purposes? Any feedback/suggestion would be appreciated.

@EugeneEA
Author

@rkimoakbioinformatics thanks a lot for the replies! OK, that looks a bit too verbose. Can we select the major transcript as we do for SNPs?

Sequence ontology is an extremely useful field, but its aggregation in VEP (https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html) simplifies filtering quite a bit; maybe it is worth implementing (also for SNPs).

Would these variants be annotated if they are present in some of the annotators (ClinVar, for example)? (I know that these are basically indels, but it still might be useful.)

@rkimoakbioinformatics
Contributor

@EugeneEA Yes, if the variants are in ClinVar or in any other OpenCRAVAT annotator, they will be annotated. I am not sure yet how imprecise variants are treated in annotation data sources, but that will be the spirit.

As far as I know, VEP outputs sequence ontologies for each transcript on separate lines in its native output format, or on the same line delimited in the VCF format. I am not aware of aggregation by VEP - does it aggregate? If you mean, by aggregation, something like showing all sequence ontologies from all the variants for a transcript together, that has been planned but we haven't gotten to work on it yet.

By selecting a major transcript, do you mean the current OpenCRAVAT style of showing the mutation consequence on a representative transcript, either a MANE one or a custom choice for a gene, with the consequences on all the other transcripts where the variant falls shown in another column?

@EugeneEA
Author

@rkimoakbioinformatics
By "VEP aggregation" I meant that they summarise sequence ontology from 30+ fields into 4 (HIGH, LOW, MODERATE, MODIFIER) and provide it as an additional info column. It is a useful feature for medical people; even though it is trivial to implement, it might be worth including in the default output.
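
A minimal sketch of such an aggregation column, assuming a comma-separated sequence ontology string as in the tables above. The term-to-impact table here is a small subset written from memory of VEP's consequence table and should be checked against the VEP documentation; `aggregate_impact` is a hypothetical name, and treating unknown terms as MODIFIER is an assumption of this sketch.

```python
# Partial term-to-impact table (subset only; verify against VEP's documentation).
IMPACT_BY_TERM = {
    "transcript_ablation": "HIGH",
    "frameshift_variant": "HIGH",
    "stop_gained": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "splice_region_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}
# Impacts ordered from least to most severe.
IMPACT_ORDER = ["MODIFIER", "LOW", "MODERATE", "HIGH"]


def aggregate_impact(so_terms):
    """Return the most severe impact among comma-separated SO terms."""
    impacts = [
        IMPACT_BY_TERM.get(term, "MODIFIER")  # assumption: unknown -> MODIFIER
        for term in so_terms.split(",")
        if term
    ]
    return max(impacts, key=IMPACT_ORDER.index, default="MODIFIER")


print(aggregate_impact("intron_variant,missense_variant"))
print(aggregate_impact("transcript_ablation"))
```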

Yes, that is exactly what I meant: the consequence on either a custom-choice or MANE transcript, and the rest goes to another column.

@RachelKarchin
Contributor

Hi EugeneEA,

Just to catch you up, Rick Kim is no longer on the OpenCRAVAT team, but we are actively developing structural variant mapping and annotations. We'd appreciate it if you could share other possible features that would interest you, in addition to your comments in early 2022.

@EugeneEA
Author

Hi! Nothing beyond what was mentioned earlier so far, but in general, it would be super helpful if your SV support followed the same framework as the usual SNP/indel modules in terms of the possibility of adding custom annotators.

For example, we are analyzing a lot of samples with some SV detection tools, and currently I have to annotate each new sample with the SV frequency from an internal database using VEP + custom scripts. I'd love to switch to oc for both tasks. Therefore, for me a "VEP aggregation" column (or a set of columns which I can use as secondary input for such an annotator) and the possibility to add a custom annotator are a must.
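
To make the frequency use case concrete, here is a standalone sketch of the lookup such a custom annotator might perform. It does not use any OpenCRAVAT API. Matching by reciprocal overlap with a 50% threshold is an assumption introduced here, not something described in this thread, and the database entries are made-up placeholder values.

```python
def reciprocal_overlap(a_start, a_end, b_start, b_end):
    """Fraction of overlap relative to the longer of the two intervals."""
    overlap = min(a_end, b_end) - max(a_start, b_start)
    if overlap <= 0:
        return 0.0
    return min(overlap / (a_end - a_start), overlap / (b_end - b_start))


def lookup_frequency(sv, database, min_overlap=0.5):
    """Return the frequency of the best-matching database SV, or None."""
    best = None
    for entry in database:
        # Only compare SVs of the same type on the same chromosome.
        if entry["chrom"] != sv["chrom"] or entry["svtype"] != sv["svtype"]:
            continue
        score = reciprocal_overlap(
            sv["start"], sv["end"], entry["start"], entry["end"]
        )
        if score >= min_overlap and (best is None or score > best[0]):
            best = (score, entry["frequency"])
    return best[1] if best else None


# Hypothetical internal database entries (placeholder values).
database = [
    {"chrom": "chr1", "start": 1100, "end": 2100, "svtype": "DEL", "frequency": 0.02},
    {"chrom": "chr1", "start": 1000, "end": 2000, "svtype": "DUP", "frequency": 0.10},
]
sample_sv = {"chrom": "chr1", "start": 1000, "end": 2000, "svtype": "DEL"}
print(lookup_frequency(sample_sv, database))
```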

Best, Eugene
