Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

Open
gtollefson opened this issue Mar 22, 2023 · 8 comments

Comments

@gtollefson
Copy link

gtollefson commented Mar 22, 2023

Hi,

I've used SeqArray::seqVCF2GDS to convert a haploid species VCF v4.2 file to gds format which has variants with multiple alternative alleles due to the nature of our sequencing which involves sequencing combined mixed clonal samples. We require storing any number of alleles per variant position (2+) even though our reference genome is a haploid genome (showing more than 2 alleles is useful to us because it indicates mixed clones of one species in one sample).

I get an error when running various SeqArray functions and I'm wondering if the root cause is that the Ploidy is 1 but there are variants in the GDS file with more than 2 alternative alleles.

When I run seqSummary(gds_filename) I get the following output including a warning message which I believe explains why many of the SeqArray functions don't work on our data.

File: /Users/george/Bailey_lab/miplicorn_dev_materials/variants.gds
Format Version: v1.0
Reference: /opt/species_resources/genomes/genome.fa
Invalid variant dimension in 'genotype/data'.
Ploidy: 1
Number of samples: 51
Number of variants: 49,763
Chromosomes:
chr1 : 5111 , chr4 : 3104 , chr5 : 5421 , chr7 : 4682 , chr8 : 3480 , chr10: 5430
chr13: 8170 , chr14: 11724, chrM : 2641
Contigs:
chr5, 1343557
chr10, 1687656
chr7, 1445207
chr3, 1067971
chr13, 2925236
chr11, 2038340
chr14, 3291936
chr9, 1541735
chr1, 640851
chr12, 2271494
chr8, 1472805
chr6, 1418242
chr4, 1200490
chr2, 947102
chrM, 5967
chrP, 34242
Alleles:
ALT:
tabulation: 2, 49745(100.0%); 3, 11(0.0%); 4, 4(0.0%); 6, 1(0.0%); 7, 1(0.0%); 8, 1(0.0%)
Annotation, Quality:
Min: 0, 1st Qu: 0, Median: 0, Mean: 8.03084504916388, 3rd Qu: 0, Max: 90224.8984375
Annotation, FILTER:
PASS, All filters passed, 0(0.0%)

Running seqGetData(gdsfile, "genotype") outputs:

Error in seqGetData(gdsfile, "genotype") :
Invalid dimension of 'genotype/data'.

Running seqGetData(gdsfile, "annotation/format/DS")

Error in seqGetData(gdsfile, "annotation/format/DS") :
The GDS node "annotation/format/DS/data" does not exist.

One more thing I noticed from the summary output is that under Alleles:

ALT: <None>
tabulation: 2, 49745(100.0%); 3, 11(0.0%); 4, 4(0.0%); 6, 1(0.0%); 7, 1(0.0%); 8, 1(0.0%)

Why are there no ALT alleles recorded?
and for tabulation, why do the alleles with 3, 4, 6, 7, 8, 1 show 0.0%?

Could you help me to identify the cause of these errors and help develop a fix? Do you think the GDS summary: Ploidy=1 while there are greater than 2 alleles is the issue? We would like to implement SeqArray into our R tool suite for molecular inversion probe analysis of mixed polyclonal parasite samples and hope we can get it working.

Thank you very much for your time and help,
George

@zhengxwen
Copy link
Owner

Could you please show me the structure of your gds file? like,

f <- seqOpen(gds_filename)
f

@gtollefson
Copy link
Author

gtollefson commented Mar 24, 2023

Hi @zhengxwen,

Sorry for my delay - I needed to update my email alerts to be sent to my new institutional email address. I should be responsive more quickly now.

Here is the output from gdsfile <- seqOpen(gds_filename)

gdsfile
Object of class "SeqVarGDSClass"
File: /Users/george/Bailey_lab/miplicorn_dev_materials/variants.gds (1.1M)

  • [ ] *
    |--+ description [ ] *
    |--+ sample.id { Str8 51 LZMA_ra(36.5%), 209B } *
    |--+ variant.id { Int32 49763 LZMA_ra(7.10%), 13.8K } *
    |--+ position { Int32 49763 LZMA_ra(6.89%), 13.4K } *
    |--+ chromosome { Str8 49763 LZMA_ra(0.16%), 213B } *
    |--+ allele { Str8 49763 LZMA_ra(5.29%), 15.5K } *
    |--+ genotype [ ] *
    | |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } *
    | |--+ extra.index { Int32 3x4482 LZMA_ra(4.08%), 2.1K } *
    | --+ extra { Int16 4482 LZMA_ra(7.92%), 717B }
    |--+ phase [ ]
    |--+ annotation [ ]
    | |--+ id { Str8 49763 LZMA_ra(0.30%), 157B } *
    | |--+ qual { Float32 49763 LZMA_ra(0.45%), 909B } *
    | |--+ filter { Int32,factor 49763 LZMA_ra(0.09%), 189B } *
    | |--+ info [ ]
    | | |--+ NS { Int32 49763 LZMA_ra(0.28%), 573B } *
    | | |--+ DP { Int32 49763 LZMA_ra(0.79%), 1.5K } *
    | | |--+ DPB { Float32 49763 LZMA_ra(0.35%), 705B } *
    | | |--+ AC { Int32 159 LZMA_ra(40.6%), 265B } *
    | | |--+ AN { Int32 49763 LZMA_ra(0.29%), 589B } *
    | | |--+ AF { Float32 159 LZMA_ra(75.8%), 489B } *
    | | |--+ RO { Int32 49763 LZMA_ra(0.37%), 745B } *
    | | |--+ AO { Int32 159 LZMA_ra(49.4%), 321B } *
    | | |--+ PRO { Float32 49763 LZMA_ra(0.25%), 513B } *
    | | |--+ PAO { Float32 159 LZMA_ra(14.2%), 97B } *
    | | |--+ QR { Float32 49763 LZMA_ra(0.43%), 869B } *
    | | |--+ QA { Float32 159 LZMA_ra(71.4%), 461B } *
    | | |--+ PQR { Float32 49763 LZMA_ra(0.25%), 513B } *
    | | |--+ PQA { Float32 159 LZMA_ra(14.2%), 97B } *
    | | |--+ SRF { Int32 49763 LZMA_ra(0.31%), 621B } *
    | | |--+ SRR { Int32 49763 LZMA_ra(0.34%), 685B } *
    | | |--+ SAF { Int32 159 LZMA_ra(32.4%), 213B } *
    | | |--+ SAR { Int32 159 LZMA_ra(42.5%), 277B } *
    | | |--+ SRP { Float32 49763 LZMA_ra(0.46%), 913B } *
    | | |--+ SAP { Float32 159 LZMA_ra(79.6%), 513B } *
    | | |--+ AB { Float32 159 LZMA_ra(84.0%), 541B } *
    | | |--+ ABP { Float32 159 LZMA_ra(89.0%), 573B } *
    | | |--+ RUN { Int32 159 LZMA_ra(14.2%), 97B } *
    | | |--+ RPP { Float32 159 LZMA_ra(79.6%), 513B } *
    | | |--+ RPPR { Float32 49763 LZMA_ra(0.46%), 917B } *
    | | |--+ RPL { Float32 159 LZMA_ra(41.8%), 273B } *
    | | |--+ RPR { Float32 159 LZMA_ra(41.8%), 273B } *
    | | |--+ EPP { Float32 159 LZMA_ra(78.3%), 505B } *
    | | |--+ EPPR { Float32 49763 LZMA_ra(0.46%), 917B } *
    | | |--+ DPRA { Float32 159 LZMA_ra(94.7%), 609B } *
    | | |--+ ODDS { Float32 49763 LZMA_ra(0.47%), 937B } *
    | | |--+ GTI { Int32 49763 LZMA_ra(0.30%), 605B } *
    | | |--+ TYPE { Str8 159 LZMA_ra(22.8%), 161B } *
    | | |--+ CIGAR { Str8 159 LZMA_ra(43.1%), 317B } *
    | | |--+ NUMALT { Int32 49763 LZMA_ra(0.27%), 549B } *
    | | |--+ MEANALT { Float32 159 LZMA_ra(50.0%), 325B } *
    | | |--+ LEN { Int32 159 LZMA_ra(24.2%), 161B } *
    | | |--+ MQM { Float32 159 LZMA_ra(18.6%), 125B } *
    | | |--+ MQMR { Float32 49763 LZMA_ra(0.26%), 525B } *
    | | |--+ PAIRED { Float32 159 LZMA_ra(14.2%), 97B } *
    | | |--+ PAIREDR { Float32 49763 LZMA_ra(0.25%), 513B } *
    | | |--+ MIN_DP { Int32 49763 LZMA_ra(0.26%), 521B } *
    | | |--+ END { Int32 49763 LZMA_ra(7.11%), 13.8K } *
    | | --+ technology.ILLUMINA { Float32 159 LZMA_ra(18.6%), 125B } *
    | --+ format [ ]
    | |--+ GQ [ ] *
    | | --+ data { Float32 51x49763 LZMA_ra(4.42%), 437.9K } *
    | |--+ GL [ ] *
    | | --+ data { Float32 51x512 LZMA_ra(31.1%), 31.7K } *
    | |--+ DP [ ] *
    | | --+ data { VL_Int 51x49763 LZMA_ra(0.29%), 7.9K } *
    | |--+ AD [ ] *
    | | --+ data { VL_Int 51x284 LZMA_ra(19.7%), 6.3K } *
    | |--+ RO [ ] *
    | | --+ data { VL_Int 51x49763 LZMA_ra(0.45%), 12.1K } *
    | |--+ QR [ ] *
    | | --+ data { Float32 51x49763 LZMA_ra(4.39%), 435.4K } *
    | |--+ AO [ ] *
    | | --+ data { VL_Int 51x49797 LZMA_ra(0.27%), 7.0K } *
    | |--+ QA [ ] *
    | | --+ data { Float32 51x49797 LZMA_ra(0.12%), 11.6K } *
    | --+ MIN_DP [ ] *
    | --+ data { VL_Int 51x49638 LZMA_ra(0.24%), 6.4K } *
    --+ sample.annotation [ ]

@gtollefson
Copy link
Author

@zhengxwen just following up on this issue. I’ve tried examining the source code but haven’t been able to identify the root of the error.

@zhengxwen
Copy link
Owner

zhengxwen commented Mar 27, 2023

See:

| |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } *

It is expected to be Bit2 1x51x49763 (or at least 49763). Here 1 is the ploidy, 51 is # of samples, and 49763 is # of variants (if more than 3 alleles at a site, this number could be larger than 49763).
There are somethings wrong when you convert VCF=>GDS (via seqVCF2GDS()).
I might suggest checking the VCF file, whether it includes the field GT correctly.
For haploid, the number of alleles can be more than 1.

@gtollefson
Copy link
Author

gtollefson commented Mar 27, 2023

Ok I see, thank you very much! I think I see the reason. Our analysis pipeline outputs a gVCF with every genomic position from our molecular inversion probe target region included in the VCF record lines even if there is no variant at each position. In our VCF, at positions without variants, there is no GT field. Each record row which has a variant allele at that position appears to have a properly formatted GT field. Only a small subset of the VCF records have a variant allele and the necessary GT field. Have you come across this before? Is there a way to specify in the seqVCF2GDS() function call to only include VCF lines which have the GT command?

If not, perhaps we can get around this by including a simple grep script in the pipeline to remove VCF lines without variants (those lines without the GT field).

Here are two screenshots illustrating the different formatting between variant and non-variant record lines:

seqarray_issue_screenshot

seqarray_issue_screenshot_2

@zhengxwen
Copy link
Owner

Removing VCF lines without variants (those lines without the GT field) would be a good way before importing to GDS.
Another possible issue is the ploidy.
From your data, it seems diploidy instead of haploid.

GT: GQ: DP: AD: RO: QR: AO: QA:GL 0/0:127.042:454:454,0:454:15248:0:0:0,-136.668,-1372.15

So I will expect a "corrected" VCF should give you a diploidy GDS file.

@gtollefson
Copy link
Author

gtollefson commented Mar 28, 2023

Ok I will proceed with removing the VCF lines without variants. It sounds like SeqArray does not support gVCF to GDS conversion. Is this correct?

In other words, SeqArray exclusively supports conversion of regular VCF files and not gVCF files which contain the homozygous reference blocks lacking GT calls which are added by various variant callers (GATK, Freebayes, etc.) during gVCF generation (which contain a line for every position in the genome including numerous non-variant sites)?

I think we have a fix to adjust the ploidy in the genotype calls - thank you for pointing that out.

@zhengxwen
Copy link
Owner

Not sure how you define gVCF, but the gVCF files I received and processed did have GT for every position (even for the site without variant). Another option might be you add the GT field to every site in your pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants