Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

gtollefson · 2023-03-22T15:25:27Z

Hi,

I've used SeqArray::seqVCF2GDS to convert a haploid species VCF v4.2 file to gds format which has variants with multiple alternative alleles due to the nature of our sequencing which involves sequencing combined mixed clonal samples. We require storing any number of alleles per variant position (2+) even though our reference genome is a haploid genome (showing more than 2 alleles is useful to us because it indicates mixed clones of one species in one sample).

I get an error when running various SeqArray functions and I'm wondering if the root cause is that the Ploidy is 1 but there are variants in the GDS file with more than 2 alternative alleles.

When I run seqSummary(gds_filename) I get the following output including a warning message which I believe explains why many of the SeqArray functions don't work on our data.

File: /Users/george/Bailey_lab/miplicorn_dev_materials/variants.gds
Format Version: v1.0
Reference: /opt/species_resources/genomes/genome.fa
Invalid variant dimension in 'genotype/data'.
Ploidy: 1
Number of samples: 51
Number of variants: 49,763
Chromosomes:
chr1 : 5111 , chr4 : 3104 , chr5 : 5421 , chr7 : 4682 , chr8 : 3480 , chr10: 5430
chr13: 8170 , chr14: 11724, chrM : 2641
Contigs:
chr5, 1343557
chr10, 1687656
chr7, 1445207
chr3, 1067971
chr13, 2925236
chr11, 2038340
chr14, 3291936
chr9, 1541735
chr1, 640851
chr12, 2271494
chr8, 1472805
chr6, 1418242
chr4, 1200490
chr2, 947102
chrM, 5967
chrP, 34242
Alleles:
ALT:
tabulation: 2, 49745(100.0%); 3, 11(0.0%); 4, 4(0.0%); 6, 1(0.0%); 7, 1(0.0%); 8, 1(0.0%)
Annotation, Quality:
Min: 0, 1st Qu: 0, Median: 0, Mean: 8.03084504916388, 3rd Qu: 0, Max: 90224.8984375
Annotation, FILTER:
PASS, All filters passed, 0(0.0%)

Running seqGetData(gdsfile, "genotype") outputs:

Error in seqGetData(gdsfile, "genotype") :
Invalid dimension of 'genotype/data'.

Running seqGetData(gdsfile, "annotation/format/DS")

Error in seqGetData(gdsfile, "annotation/format/DS") :
The GDS node "annotation/format/DS/data" does not exist.

One more thing I noticed from the summary output is that under Alleles:

ALT: <None>
tabulation: 2, 49745(100.0%); 3, 11(0.0%); 4, 4(0.0%); 6, 1(0.0%); 7, 1(0.0%); 8, 1(0.0%)

Why are there no ALT alleles recorded?
and for tabulation, why do the alleles with 3, 4, 6, 7, 8, 1 show 0.0%?

Could you help me to identify the cause of these errors and help develop a fix? Do you think the GDS summary: Ploidy=1 while there are greater than 2 alleles is the issue? We would like to implement SeqArray into our R tool suite for molecular inversion probe analysis of mixed polyclonal parasite samples and hope we can get it working.

Thank you very much for your time and help,
George

The text was updated successfully, but these errors were encountered:

zhengxwen · 2023-03-22T18:00:14Z

Could you please show me the structure of your gds file? like,

f <- seqOpen(gds_filename)
f

gtollefson · 2023-03-24T16:34:20Z

Hi @zhengxwen,

Sorry for my delay - I needed to update my email alerts to be sent to my new institutional email address. I should be responsive more quickly now.

Here is the output from gdsfile <- seqOpen(gds_filename)

gdsfile
Object of class "SeqVarGDSClass"
File: /Users/george/Bailey_lab/miplicorn_dev_materials/variants.gds (1.1M)

[ ] *
|--+ description [ ] *
|--+ sample.id { Str8 51 LZMA_ra(36.5%), 209B } *
|--+ variant.id { Int32 49763 LZMA_ra(7.10%), 13.8K } *
|--+ position { Int32 49763 LZMA_ra(6.89%), 13.4K } *
|--+ chromosome { Str8 49763 LZMA_ra(0.16%), 213B } *
|--+ allele { Str8 49763 LZMA_ra(5.29%), 15.5K } *
|--+ genotype [ ] *
| |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } *
| |--+ extra.index { Int32 3x4482 LZMA_ra(4.08%), 2.1K } *
| --+ extra { Int16 4482 LZMA_ra(7.92%), 717B }
|--+ phase [ ]
|--+ annotation [ ]
| |--+ id { Str8 49763 LZMA_ra(0.30%), 157B } *
| |--+ qual { Float32 49763 LZMA_ra(0.45%), 909B } *
| |--+ filter { Int32,factor 49763 LZMA_ra(0.09%), 189B } *
| |--+ info [ ]
| | |--+ NS { Int32 49763 LZMA_ra(0.28%), 573B } *
| | |--+ DP { Int32 49763 LZMA_ra(0.79%), 1.5K } *
| | |--+ DPB { Float32 49763 LZMA_ra(0.35%), 705B } *
| | |--+ AC { Int32 159 LZMA_ra(40.6%), 265B } *
| | |--+ AN { Int32 49763 LZMA_ra(0.29%), 589B } *
| | |--+ AF { Float32 159 LZMA_ra(75.8%), 489B } *
| | |--+ RO { Int32 49763 LZMA_ra(0.37%), 745B } *
| | |--+ AO { Int32 159 LZMA_ra(49.4%), 321B } *
| | |--+ PRO { Float32 49763 LZMA_ra(0.25%), 513B } *
| | |--+ PAO { Float32 159 LZMA_ra(14.2%), 97B } *
| | |--+ QR { Float32 49763 LZMA_ra(0.43%), 869B } *
| | |--+ QA { Float32 159 LZMA_ra(71.4%), 461B } *
| | |--+ PQR { Float32 49763 LZMA_ra(0.25%), 513B } *
| | |--+ PQA { Float32 159 LZMA_ra(14.2%), 97B } *
| | |--+ SRF { Int32 49763 LZMA_ra(0.31%), 621B } *
| | |--+ SRR { Int32 49763 LZMA_ra(0.34%), 685B } *
| | |--+ SAF { Int32 159 LZMA_ra(32.4%), 213B } *
| | |--+ SAR { Int32 159 LZMA_ra(42.5%), 277B } *
| | |--+ SRP { Float32 49763 LZMA_ra(0.46%), 913B } *
| | |--+ SAP { Float32 159 LZMA_ra(79.6%), 513B } *
| | |--+ AB { Float32 159 LZMA_ra(84.0%), 541B } *
| | |--+ ABP { Float32 159 LZMA_ra(89.0%), 573B } *
| | |--+ RUN { Int32 159 LZMA_ra(14.2%), 97B } *
| | |--+ RPP { Float32 159 LZMA_ra(79.6%), 513B } *
| | |--+ RPPR { Float32 49763 LZMA_ra(0.46%), 917B } *
| | |--+ RPL { Float32 159 LZMA_ra(41.8%), 273B } *
| | |--+ RPR { Float32 159 LZMA_ra(41.8%), 273B } *
| | |--+ EPP { Float32 159 LZMA_ra(78.3%), 505B } *
| | |--+ EPPR { Float32 49763 LZMA_ra(0.46%), 917B } *
| | |--+ DPRA { Float32 159 LZMA_ra(94.7%), 609B } *
| | |--+ ODDS { Float32 49763 LZMA_ra(0.47%), 937B } *
| | |--+ GTI { Int32 49763 LZMA_ra(0.30%), 605B } *
| | |--+ TYPE { Str8 159 LZMA_ra(22.8%), 161B } *
| | |--+ CIGAR { Str8 159 LZMA_ra(43.1%), 317B } *
| | |--+ NUMALT { Int32 49763 LZMA_ra(0.27%), 549B } *
| | |--+ MEANALT { Float32 159 LZMA_ra(50.0%), 325B } *
| | |--+ LEN { Int32 159 LZMA_ra(24.2%), 161B } *
| | |--+ MQM { Float32 159 LZMA_ra(18.6%), 125B } *
| | |--+ MQMR { Float32 49763 LZMA_ra(0.26%), 525B } *
| | |--+ PAIRED { Float32 159 LZMA_ra(14.2%), 97B } *
| | |--+ PAIREDR { Float32 49763 LZMA_ra(0.25%), 513B } *
| | |--+ MIN_DP { Int32 49763 LZMA_ra(0.26%), 521B } *
| | |--+ END { Int32 49763 LZMA_ra(7.11%), 13.8K } *
| | --+ technology.ILLUMINA { Float32 159 LZMA_ra(18.6%), 125B } *
| --+ format [ ]
| |--+ GQ [ ] *
| | --+ data { Float32 51x49763 LZMA_ra(4.42%), 437.9K } *
| |--+ GL [ ] *
| | --+ data { Float32 51x512 LZMA_ra(31.1%), 31.7K } *
| |--+ DP [ ] *
| | --+ data { VL_Int 51x49763 LZMA_ra(0.29%), 7.9K } *
| |--+ AD [ ] *
| | --+ data { VL_Int 51x284 LZMA_ra(19.7%), 6.3K } *
| |--+ RO [ ] *
| | --+ data { VL_Int 51x49763 LZMA_ra(0.45%), 12.1K } *
| |--+ QR [ ] *
| | --+ data { Float32 51x49763 LZMA_ra(4.39%), 435.4K } *
| |--+ AO [ ] *
| | --+ data { VL_Int 51x49797 LZMA_ra(0.27%), 7.0K } *
| |--+ QA [ ] *
| | --+ data { Float32 51x49797 LZMA_ra(0.12%), 11.6K } *
| --+ MIN_DP [ ] *
| --+ data { VL_Int 51x49638 LZMA_ra(0.24%), 6.4K } *
--+ sample.annotation [ ]

gtollefson · 2023-03-27T22:28:26Z

@zhengxwen just following up on this issue. I’ve tried examining the source code but haven’t been able to identify the root of the error.

zhengxwen · 2023-03-27T23:38:45Z

See:

| |--+ data { Bit2 1x51x132 LZMA_ra(49.1%), 833B } *

It is expected to be Bit2 1x51x49763 (or at least 49763). Here 1 is the ploidy, 51 is # of samples, and 49763 is # of variants (if more than 3 alleles at a site, this number could be larger than 49763).
There are somethings wrong when you convert VCF=>GDS (via seqVCF2GDS()).
I might suggest checking the VCF file, whether it includes the field GT correctly.
For haploid, the number of alleles can be more than 1.

gtollefson · 2023-03-27T23:46:38Z

Ok I see, thank you very much! I think I see the reason. Our analysis pipeline outputs a gVCF with every genomic position from our molecular inversion probe target region included in the VCF record lines even if there is no variant at each position. In our VCF, at positions without variants, there is no GT field. Each record row which has a variant allele at that position appears to have a properly formatted GT field. Only a small subset of the VCF records have a variant allele and the necessary GT field. Have you come across this before? Is there a way to specify in the seqVCF2GDS() function call to only include VCF lines which have the GT command?

If not, perhaps we can get around this by including a simple grep script in the pipeline to remove VCF lines without variants (those lines without the GT field).

Here are two screenshots illustrating the different formatting between variant and non-variant record lines:

zhengxwen · 2023-03-28T04:05:47Z

Removing VCF lines without variants (those lines without the GT field) would be a good way before importing to GDS.
Another possible issue is the ploidy.
From your data, it seems diploidy instead of haploid.

GT: GQ: DP: AD: RO: QR: AO: QA:GL 0/0:127.042:454:454,0:454:15248:0:0:0,-136.668,-1372.15

So I will expect a "corrected" VCF should give you a diploidy GDS file.

gtollefson · 2023-03-28T18:20:28Z

Ok I will proceed with removing the VCF lines without variants. It sounds like SeqArray does not support gVCF to GDS conversion. Is this correct?

In other words, SeqArray exclusively supports conversion of regular VCF files and not gVCF files which contain the homozygous reference blocks lacking GT calls which are added by various variant callers (GATK, Freebayes, etc.) during gVCF generation (which contain a line for every position in the genome including numerous non-variant sites)?

I think we have a fix to adjust the ploidy in the genotype calls - thank you for pointing that out.

zhengxwen · 2023-03-28T20:21:37Z

Not sure how you define gVCF, but the gVCF files I received and processed did have GT for every position (even for the site without variant). Another option might be you add the GT field to every site in your pipeline.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

gtollefson commented Mar 22, 2023 •

edited

zhengxwen commented Mar 22, 2023

gtollefson commented Mar 24, 2023 •

edited

gtollefson commented Mar 27, 2023

zhengxwen commented Mar 27, 2023 •

edited

gtollefson commented Mar 27, 2023 •

edited

zhengxwen commented Mar 28, 2023

gtollefson commented Mar 28, 2023 •

edited

zhengxwen commented Mar 28, 2023

Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

Invalid variant dimension in 'genotype/data' and Ploidy=1 with more than 2 alleles per variant #85

Comments

gtollefson commented Mar 22, 2023 • edited

zhengxwen commented Mar 22, 2023

gtollefson commented Mar 24, 2023 • edited

gtollefson commented Mar 27, 2023

zhengxwen commented Mar 27, 2023 • edited

gtollefson commented Mar 27, 2023 • edited

zhengxwen commented Mar 28, 2023

gtollefson commented Mar 28, 2023 • edited

zhengxwen commented Mar 28, 2023

gtollefson commented Mar 22, 2023 •

edited

gtollefson commented Mar 24, 2023 •

edited

zhengxwen commented Mar 27, 2023 •

edited

gtollefson commented Mar 27, 2023 •

edited

gtollefson commented Mar 28, 2023 •

edited