Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generating multihetsep.txt files for whole genomes - multiple chromosomes #48

Open
killidude opened this issue Feb 6, 2023 · 1 comment

Comments

@killidude
Copy link

I am getting a consistent assertion error:

...
processing pos 115000000
processing pos 116000000
Traceback (most recent call last):
File "/home/tomas/bin/msmc-tools/generate_multihetsep.py", line 226, in
if mergedMask.getVal(snp_pos):
File "/home/tomas/bin/msmc-tools/generate_multihetsep.py", line 51, in getVal
return all((m.getVal(pos) for m in self.maskIterators))
File "/home/tomas/bin/msmc-tools/generate_multihetsep.py", line 51, in
return all((m.getVal(pos) for m in self.maskIterators))
File "/home/tomas/bin/msmc-tools/generate_multihetsep.py", line 35, in getVal
assert pos >= self.lastPos
AssertionError

call:
generate_multihetsep.py --mask=harpia_mask.bed.gz --negative_mask=harpia_mappability_mask.bed.gz harpia_pacbio_genotyped_SNP_final.vcf.gz > harpia.multihetsep.txt

My vcf and masks are of the complete genome, i.e. multiple chromosomes.

If I run the generate_multihetsep.py on the whole genome or just select a chromosome, I always get the same result. Last lines of the different multihetsep.txt files (upto abort):
beginning of the vcf/chromosome Z (length 117067010)
SUPER_Z 116781329 76024 AG
SUPER_Z 116791378 10049 TC
SUPER_Z 116900943 94985 AG
chromosome W (length 37400844) - selected as --chr SUPER_W
SUPER_W 116781329 76024 AG
SUPER_W 116791378 10049 TC
SUPER_W 116900943 94985 AG
chromosome 28_unloc_1(length 32972) - selected as --chr SUPER_28_unloc_1
SUPER_28_unloc_1 116781329 76024 AG
SUPER_28_unloc_1 116791378 10049 TC
SUPER_28_unloc_1 116900943 94985 AG

The outputs are identical. generate_multihetsep.py always starts at the beginning of the vcf (SUPER_Z) and continues until it reads through the first chromosome (SUPER_Z) and then crashes.

My vcf was generated using gatk, and filtered using bcftools (standard hard filters, and to include biallelic SNPs only). There are no issues analysing or manipulating this vcf in other programs. The last three SNPs of the SUPER_Z chromosome, and first two SNPs of the SUPER_W chromosome
SUPER_Z 116781329 . A G 70.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0;DP=4;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=17.66;ReadPosRankSum=0.674;SOR=2.303 GT:AD:DP:GQ:PL 0|1:2,2:4:78:78,0,78
SUPER_Z 116791378 . T C 70.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.967;DP=3;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=23.55;ReadPosRankSum=0.967;SOR=0.223 GT:AD:DP:GQ:PGT:PID:PL:PS 0|1:1,2:3:36:0|1:116791365_CA_C:78,0,36:116791365
SUPER_Z 116900943 . A G 73.64 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-0.431;DP=3;ExcessHet=0;FS=0;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=24.55;ReadPosRankSum=-0.967;SOR=0.223 GT:AD:DP:GQ:PL 0|1:1,2:3:36:81,0,36
SUPER_W 270520 . A T 179.96 PASS AC=2;AF=1;AN=2;DP=5;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=50.92;QD=26.52;SOR=1.022 GT:AD:DP:GQ:PL 1|1:0,5:5:15:194,15,0
SUPER_W 270593 . T C 210.96 PASS AC=2;AF=1;AN=2;DP=5;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=50.92;QD=30.56;SOR=1.022 GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,5:5:15:1|1:270593_T_C:225,15,0:270593
SUPER_W 270597 . G C 210.96 PASS AC=2;AF=1;AN=2;DP=5;ExcessHet=0;FS=0;MLEAC=2;MLEAF=1;MQ=50.92;QD=30.23;SOR=1.022 GT:AD:DP:GQ:PGT:PID:PL:PS 1|1:0,5:5:15:1|1:270593_T_C:225,15,0:270593

My mask was generated using bedtools bam2bed.
SUPER_Z 1661 15824 m54306U_210130_202436/90440587/ccs 60 -
SUPER_Z 2021 21862 m54306U_210130_202436/983948/ccs 60 -
SUPER_Z 7945 15108 m54306U_210130_202436/33752681/ccs 60 -

My mappability mask was generated using the steps in https://lh3lh3.users.sourceforge.net/snpable.shtml, and corresponding bed file extracted using a python script.
SUPER_Z 84 270
SUPER_Z 287 1210
SUPER_Z 1234 1323

Based on this, am I to assume that the generate_multihetsep.py was written to process only one chromosome at a time? Or is this a bug in generate_multihetsep.py?

@stschiff
Copy link
Owner

You have to split the input by chromosome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants