You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Hello, I'm trying to calculate theta values using sliding window means and there is an issue with the .thetas.idx file from realSFS saf2theta. All SNPs have theta values calculated, but for some reason most, but not all, of the SNPs for each scaffold are out of order, causing the firstpos and lastpos values to be incorrect. I'm using the bioconda install of version 0.940 and my VCF is made up of hard calls generated using GATK4. Here's the code I used to generate the theta output files:
# edit vcf header for ANGSD compatability
INFO='##INFO=<ID=INDEL,Number=0,Type=Flag,Description=\"\">'
bcftools view in.vcf.gz | awk -v infohdr=$INFO '/^#CHROM/ {print infohdr"\n"$0} !/^#CHROM/' | bcftools view -O z -o in_edit.vcf.gz
# estimate global SFS
angsd -vcf-pl in_edit.vcf.gz -doSaf 1 -anc Myoluc_HiC.fa -P 24 -out out
realSFS out.saf.idx -fold 1 -P 24 > out.sfs # calculate folded SFS due to unknown ancestral states
# calculate thetas
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
thetaStat do_stat out.thetas.idx -win 5000 -step 1000 # calculate using sliding window
This is what the out.thetas.idx scaffold information looks like:
Due to the SNPs being out of order, when I try to run thetaStat do_stat out.thetas.idx -win 5000 -step 1000 I get:
Information from index file:
0 SUPER__1 497250 8 60
1 SUPER__10 215800 356975365972 60
2 SUPER__11 200989 515562415793 60
3 SUPER__12 200814 660925188998 60
4 SUPER__13 175691 803695353587 60
5 SUPER__14 112942 932352390936 60
6 SUPER__15 173623 1013386837125 60
7 SUPER__16 165080 1139937497546 60
8 SUPER__17 117020 1260393693991 60
9 SUPER__18 164923 1345374816484 60
10 SUPER__19 112220 1466220096393 60
11 SUPER__2 465503 1546709593414 60
12 SUPER__20 79091 1882840797514 60
13 SUPER__21 23927 1940574741807 60
14 SUPER__22 50970 1958908293492 60
15 SUPER__3 459592 1997193424641 60
16 SUPER__5 234115 2327556531933 60
17 SUPER__6 281994 2496506902849 60
18 SUPER__7 247518 2698132841805 60
19 SUPER__8 243936 2876606637369 60
20 SUPER__9 246447 3052961043541 60
-r=(null) outnames=(null) step: 1000 win: 5000
pc.chr=SUPER__1 pc.nSites=497250 firstpos=126323 lastpos=241748174
pc.chr=SUPER__10 pc.nSites=215800 firstpos=42544 lastpos=83096058
pc.chr=SUPER__11 pc.nSites=200989 firstpos=83096059 lastpos=83370225
pc.chr=SUPER__12 pc.nSites=200814 firstpos=83370911 lastpos=81489261
end of dataset is before end of window: end of window:83376000 last position in chr:81489261
pc.chr=SUPER__13 pc.nSites=175691 firstpos=81489263 lastpos=69988656
end of dataset is before end of window: end of window:81495000 last position in chr:69988656
pc.chr=SUPER__14 pc.nSites=112942 firstpos=69988867 lastpos=58769242
end of dataset is before end of window: end of window:69994000 last position in chr:58769242
pc.chr=SUPER__15 pc.nSites=173623 firstpos=58769290 lastpos=59753403
pc.chr=SUPER__16 pc.nSites=165080 firstpos=59753675 lastpos=55501109
end of dataset is before end of window: end of window:59759000 last position in chr:55501109
pc.chr=SUPER__17 pc.nSites=117020 firstpos=55501197 lastpos=54538609
end of dataset is before end of window: end of window:55507000 last position in chr:54538609
pc.chr=SUPER__18 pc.nSites=164923 firstpos=54538618 lastpos=51625315
end of dataset is before end of window: end of window:54544000 last position in chr:51625315
pc.chr=SUPER__19 pc.nSites=112220 firstpos=51625766 lastpos=42335022
end of dataset is before end of window: end of window:51631000 last position in chr:42335022
pc.chr=SUPER__2 pc.nSites=465503 firstpos=241748331 lastpos=212276913
end of dataset is before end of window: end of window:241754000 last position in chr:212276913
pc.chr=SUPER__20 pc.nSites=79091 firstpos=42335077 lastpos=28301775
end of dataset is before end of window: end of window:42341000 last position in chr:28301775
pc.chr=SUPER__21 pc.nSites=23927 firstpos=28302174 lastpos=24677751
end of dataset is before end of window: end of window:28308000 last position in chr:24677751
pc.chr=SUPER__22 pc.nSites=50970 firstpos=24677758 lastpos=17701304
end of dataset is before end of window: end of window:24683000 last position in chr:17701304
pc.chr=SUPER__3 pc.nSites=459592 firstpos=212277246 lastpos=211539269
end of dataset is before end of window: end of window:212283000 last position in chr:211539269
pc.chr=SUPER__5 pc.nSites=234115 firstpos=211540747 lastpos=113635203
end of dataset is before end of window: end of window:211546000 last position in chr:113635203
pc.chr=SUPER__6 pc.nSites=281994 firstpos=113635310 lastpos=111356735
end of dataset is before end of window: end of window:113641000 last position in chr:111356735
pc.chr=SUPER__7 pc.nSites=247518 firstpos=111356987 lastpos=96758326
end of dataset is before end of window: end of window:111362000 last position in chr:96758326
pc.chr=SUPER__8 pc.nSites=243936 firstpos=96768413 lastpos=91674074
end of dataset is before end of window: end of window:96774000 last position in chr:91674074
pc.chr=SUPER__9 pc.nSites=246447 firstpos=92442959 lastpos=91687292
end of dataset is before end of window: end of window:92448000 last position in chr:91687292
Dumping file: "gatk.snp.super_filtered_autosomes_POST.thetas.idx.pestPG"
I'm not sure if this is something I'm doing incorrectly on my end or if it's a slight bug in generating the .thetas.idx file. Please let me know if you need any other information.
The text was updated successfully, but these errors were encountered:
I actually figured this out - the .saf.idx file was also unsorted because I multithreaded (-P 24). When I run on just one thread the files remain sorted. I'm wondering if it's possible to adjust how the realSFS utility multithreads so that files will remain sorted.
Hello, I'm trying to calculate theta values using sliding window means and there is an issue with the
.thetas.idx
file fromrealSFS saf2theta
. All SNPs have theta values calculated, but for some reason most, but not all, of the SNPs for each scaffold are out of order, causing thefirstpos
andlastpos
values to be incorrect. I'm using the bioconda install of version 0.940 and my VCF is made up of hard calls generated using GATK4. Here's the code I used to generate the theta output files:This is what the out.thetas.idx scaffold information looks like:
Due to the SNPs being out of order, when I try to run
thetaStat do_stat out.thetas.idx -win 5000 -step 1000
I get:I'm not sure if this is something I'm doing incorrectly on my end or if it's a slight bug in generating the .thetas.idx file. Please let me know if you need any other information.
The text was updated successfully, but these errors were encountered: