Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

realSFS saf2theta output (out.thetas.idx) positions out of order #598

Open
slcapel opened this issue Sep 27, 2023 · 1 comment
Open

realSFS saf2theta output (out.thetas.idx) positions out of order #598

slcapel opened this issue Sep 27, 2023 · 1 comment

Comments

@slcapel
Copy link

slcapel commented Sep 27, 2023

Hello, I'm trying to calculate theta values using sliding window means and there is an issue with the .thetas.idx file from realSFS saf2theta. All SNPs have theta values calculated, but for some reason most, but not all, of the SNPs for each scaffold are out of order, causing the firstpos and lastpos values to be incorrect. I'm using the bioconda install of version 0.940 and my VCF is made up of hard calls generated using GATK4. Here's the code I used to generate the theta output files:

# edit vcf header for ANGSD compatability
INFO='##INFO=<ID=INDEL,Number=0,Type=Flag,Description=\"\">'
bcftools view in.vcf.gz | awk -v infohdr=$INFO '/^#CHROM/ {print infohdr"\n"$0} !/^#CHROM/' | bcftools view -O z -o in_edit.vcf.gz

# estimate global SFS 
angsd -vcf-pl in_edit.vcf.gz -doSaf 1 -anc Myoluc_HiC.fa -P 24 -out out
realSFS out.saf.idx -fold 1 -P 24 > out.sfs  # calculate folded SFS due to unknown ancestral states

# calculate thetas
realSFS saf2theta out.saf.idx -sfs out.sfs -outname out
thetaStat do_stat out.thetas.idx -win 5000 -step 1000  # calculate using sliding window

This is what the out.thetas.idx scaffold information looks like:

                Information from index file:
                0       SUPER__1        497250  8       60
                1       SUPER__10       215800  356975365972    60
                2       SUPER__11       200989  515562415793    60
                3       SUPER__12       200814  660925188998    60
                4       SUPER__13       175691  803695353587    60
                5       SUPER__14       112942  932352390936    60
                6       SUPER__15       173623  1013386837125   60
                7       SUPER__16       165080  1139937497546   60
                8       SUPER__17       117020  1260393693991   60
                9       SUPER__18       164923  1345374816484   60
                10      SUPER__19       112220  1466220096393   60
                11      SUPER__2        465503  1546709593414   60
                12      SUPER__20       79091   1882840797514   60
                13      SUPER__21       23927   1940574741807   60
                14      SUPER__22       50970   1958908293492   60
                15      SUPER__3        459592  1997193424641   60
                16      SUPER__5        234115  2327556531933   60
                17      SUPER__6        281994  2496506902849   60
                18      SUPER__7        247518  2698132841805   60
                19      SUPER__8        243936  2876606637369   60
                20      SUPER__9        246447  3052961043541   60
pc.chr=SUPER__1 pc.nSites=497250 pc.nChr=60 firstpos=126323 lastpos=241748174
pc.chr=SUPER__10 pc.nSites=215800 pc.nChr=60 firstpos=42544 lastpos=83096058
pc.chr=SUPER__11 pc.nSites=200989 pc.nChr=60 firstpos=83096059 lastpos=83370225
pc.chr=SUPER__12 pc.nSites=200814 pc.nChr=60 firstpos=83370911 lastpos=81489261
pc.chr=SUPER__13 pc.nSites=175691 pc.nChr=60 firstpos=81489263 lastpos=69988656
pc.chr=SUPER__14 pc.nSites=112942 pc.nChr=60 firstpos=69988867 lastpos=58769242
pc.chr=SUPER__15 pc.nSites=173623 pc.nChr=60 firstpos=58769290 lastpos=59753403
pc.chr=SUPER__16 pc.nSites=165080 pc.nChr=60 firstpos=59753675 lastpos=55501109
pc.chr=SUPER__17 pc.nSites=117020 pc.nChr=60 firstpos=55501197 lastpos=54538609
pc.chr=SUPER__18 pc.nSites=164923 pc.nChr=60 firstpos=54538618 lastpos=51625315
pc.chr=SUPER__19 pc.nSites=112220 pc.nChr=60 firstpos=51625766 lastpos=42335022
pc.chr=SUPER__2 pc.nSites=465503 pc.nChr=60 firstpos=241748331 lastpos=212276913
pc.chr=SUPER__20 pc.nSites=79091 pc.nChr=60 firstpos=42335077 lastpos=28301775
pc.chr=SUPER__21 pc.nSites=23927 pc.nChr=60 firstpos=28302174 lastpos=24677751
pc.chr=SUPER__22 pc.nSites=50970 pc.nChr=60 firstpos=24677758 lastpos=17701304
pc.chr=SUPER__3 pc.nSites=459592 pc.nChr=60 firstpos=212277246 lastpos=211539269
pc.chr=SUPER__5 pc.nSites=234115 pc.nChr=60 firstpos=211540747 lastpos=113635203
pc.chr=SUPER__6 pc.nSites=281994 pc.nChr=60 firstpos=113635310 lastpos=111356735
pc.chr=SUPER__7 pc.nSites=247518 pc.nChr=60 firstpos=111356987 lastpos=96758326
pc.chr=SUPER__8 pc.nSites=243936 pc.nChr=60 firstpos=96768413 lastpos=91674074
pc.chr=SUPER__9 pc.nSites=246447 pc.nChr=60 firstpos=92442959 lastpos=91687292

Due to the SNPs being out of order, when I try to run thetaStat do_stat out.thetas.idx -win 5000 -step 1000 I get:

                Information from index file:
                0       SUPER__1        497250  8       60
                1       SUPER__10       215800  356975365972    60
                2       SUPER__11       200989  515562415793    60
                3       SUPER__12       200814  660925188998    60
                4       SUPER__13       175691  803695353587    60
                5       SUPER__14       112942  932352390936    60
                6       SUPER__15       173623  1013386837125   60
                7       SUPER__16       165080  1139937497546   60
                8       SUPER__17       117020  1260393693991   60
                9       SUPER__18       164923  1345374816484   60
                10      SUPER__19       112220  1466220096393   60
                11      SUPER__2        465503  1546709593414   60
                12      SUPER__20       79091   1882840797514   60
                13      SUPER__21       23927   1940574741807   60
                14      SUPER__22       50970   1958908293492   60
                15      SUPER__3        459592  1997193424641   60
                16      SUPER__5        234115  2327556531933   60
                17      SUPER__6        281994  2496506902849   60
                18      SUPER__7        247518  2698132841805   60
                19      SUPER__8        243936  2876606637369   60
                20      SUPER__9        246447  3052961043541   60
         -r=(null) outnames=(null) step: 1000 win: 5000
        pc.chr=SUPER__1 pc.nSites=497250 firstpos=126323 lastpos=241748174
        pc.chr=SUPER__10 pc.nSites=215800 firstpos=42544 lastpos=83096058
        pc.chr=SUPER__11 pc.nSites=200989 firstpos=83096059 lastpos=83370225
        pc.chr=SUPER__12 pc.nSites=200814 firstpos=83370911 lastpos=81489261
end of dataset is before end of window: end of window:83376000 last position in chr:81489261
        pc.chr=SUPER__13 pc.nSites=175691 firstpos=81489263 lastpos=69988656
end of dataset is before end of window: end of window:81495000 last position in chr:69988656
        pc.chr=SUPER__14 pc.nSites=112942 firstpos=69988867 lastpos=58769242
end of dataset is before end of window: end of window:69994000 last position in chr:58769242
        pc.chr=SUPER__15 pc.nSites=173623 firstpos=58769290 lastpos=59753403
        pc.chr=SUPER__16 pc.nSites=165080 firstpos=59753675 lastpos=55501109
end of dataset is before end of window: end of window:59759000 last position in chr:55501109
        pc.chr=SUPER__17 pc.nSites=117020 firstpos=55501197 lastpos=54538609
end of dataset is before end of window: end of window:55507000 last position in chr:54538609
        pc.chr=SUPER__18 pc.nSites=164923 firstpos=54538618 lastpos=51625315
end of dataset is before end of window: end of window:54544000 last position in chr:51625315
        pc.chr=SUPER__19 pc.nSites=112220 firstpos=51625766 lastpos=42335022
end of dataset is before end of window: end of window:51631000 last position in chr:42335022
        pc.chr=SUPER__2 pc.nSites=465503 firstpos=241748331 lastpos=212276913
end of dataset is before end of window: end of window:241754000 last position in chr:212276913
        pc.chr=SUPER__20 pc.nSites=79091 firstpos=42335077 lastpos=28301775
end of dataset is before end of window: end of window:42341000 last position in chr:28301775
        pc.chr=SUPER__21 pc.nSites=23927 firstpos=28302174 lastpos=24677751
end of dataset is before end of window: end of window:28308000 last position in chr:24677751
        pc.chr=SUPER__22 pc.nSites=50970 firstpos=24677758 lastpos=17701304
end of dataset is before end of window: end of window:24683000 last position in chr:17701304
        pc.chr=SUPER__3 pc.nSites=459592 firstpos=212277246 lastpos=211539269
end of dataset is before end of window: end of window:212283000 last position in chr:211539269
        pc.chr=SUPER__5 pc.nSites=234115 firstpos=211540747 lastpos=113635203
end of dataset is before end of window: end of window:211546000 last position in chr:113635203
        pc.chr=SUPER__6 pc.nSites=281994 firstpos=113635310 lastpos=111356735
end of dataset is before end of window: end of window:113641000 last position in chr:111356735
        pc.chr=SUPER__7 pc.nSites=247518 firstpos=111356987 lastpos=96758326
end of dataset is before end of window: end of window:111362000 last position in chr:96758326
        pc.chr=SUPER__8 pc.nSites=243936 firstpos=96768413 lastpos=91674074
end of dataset is before end of window: end of window:96774000 last position in chr:91674074
        pc.chr=SUPER__9 pc.nSites=246447 firstpos=92442959 lastpos=91687292
end of dataset is before end of window: end of window:92448000 last position in chr:91687292
        Dumping file: "gatk.snp.super_filtered_autosomes_POST.thetas.idx.pestPG"

I'm not sure if this is something I'm doing incorrectly on my end or if it's a slight bug in generating the .thetas.idx file. Please let me know if you need any other information.

@slcapel
Copy link
Author

slcapel commented Sep 28, 2023

I actually figured this out - the .saf.idx file was also unsorted because I multithreaded (-P 24). When I run on just one thread the files remain sorted. I'm wondering if it's possible to adjust how the realSFS utility multithreads so that files will remain sorted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant