Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting allele frequencies for different populations cutting a beagle file or using a -sites file? #610

Open
cleivama opened this issue Jan 4, 2024 · 0 comments

Comments

@cleivama
Copy link

cleivama commented Jan 4, 2024

Hello,

I am trying to get allele frequencies for different populations, in order to run association analises with environmental variables, such as an RDA. I've tried two different ways, which I describe below, but both failed: 1- splitting a beagle file, 2- using the -sites option and bam files as input.

1- Splitting a beagle file:

  • First, I ran angsd with all the samples from all populations, to obtain the genotype likelihoods in beagle format for all individuals and a common list of SNPs:

$ANGSD -b $BAMLIST -ref $REF -out $POPDIR/Tmax_$NAMEPOP'_allSNPs' -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -minMapQ 20 -minQ 20 -minInd 137 -setMinDepth 172 -setMaxDepth 722 -doCounts 1 -GL 2 -doGlf 2 -doPost 1 -doMajorMinor 1 -doMaf 1 -doBcf 1 -doGeno 2 -SNP_pval 1e-6 -minMaf 0.02

  • Then I split the beagle.gz file in different populations. For this I have to take into account that every 3 columns of the beagle file is one individual, and that I want the first 3 columns in all populations, which are the markers, allele1 and allele2. I only include 3 examples here, the first three populations:

gunzip Tmax_ALL_Tmax_allSNPs.beagle.gz #I gunzipped it but I could have used zcat in the following lines
cut -f1-42 Tmax_ALL_Tmax_allSNPs.beagle | gzip > ./RS/Tmax_ALL_Tmax_RS_allSNPs.beagle.gz #for the first population
cut -f1-3,43-60 Tmax_ALL_Tmax_allSNPs.beagle | gzip > ./EUR/Tmax_ALL_Tmax_EUR_allSNPs.beagle.gz #for the second population
cut -f1-3,61-105 Tmax_ALL_Tmax_allSNPs.beagle | gzip > ./ROD/Tmax_ALL_Tmax_ROD_allSNPs.beagle.gz #for the third population

  • Now, from here, I would like to use angsd again to transform each population beagle.gz file to allele frequencies. The problem is that it only works for the first population using the following command:

$ANGSD -beagle Tmax_ALL_Tmax_RS_allSNPs.beagle.gz -doMaf 4 -out Tmax_af_RS -fai $FAIFILE

  • For the other populations, I get a "Segmentation fault" error, when I try to run the script above with the other beagle.gz files as input:

     -> angsd version: 0.938 (htslib: 1.15.1) build(Aug 18 2022 07:14:43)
     -> /media/SCRATCH/martinezc/SOFTWARE/ANGSD/angsd -beagle Tmax_ALL_Tmax_EUR_allSNPs.beagle.gz -doMaf 4 -out Tmax_af_EUR -fai /media/SCRATCH/martinezc/Tridacna/WGS/Tmaxima/assemblies/Tridacna_maxima_HiRise_sorted.fasta.fai 
     -> Inputtype is beagle
    

Segmentation fault

  • I thought it could be due to the individuals not starting at Ind0, because they came from a bigger beagle file that was cut, so I used sed commands to change the names of the individuals to start from Ind0, but I get the same error. For example, something like this:

zcat Tmax_ALL_Tmax_ROD_allSNPs.beagle.gz | sed 's/Ind19/Ind0/g' | sed 's/Ind20/Ind1/g' | sed 's/Ind21/Ind2/g' | sed 's/Ind22/Ind3/g' | sed 's/Ind23/Ind4/g' | sed 's/Ind24/Ind5/g' | sed 's/Ind25/Ind6/g' | sed 's/Ind26/Ind7/g' | sed 's/Ind27/Ind8/g' | sed 's/Ind28/Ind9/g' | sed 's/Ind29/Ind10/g' | sed 's/Ind30/Ind11/g' | sed 's/Ind31/Ind12/g' | sed 's/Ind32/Ind13/g' | sed 's/Ind33/Ind14/g' | gzip > Tmax_ALL_Tmax_ROD_allSNPs_fixed.beagle.gz

$ANGSD -beagle Tmax_ALL_Tmax_ROD_allSNPs_fixed.beagle.gz -doMaf 4 -out Tmax_af_ROD_fixed -fai $FAIFILE

    -> Inputtype is beagle

Segmentation fault

  • I also tried to cut in two other different ways, in case it was due to the cutting script. I used the -complement option to remove the columns that I don't want for each population, and also cutting the beagle file using the awk instead of cut, but I get the same Segmentation fault error.

QUESTION: I think it's weird that it only works for the first population (RS) but not for the rest of beagle files, right? Am I doing anything wrong? Am I not supposed to cut a beagle file like I'm doing? I can't see how these files are different, I had a look at them using zcat $FILE | less and all look exactly the same as the one from the first populations (RS) that works fine.

2- Then the second approach that I've tried is to obtain a sites file from the beagle file from the first run:

zcat Tmax_ALL_Tmax_allSNPs.mafs.gz | cut -f 1-4 > Tmax_allSNPs_Mm.pos

I'm cutting the first four columns so I also have the major and minor alleles, and then I removed the first row which is the header.

  • My idea here was to use this sites file, running angsd for each population again from the bam files, but after doing it, I get a different number of SNPs in each run, which I don't understand as I'm using a sites file and no other filters. I wanna have the same number of SNPs in the same order so I will be able to make a matrix combining the allele frequencies from the different populations. This is the loop I'm using:

for POP in cat $POPLIST; do

$ANGSD -b $BAMLSDIR/'Tmax_bamlist_'$POP'.txt' -ref $REF -fai $FAI -uniqueOnly 1 -remove_bads 1 -only_proper_pairs 1 -trim 0 -C 50 -minMapQ 20 -minQ 20 -doMajorMinor 3 -sites $SITES -doMaf 1 -out $OUTDIR/$POP/'Tmax_af_'$POP -nThreads 10 -GL 2 -doGlf 2 -doCounts 1

done

Then I count the number of lines in the resulting mafs.gz files with wc -l, and get a different number, which is always lower than the number of lines in the sites file, so there is a filter set there that is filtering some sites that are not being written in the output files.

QUESTION: How is it possible that I don't get the same number of SNPs using the same sites file?
The only thing I could think about that could be filtering SNPs was the SNP p-value, so I also tried runnning the previous loop adding a -SNP_pval 1 option, so it wouldn't filter any SNP(?), but I get exactly the same number of SNPs. I checked the .arg files and all filters are disables (-1), and I also tried changing the skipMissing option from 1 to 0, but I got the same result.

Am I doing anything wrong? Is there an easy way to obtain the allele frequencies with the files that I already have?

I appreciate any advice and help on this issue.

Many thanks in advance!
Carlos

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant