Skip to content

ANGSD/NgsRelate

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

30juni 2018. Added new version which does analysis from bcf/vcf and outputs all nine jacquards

This page refers to the new v2 of NgsRelate which coestimates relatedness and inbreeding. For the old version please use this link: http://www.popgen.dk/software/index.php?title=NgsRelate&oldid=694

Brief description

This page contains information about the program called NgsRelate, which can be used to infer relatedness, inbreeding coefficients and many other summary statistics for pairs of individuals from low coverage Next Generation Sequencing (NGS) data by using genotype likelihoods instead of called genotypes. To be able to infer the relatedness you will need to know the population allele frequencies and have genotype likelihoods. This can be obtained e.g. using the program ANGSD as shown in example 1 below. For more information about ANGSD see here: http://popgen.dk/angsd/index.php/Quick_Start. As of version 2, VCF/BCF files can also be parsed.

Options

$ ./ngsRelate 

Usage main analyses: ./ngsrelate  [options] 
Options:
   -f <filename>       Name of file with frequencies
   -O <filename>       Output filename
   -L <INT>            Number of genomic sites. Must be provided if -f (allele frequency file) is NOT provided 
   -m <INTEGER>        model 0=normalEM 1=acceleratedEM
   -i <UINTEGER>       Maximum number of EM iterations
   -t <FLOAT>          Tolerance for breaking EM
   -r <FLOAT>          Seed for rand
   -R <chr:from-to>    Region for analysis (only for bcf)
   -g gfile            Name of genotypellh file
   -p <INT>            threads (default 4)
   -c <INT>            Should call genotypes instead?
   -s <INT>            Should you swich the freq with 1-freq?
   -F <INT>            Estimate inbreeding instead of estimating the nine jacquard coefficients
   -o <INT>            estimating the 3 jacquard coefficient, assumming no inbreeding
   -v <INT>            Verbose. print like per iteration
   -e <FLOAT>            Errorrates when calling genotypes?
   -a <INT>            First individual used for analysis? (zero offset)
   -b <INT>            Second individual used for analysis? (zero offset)
   -B <INT>            Number of bootstrap replicates for (only for single pairs)
   -N <INT>            How many times to start each pair with random seed?
   -n <INT>            Number of samples in glf.gz
   -l <INT>            minMaf or 1-Maf filter
   -z <INT>            Name of file with IDs (optional)
   -T <STRING>         For -h vcf use PL (default) or GT tag
   -A <STRING>         For -h vcf use allele frequency TAG e.g. AFngsrelate (default)
   -P <filename>       plink name of the binary plink file (excluding the .bed)

Or
 ./ngsrelate extract_freq_bim pos.glf.gz plink.bim plink.freq
Or
 ./ngsrelate extract_freq .mafs.gz .pos.glf.gz [-rmTrans]
Or
 ./ngsrelate -h my.bcf [DEVELOPMENT ONLY]

How to download and install

On a linux or mac system with curl and g++ installed NgsRelate can be downloaded and installed as follows:

git clone --recursive https://github.com/SAMtools/htslib
git clone https://github.com/ANGSD/ngsRelate
cd htslib/;make -j2;cd ../ngsRelate;make HTSSRC=../htslib/

Run examples

Run example 1: using only NGS data

Below is an example of how NgsRelate can be used to coestimate relatedness and inbreeding from NGS data. Assume we have file (filelist) containing paths to 100 BAM/CRAM files; one line per BAM/CRAM file. Then we can use ANGSD to estimate allele frequencies and calculate genotype likelihoods while doing SNP calling and in the end produce the the two input files needed for the NgsRelate program as follows:

### First we generate a file with allele frequencies (angsdput.mafs.gz) and a file with genotype likelihoods (angsdput.glf.gz).
./angsd -b filelist -gl 2 -domajorminor 1 -snp_pval 1e-6 -domaf 1 -minmaf 0.05 -doGlf 3

### Then we extract the frequency column from the allele frequency file and remove the header (to make it in the format NgsRelate needs)
zcat angsdput.mafs.gz | cut -f5 |sed 1d >freq

### run NgsRelate
./ngsrelate  -g angsdput.glf.gz -n 100 -f freq  -O newres

The output should be a file called newres that contains the output for all pairs between six individuals.

Run example 2: using only VCF/BCF files

As of version 2, NgsRelate can parse BCF/VCF files using htslib with the following command:

./ngsrelate  -h my.VCF.gz -O vcf.res

By default, NgsRelate will estimate the allele frequencies using the individuals provided in the VCF files. Allele frequencies from the INFO field can used be used instead using -A TAG. The TAG usually take the form of AF or AF1 but can be set to anything. By default the PL data (Phred-scaled likelihoods for genotypes) is parsed, however, the called genotypes can also be used instead with the -T GT option. If called genotypes are being used, the software requires an additional argument (-c 1). If using -c 2, ngsRelate calls genotypes assuming hardy-weinberg.

Input file format

Genotype likelihood input

NgsRelate takes two files as input: a file with genotype likelihoods (-g) and a file with population allele frequencies (-f) for the sites there are genotype likelihoods for. The genotype likelihood file needs to contain a line for each site with 3 values for each individual (one log transformed genotype likelihood for each of the 3 possible genotypes encoded as 'double's) and it needs to be in binary format and gz compressed. The frequency file needs to contain a line per site with the allele frequency of the site in it.

NgsRelate also calculates a few summary statistics based on the 2dsfs instead of known population allele frequencies. Even if the population allele frequencies are not available, these summary statistics can still be estimated with NgsRelate by providing it with the number of sites (-L) instead of the allele frequency file (-f).

VCF input

NgsRelate takes a standard VCF file generated with e.g. bcftools. By default, NgsRelate will estimate the allele frequencies using the individuals provided in the VCF files. As shown in example 2, external allele frequencies can be provided with -f. These frequencies will overwrite the ones estimated from the VCF file.

Output format

a  b  nSites  J9        J8        J7        J6        J5        J4        J3        J2        J1        rab       Fa        Fb        theta     inbred_relatedness_1_2  inbred_relatedness_2_1  fraternity  identity  zygosity  2of3IDB   FDiff      loglh           nIter  coverage  2dsfs                                                                             R0        R1        KING       2dsfs_loglike   2dsfsf_niter
0  1  99927   0.384487  0.360978  0.001416  0.178610  0.071681  0.000617  0.002172  0.000034  0.000005  0.237300  0.002828  0.250330  0.127884  0.001091                0.035846                0.001451    0.000005  0.001456  0.345411  -0.088997  -341223.335664  103    0.999270  0.154920,0.087526,0.038724,0.143087,0.155155,0.139345,0.038473,0.087632,0.155138  0.497548  0.290124  0.000991   -356967.175857  7

The first two columns contain indices of the two individuals used for the analysis. The third column is the number of genomic sites considered. The following nine columns are the maximum likelihood (ML) estimates of the nine jacquard coefficients, where K0==J9; K1==J8; K2==J7 in absence of inbreeding. Based on these Jacquard coefficients, NgsRelate calculates 11 summary statistics:

  1. rab is the pairwise relatedness (J1+J7+0.75*(J3+J5)+.5*J8) Hedrick et al
  2. Fa is the inbreeding coefficient of individual a J1+J2+J3+J4 Jacquard
  3. Fb is the inbreeding coefficient of individual b J1+J2+J5+J6 Jacquard
  4. theta is the coefficient of kinship J1 + 0.5*(J3+J5+J7) + 0.25*J8) Jacquard
  5. inbred_relatedness_1_2 J1+0.5*J3 Ackerman et al
  6. inbred_relatedness_2_1 J1+0.5*J5 Ackerman et al
  7. fraternity J2+J7 Ackerman et al
  8. identity J1 Ackerman et al
  9. zygosity J1+J2+J7 Ackerman et al
  10. Two-out-of-three IBD J1+J2+J3+J5+J7+0.5*(J4+J6+J8) Miklos csuros
  11. Inbreeding difference 0.5*(J4-J6) Miklos csuros
  12. the log-likelihood of the ML estimate.
  13. number of EM iterations. If a -1 is displayed. A boundary estimate had a higher likelihood.
  14. If differs from -1, a boundary estimate had a higher likelihood. Reported loglikelihood should be highly similar to the corresponding value reported in loglh
  15. fraction of sites used for the ML estimate

The remaining columns relate to statistics based on a 2D-SFS.

  1. 2dsfs estimates using the same methodology as implemented in realSFS, see ANGSD
  2. R0 Waples et al
  3. R1 Waples et al
  4. KING Waples et al
  5. the log-likelihood of the 2dsfs estimate.
  6. number of iterations for 2dsfs estimate

You can also input a file with the IDs of the individuals (on ID per line), using the -z option, in the same order as in the file filelist used to make the genotype likelihoods or the VCF file. If you do the output will also contain these IDs as column 3 and 4.

Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if these have the highest likelihood).

Help and additional options

To get help and a list of all options simply type

./ngsrelate

Note that for the new version of ngsRelate it is not necessary to flip the allele frequencies if the input allele frequencies are for the minor allele.

running version 1 of NgsRelate

To run the old version (V1) that assumes both individuals to be out-bred, just add -o 1 to your command. This will force Jacqaurd coefficient 1-6 (the coefficients related to inbreeding) to zero.

./ngsrelate -o 1

Test for inbreeding only

To only estimate inbreeding, use the following command

./ngsrelate -F 1

Citing and references

Changelog

Contacts

thorfinn@binf.ku.dk, k.hanghoej@snm.ku.dk, and ida@binf.ku.dk

Testdataset

bcftools call -v -m test.bcf -Ob >small.bcf
bcftools view small.bcf -Oz -o small.vcf.gz -m2 -M2 -v snps
plink --vcf small.vcf.gz --make-bed --out small 
bgzip small.bed
md5sum small.* >small.md5

check glf file in R

a <- file("test.glf", "rb")
nobs <- 1e8 # large number 3*N*Nsites
N <- 6 # number of individuals
mat <- matrix(readBin(a, "double", nobs ), byrow=T, ncol=3*N)
con <- file("outfile.glf", "wb")
writeBin(c(t(mat)), con=con)
close(con)