Skip to content

rtfcoimbra/Coimbra-et-al-2023_BMCBiol

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

DOI

Code from: Genomic analysis reveals limited hybridization among three giraffe species in Kenya

Code used to analyze whole-genome sequencing data of giraffe in Coimbra et al. (2023):

  • Coimbra RTF, Winter S, Muneza A, Fennessy S, Otiende M, Mijele D, Masiaine S, Stacy-Dawes J, Fennessy J, Janke A (2023) Genomic analysis reveals limited hybridization among three giraffe species in Kenya. BMC Biology, 21:215. DOI: https://doi.org/10.1186/s12915-023-01722-y

Note: The plotting scripts do not necessarily reproduce the figures exactly as shown in the paper. In some cases, I used a free image editing software, namely Krita, to assemble independent plots and add or correct plot annotations.

Workflow

  • workflow.txt: describes the steps and the context in which the scripts described below were used for processing and analyzing the whole-genome sequencing data of giraffe.

Read quality control

  • trim_reads_v2.sh and trim_reads_v2-sra.sh: trim paired-end reads with fastp.

Read mapping

  • map_reads_v2.sh: map reads against the Masai giraffe assembly with BWA and sort the output BAMs with samtools.
  • mark_duplicates.sh: mark PCR/optical duplicate reads with Picard MarkDuplicates.
  • mapping_flagstats.sh: calculate mapping statistics with samtools.
  • realigner_target_creator.sh: create a list of target intervals for indel realignment with GATK.
  • indel_realigner.sh: perform local realignment around indels with GATK.
  • clean_bams_v3.sh: remove bad reads (flags 4, 256, 512, or 1024) from BAM files and keep only properly paired reads (flag 2) mapped to non-repetitive regions in autosomes with samtools.

Site depth statistics

  • site_depth_sample.sh: calculate the mean site depth per sample with mosdepth.
  • site_depth_global.sh: calculate the global site depth for multiple giraffe BAMs with sambamba.
  • site_depth_stats_v2.py: calculate summary statistics from the global site depth distribution (i.e. 5th and 95th percentiles, median, and median absolute deviation) with Python (NumPy and SciPy).

SNP and genotype calling

  • snp_calling_v2.sh: estimate genotype likelihoods with ANGSD.
  • genotype_calling.sh: call genotypes with bcftools.

Linkage pruning

  • calc_pairwise_ld.sh: calculate pairwise linkage disequilibrium with ngsLD.
  • ld_decay_multi.sh: fit LD decay curves for multiple species with fit_LDdecay.R.
  • ld_pruning.sh: prune linked sites with prune_graph.pl and extract unlinked sites from ANGSD's genotype likelihoods file.
  • process_pruned_snps.sh: combine pruned SNPs of all species, split them into separate files by autosome, and index resulting files for ANGSD.

Relatedness

  • infer_relatedness.sh: infer pairwise relatedness among individuals based with NGSadmix and NGSremix.
  • plot_figureS1.R: plot the relatedness coefficient as a heatmap and find the number of individuals to exclude (and identify them) for increasing thresholds of relatedness. Note: this script was co-developed with Emma Vinson.

Population structure and admixture analyses

  • pcangsd_hwe.sh: estimate covariance matrix and perform a Hardy-Weinberg equilibrium test with PCAngsd.
  • ngsadmix.sh: estimate admixture proportions with NGSadmix.
  • evaladmix.sh: calculate a matrix of pairwise correlation of residuals between individuals using evalAdmix to test the goodness of fit of the data to an admixture model.

SNP-based phylogenomic inference

Assembly and phylogeny of mitochondrial genomes

  • See workflow.txt.

Inference of migration events and test for introgression

  • infer_admixture_graphs.sh: converts a VCF file into TreeMix input with PLINK and plink2treemix, estimates admixture graphs with either TreeMix or OrientAGraph allowing for a range of migration edges (m) with multiple bootstrap replicates, and finds the replicate with the highest likelihood per m value.
  • estimate_fbranch.sh: calculate Patterson’s D, f4-ratio, and f-branch statistics with Dsuite.

Contemporary migration rates

Demographic reconstruction

  • create_fasta.sh: create genome consensus sequence in FASTA format with sambamba and ANGSD.
  • estimate_sfs.sh: estimate the unfolded site frequency spectrum (SFS) with ANGSD and realSFS.

Figures

  • Figure 1: QGIS + plot_figure1b-e.R + visFuns_modified.R (script slightly modified from visFuns.R from evalAdmix)
  • Figure 2: plot_figure2.R
  • Figure 3: plot_figure3a.R + estimate_fbranch.sh + plot_figure3c.R
  • Figure 4: plot_figure4.R
  • Figure S1: plot_figureS1.R
  • Figure S2: plot_figureS2.R
  • Figure S3: plot_figureS3.R
  • Figure S4: plot_figureS4.R + visFuns_modified.R (script slightly modified from visFuns.R from evalAdmix)
  • Figure S5: plot_figureS5.R
  • Figure S6: plot_figureS6.R
  • Figure S7: ld_decay_multi.sh
  • Figure S8: plot_figureS8.R