Skip to content

rtfcoimbra/Coimbra-et-al-2021_CurrBiol

Repository files navigation

DOI

Code from: Whole-genome analysis of giraffe supports four distinct species

Code used to analyze whole-genome sequencing data of giraffe in Coimbra et al. (2021):

  • Coimbra RTF, Winter S, Kumar V, Koepfli K-P, Gooley RM, Dobrynin P, Fennessy J, Janke A (2021) Whole-genome analysis of giraffe supports four distinct species. Current Biology, 31(13):P2929-2938.E5. https://doi.org/10.1016/j.cub.2021.04.033

Note: The scripts used to generate plots do not necessarily reproduce the figures exactly as shown in the paper. In many cases, I used a free image editing software, namely Krita, to assemble independent plots and add or correct plot annotations. I tried to use code as much as I could for reproducibility, but my skills are still not quite in the level of coding everything.

Kordofan giraffe genome assembly and annotation

  • denovo_assembly.sh: generate a de novo pseudohaploid genome assembly from a Chromium-prepared library using Supernova.
  • assembly_qc: assess genome assembly quality with QUAST, BUSCO, and JupiterPlot.
  • process_assembly.sh: index assembly, annotate repeats with RepeatMasker and RepeatModeler, mask FASTA, extract scaffolds >= 1 Mbp from masked FASTA, and create a BED file of non-repetitive regions in scaffolds >= 1 Mbp.
  • annotate_coding.sh: soft-mask assembly and annotate protein-coding genes with BRAKER using protein sequences from Bos taurus (GCA_002263795.2) as extrinsic evidence.

Quality control and read mapping

  • remove_10Xbarcodes.sh: remove 10X Chromium barcodes from sample ENP11 with proc10xG.
  • trim_reads.sh: trim paired-end reads with trimmomatic.
  • map_reads.sh: map reads against the Kordofan giraffe assembly with BWA and sort the output BAMs with samtools.
  • merge_bams.sh: merge same sample BAMs from different lanes with samtools.
  • edit_readID.sh: remove 10X Chromium barcode sequences from read ID in the BAM file of sample ENP11.
  • deduplicate_bams.sh: mark PCR/optical duplicate reads with Picard MarkDuplicates.
  • mapping_qc.sh: assess mapping quality with QualiMap.
  • realigner_target_creator.sh: create list of target intervals for indel realignment with GATK.
  • indel_realigner.sh: perform local realignment around indels with GATK.
  • clean_bams.sh: remove bad reads (flags 4, 256, 512, 1024 or 2048) from BAM files and keep only properly paired reads (flag 2) mapped to non-repetitive regions in scaffolds >= 1 Mbp with samtools.

SNP calling and linkage pruning

  • snp_calling.sh and site_depth_stats.py: calculate the global site depth for multiple giraffe BAMs with sambamba, compute summary statistics from the global site depth distribution (i.e. 5th and 95th percentiles, median, and median absolute deviation) with Python (NumPy and SciPy), and estimate genotype likelihoods with ANGSD.
  • ld_pruning.sh: calculate pairwise linkage disequilibrium with ngsLD, plot LD decay curve with fit_LDdecay.R, prune linked sites with prune_graph.pl, and extract unlinked sites from ANGSD's genotype likelihoods file.

Population structure and admixture analyses

  • pcangsd_hwe.sh: estimate covariance matrix and perform a Hardy-Weinberg equilibrium test with PCAngsd.
  • plot_3dPCA.R: perform PCA and make a 3-dimensional plot with a custom R script.
  • ngsadmix.sh: estimate admixture proportions with NGSadmix.
  • plot_admixture.R: plot admixture proportions with a custom R script.
  • plot_likelihoods_deltaK.R: plot run likelihoods for each K, generate input file for the delta K analysis (Evanno et al. 2005) implemented in the CLUMPAK webserver, and plot delta K values (after running CLUMPAK separately).

Genetic distances and neighbor-joining tree

  • snp_calling_gendist.sh and site_depth_stats.py: similar to snp_calling.sh but including the okapi BAM as an outgroup.
  • gendist_tree.sh and make_pops_label.py: calculate bootstrapped genetic distance matrices with ngsDist and generate a BioNJ tree with FastME.
  • plot_bionj_tree.R: plot BioNJ tree with ggtree.

Nuclear phylogenomic inference

The scripts for phylogenomic analyses with genome fragments presented here were based on and modified from Fritjof Lammers' original pipeline used in Árnason et al. (2018).

  • generate_fasta.sh and site_depth_stats.py: calculate per-base depth from cleaned BAMs using sambamba and generate consensus sequences in FASTA format with ANGSD.
  • clean_fasta.sh and concatenate_fasta.py: remove repetitive regions from the consensus sequences with bedtools and a custom Python script and calculate the percentage of ambiguous sites in each sequence.
  • gf_phylogenomics.txt: workflow for the phylogenomic analyses with genome fragments (GFs) - i.e. generate GFs with a custom Python script, perform approximately unbiased (AU) tree topology test with IQ-TREE, infer multispecies coalescent tree with ASTRAL, analyze quartet frequencies with DiscoVista.
  • genome_fragments.py: generate GF alignments from consensus genome sequences.
  • check_gfs.py: check the percentage of ambiguous sites (Ns) per sequence in each GF alignment and save a list of "good" and "bad" GFs (check code comments for description). Only "good" GFs were used for analyses.
  • iqtree_parallel.sh and iqtree_au_test.sh: perform AU tree topology test with IQ-TREE and collect the inferred trees and p-AU values.
  • parse_iqlog.py: extract pAU values from IQ-TREE's log file.
  • plot_pAU.R: plot the pAU values of each topology for each GF size (with confidence intervals) and plot the alternative topologies analyzed in the AU test.

Assembly and phylogeny of mitochondrial genomes

  • mitogenomes_workflow.txt: subsample raw reads with seqtk, assemble and annotate mitochondrial genomes with MitoZ, extract protein coding sequences, and construct maximum likelihood phylogeny based on protein coding sequences with IQ-TREE. Note: some steps must be performed manually and/or with the use of independent GUI programs.
  • plot_mtdna_tree.R: plot maximum likelihood mitochondrial tree with ggtree.

Demographic reconstruction

  • psmc_bootstrap.sh and site_depth_stats.py: calculate per-base depth from cleaned BAMs using sambamba, generate consensus genome sequences in FASTQ format with bcftools, and infer giraffe effective population size changes through time with the PSMC model.

Heterozygosity

  • generate_foldedSFS.sh: generate the per sample folded site frequency spectrum (SFS) with ANGSD.
  • plot_heterozygosity.R: calculate and plot global heterozygosity per sample in R with tidyverse.

ROH and Inbreeding

  • genotype_calling.sh: call genotypes with bcftools using the SNPs called with ANGSD and convert BCF to Oxford GEN format.
  • estimate_roh.R: identify runs of homozygosity (ROH) and estimate realized inbreeding coefficients (FROH) with RZooRoH.
  • plot_froh.R: plot FROH and the number vs. the accumulated length of ROH with tidyverse and patchwork.