Skip to content

Scripts and notes related to the manuscript: Stuart KC et al. 2022. Historical museum samples enable the examination of divergent and parallel evolution during invasion. Molecular Ecology, 31(1): 1836-1852, doi.org/10.1111/mec.16353

Notifications You must be signed in to change notification settings

katarinastuart/Sv4_HistoricalStarlings

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Sv4_HistoricalStarlings

Scripts and notes related to manuscript:

Stuart KC, Sherwin WB, Austin JJ, Bateson M, Eens M, Brandley MC, Rollins LA 2022. Historical museum samples enable the examination of divergent and parallel evolution during invasion. Molecular Ecology, 31(1): 1836-1852, doi.org/10.1111/mec.16353

This includes:

2021_11_20_notebook_10049: PopGen analysis

2021_11_20_notebook_10040: Outlier analysis

This page contains small vignette's on the following

  • DArTseq SNP variant calling
  • SNP variant calling for raw RRS (DArTseq) data (with reference)

    The following is a summary of some SNP calling pipelines that I have used to process raw DArTseq data, but they may be applied to other types of reduced representation sequencing (RRS) raw data (alongside a reference genome).

    Variant calling falls into a few primary steps:

  • Mapping raw reads to a reference genome (or for refernce free variant calling forming the pseudogenome and calling references to this)
  • Calling the variant SNPs for each sample based on the mapped reads
  • Filtering the called SNPs, however I won't cover this here as it is quite project specific.
  • For the first two steps of the process, you can mix and match mapping and variant calling programs, though some are best paired. The below code covers:

    • Mapping with BWA mem and aln, then variant calling with Stacks Gstacks.
    • Mapping with bowtie, then variant calling with GATK.

    Ulitmately I used BWA aln as my mapping software of choice, as this is recommended for Illumina single-end reads shorter than ~70bp, which is good for DArT-seq.

    Software

    Below is a list of the modules used (and versions).

    module load stacks/2.2
    module load fastqc/0.11.8
    module load bwa/0.7.17
    module load bowtie/2.3.5.1
    module load java/8u121
    module load samtools/1.10
    module load picard/2.18.26
    module load gatk/4.1.0.0
    

    Data cleaning: Stacks

    RAW_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/rawdata/batch2_split
    OUTPUT_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch
    BARCODE_DIR=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/barcodes
    
    process_radtags -p ${RAW_DIR}/ -o ${OUTPUT_DIR}/ -b ${BARCODE_DIR}/bardcode_named.txt -r -c -q --renz_1 pstI --renz_2 sphI
    

    Run QC:

    fastqc *.fq.gz --outdir=fastqc/
    

    Count reads:

    zcat *.fq.gz | echo $((`wc -l`/4))
    

    Mapping: BWA aln and mem

    Place your genome somewhere sensible and index it for BWA. Note that I've put my index files into a folder called bwa, and have given them a prefix (-p). '&> bwa/bwa_index.oe' allows me to keep a record of the output/error messages.

    genome_fa=Sturnus_vulgaris_2.3.1.simp.fasta
    bwa index -p bwa/stuv $genome_fa &> bwa/bwa_index.oe
    bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv
    

    Aligning with bwa mem

    Here I have my sample names as a list within the file 'sample_names.txt'. We are working with single-end reads, so provide just one fastq file. If this was paired-end reduced representation or whole genome sequencing, we could provide a second R2 fastq file.
    for i in cat sample_names.txt;
    do
    sample=$i
    echo WORKING WITH ${sample}
    fq_file=../../samples_batch/merged/${sample}.fq.gz
    bam_file=${sample}.bam
    bwa_db=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/bwa/stuv
    bwa mem -t 16 -M $bwa_db $fq_file | samtools view -bS | samtools sort > $bam_file
    done
    

    Check reads mapped successfully:

    samtools flagstat SAMPLENAME.bam #sub out SAMPLENAME with a few sample names
    

    Aligning with bwa aln

    Aligning with bwa aln is very similar. Note that in the above example I pipe the sam file straight into a sorted bam file, where as here I don't. It doesn't make too much of a difference when you are working on small data files, but when working on whole genome data moving the data into bam form straight away saves a lot of space.

    for sample in cat sample_names.txt;
    do
    echo WORKING WITH ${sample}
    bwa aln -t 16 -B 5 $bwa_db ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sai
    bwa samse $bwa_db ${sample}.sai ../../samples_batch/merged/${sample}.fq.gz > ${sample}.sam
    samtools view -bS ${sample}.sam | samtools sort -o ${sample}.bam
    done
    

    Check reads mapped successfully

    samtools flagstat ABGYY.bam
    samtools flagstat ATB1.bam
    

    ATB18: 2935527/4370005=67.17% mapped
    NOR17: 1787288/2733999=65.37% mapped

    As you can see, my historical samples had much lower mapping rates:

    samtools flagstat HIST11.bam
    samtools flagstat HIST12.bam
    samtools flagstat HIST8.bam
    

    HIST11: 146792/2349968=6.25% mapped
    HIST12:275346/3076038=8.95% mapped
    HIST8: 156867/2843194=5.52%% mapped

    Variant Calling: gstacks

    For BWA mem and aln, the output is a BAM file. Next, I called variants using gstacks. This is not a necessity; once you have the BAM files you can use different variant calling methods.

    gstacks -I ./ -M ../historic_populations.txt -O ./
    

    Now the variant calling is done, we can produce a VCF file using Stacks Populations

    populations -P ./ -M ../historic_populations.txt --vcf
    

    Populations is nice because it gives you some summary info at the end of the variant calling process:

    Removed 0 loci that did not pass sample/population constraints from 412079 loci.
    Kept 412079 loci, composed of 28225204 sites; 20 of those sites were filtered, 243589 variant sites remained.

    27345713 genomic sites, of which 846899 were covered by multiple loci (3.1%).
    Mean genotyped sites per locus: 68.47bp (stderr 0.00).

    Population summary statistics (more detail in populations.sumstats_summary.tsv): aw: 11.272 samples per locus; pi: 0.17294; all/variant/polymorphic sites: 15515519/232371/142127; private alleles: 24046 hist: 4.8985 samples per locus; pi: 0.12483; all/variant/polymorphic sites: 18008520/91625/34830; private alleles: 13284 mv: 11.015 samples per locus; pi: 0.14832; all/variant/polymorphic sites: 19889591/226170/98305; private alleles: 4740 mw: 11.146 samples per locus; pi: 0.16776; all/variant/polymorphic sites: 15527851/232281/128802; private alleles: 13279 nc: 11.442 samples per locus; pi: 0.17365; all/variant/polymorphic sites: 15591084/235304/135530; private alleles: 16176 or: 11.45 samples per locus; pi: 0.16313; all/variant/polymorphic sites: 14915003/232899/117912; private alleles: 7942

    You can even do some preliminary filtering on Stacks:

    populations -P ./ -M ../historic_populations.txt --vcf -r 0.50 --min_maf 0.01 --write_random_snp -t 8
    

    Though I prefer the filtering options on VCFtools personally, as there are more options. The flag 'write_random_snp' or 'write-single-snp' is useful to run here through as it keep only one SNP per locus.

    Alternate mapping and variant calling process: Bowtie + GATK

    You will find that this process has very similar steps to the above BWA and Stacks processes. The overall workflow of SNP variant calling is usually quite similar.

    Create a Bowtie genome database:

    bowtie2-build -f Sturnus_vulgaris_2.3.1.simp.fasta Sturnus_vulgaris_2.3.1.simp
    

    You will also need to create a sequence dictionary for GATK.

    java -Xmx10g -jar /apps/picard/2.18.26/picard.jar CreateSequenceDictionary R=Sturnus_vulgaris_2.3.1.simp.fasta O=Sturnus_vulgaris_2.3.1.simp.dict
    samtools faidx Sturnus_vulgaris_2.3.1.simp.fasta
    

    The below covers a few steps: bowtie2 mapping; samtools processed a SAM into a sorted BAM file; picard marks duplicates; picard index; GATK haplotype caller. Each of these steps is run independently for each sample.

    
    for i in cat sample_names.txt;
    do
    echo working with $i;
    GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp
    READS=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/processing/samples_batch/merged
    bowtie2 -p 10 --phred33 --very-sensitive-local -x ${GENOME} -I 149 -X 900 --rg-id "$i" --rg SM:"$i" -U ${READS}/"$i".fq.gz -S "$i".sam
    samtools view -bS ${i}.sam | samtools sort -o ${i}.bam
    java -Xmx128g -jar /apps/picard/2.18.26/picard.jar MarkDuplicates INPUT="$i".bam OUTPUT="$i"_mark.bam METRICS_FILE="$i"_sort.metrics.txt MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
    java -Xmx48g -jar /apps/picard/2.18.26/picard.jar BuildBamIndex I="$i"_mark.bam
    gatk --java-options "-Xmx120G" HaplotypeCaller -R ${GENOME}.fasta -I ${i}_mark.bam -O ${i}.g.vcf -ERC GVCF --native-pair-hmm-threads 16;
    done
    

    Finally, we combine the individually called haplotypes into one VCF file.

    GENOME=/srv/scratch/z5188231/KStuart.Starling-Aug18/Sv4_Historic/genome/Sturnus_vulgaris_2.3.1.simp.fasta
    gatk --java-options "-Xmx120G" CombineGVCFs \
    --reference $GENOME \
    --variant SAMPLE_A.g.vcf \
    ...
    --variant SAMPLE_Z.g.vcf \
    --output historical_GATK_variantscombined.vcf
    

    And finally perform joint genotyping on the samples

    gatk --java-options "-Xmx120g" GenotypeGVCFs \
    -R $GENOME \
    -V historical_GATK_variantscombined.vcf \
    -O historical_GATK_variantsgenotyped.vcf
    

    GATK doesn't produce any summaries like populations, however we can make a quick count of the number of SNPs using the below:

    grep -v "#" historical_BWAmem_GATK_variantsgenotyped.vcf | wc -l
    

    Closing remarks

    Below is a summary of the number of variants I obtained at the end of the above pipeline, both without filtering and with some filtering. Note that the low number of SNPs called by the bowtie-GATK combo was not a result of mapping (bowtie produced high read mapping percentages). This leaves me to believe that GATK didn't do very well with the DArT-seq data, which is a bit strange as I have seen it used quite successfully on many reduced representation data sets. Most of my new projects work with WGS so I have not investigated this further, but will leave this result here in case another finds similar results in their own work and needs perepctive.

    ScreenShot

    About

    Scripts and notes related to the manuscript: Stuart KC et al. 2022. Historical museum samples enable the examination of divergent and parallel evolution during invasion. Molecular Ecology, 31(1): 1836-1852, doi.org/10.1111/mec.16353

    Topics

    Resources

    Stars

    Watchers

    Forks

    Releases

    No releases published

    Packages

    No packages published