Grelot/reserveBenefit--snpsdata_analysis

Code for the paper "Genomic resources for Mediterranean fishes"


Pierre-Edouard Guerin, Stephanie Manel

Montpellier, 2017-2019

Submitted to Molecular Ecology Resources, 2019


Prerequisites

Software

Singularity container

See https://www.sylabs.io/docs/ for instructions to install Singularity.

Download the container

singularity pull --name snpsdata_analysis.simg shub://Grelot/reserveBenefit--snpsdata_analysis:snpsdata_analysis

Run the container

singularity run snpsdata_analysis.simg

Data files

We work on three species: Mullus surmuletus, Diplodus sargus and Serranus cabrilla. In the file names below, species is a wildcard that stands for any of these three species.

  • genome assembly (.fasta)

  • SNP data from RADseq (.vcf)

Filtering SNPs

Only one randomly selected SNP was retained per locus, and a locus was retained only if it was present in at least 85% of individuals. Individuals with excess coverage depth (>1,000,000x) or with >30% missing data were filtered out. We kept loci with an observed heterozygosity of at most 0.6.
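
To make the thresholds above concrete, here is a minimal Python sketch of the locus-level and individual-level checks. It is not the code used in this repository: the genotype encoding (0, 1 or 2 copies of the alternate allele, None for missing) and all function names are assumptions made for illustration, and the sketch leaves out the random choice of one SNP per locus and the depth filter.

```python
from typing import List, Optional

# Hypothetical encoding: 0, 1 or 2 copies of the alternate allele, None = missing.
Genotype = Optional[int]


def call_rate(genotypes: List[Genotype]) -> float:
    """Fraction of individuals with a called genotype at one locus."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def observed_heterozygosity(genotypes: List[Genotype]) -> float:
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return 0.0
    return sum(g == 1 for g in called) / len(called)


def keep_locus(genotypes: List[Genotype],
               min_call_rate: float = 0.85,
               max_het: float = 0.6) -> bool:
    """Locus present in >= 85% of individuals and heterozygosity <= 0.6."""
    return (call_rate(genotypes) >= min_call_rate
            and observed_heterozygosity(genotypes) <= max_het)


def keep_individual(matrix: List[List[Genotype]], individual: int,
                    max_missing: float = 0.30) -> bool:
    """matrix[locus][individual]; reject individuals with > 30% missing loci."""
    missing = sum(row[individual] is None for row in matrix)
    return missing / len(matrix) <= max_missing
```

The order in which the two kinds of filter are applied changes the result (removing individuals changes each locus call rate), and the text does not state it, so the sketch keeps them as separate functions.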

Filtering steps (IBD paper)

  1. Remove loci with inbreeding coefficient Fis > 0.5 or < -0.5
  2. Keep all pairs of loci that are closer than 5000 bp
  3. Keep pairs of loci with linkage disequilibrium (r²) > 0.8
  4. Keep SNPs with a minimum minor allele frequency (MAF) of 1%
  5. Remove loci that deviate significantly (p-value < 0.01) from the Hardy-Weinberg genotype frequencies expected under random mating
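
The per-locus statistics behind steps 1, 4 and 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, Fis is computed as 1 - Ho/He with He = 2pq, and the Hardy-Weinberg test is a plain chi-square test with one degree of freedom, swapped in here for brevity (the actual pipeline may well use an exact test).

```python
import math


def alt_allele_frequency(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Frequency of the alternate allele from the three genotype counts."""
    n = n_hom_ref + n_het + n_hom_alt
    return (2 * n_hom_alt + n_het) / (2 * n)


def fis(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Fis = 1 - Ho/He; NaN for a monomorphic locus (He = 0)."""
    n = n_hom_ref + n_het + n_hom_alt
    q = alt_allele_frequency(n_hom_ref, n_het, n_hom_alt)
    he = 2 * q * (1 - q)
    return math.nan if he == 0 else 1 - (n_het / n) / he


def hwe_chi2_pvalue(n_hom_ref: int, n_het: int, n_hom_alt: int) -> float:
    """Chi-square goodness-of-fit p-value (1 df) against HWE proportions."""
    n = n_hom_ref + n_het + n_hom_alt
    q = alt_allele_frequency(n_hom_ref, n_het, n_hom_alt)
    expected = (n * (1 - q) ** 2, 2 * n * q * (1 - q), n * q ** 2)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


def passes_locus_filters(n_hom_ref: int, n_het: int, n_hom_alt: int,
                         max_abs_fis: float = 0.5, min_maf: float = 0.01,
                         hwe_alpha: float = 0.01) -> bool:
    """True if the locus survives the Fis, MAF and HWE thresholds."""
    q = alt_allele_frequency(n_hom_ref, n_het, n_hom_alt)
    if min(q, 1 - q) < min_maf:
        return False
    f = fis(n_hom_ref, n_het, n_hom_alt)
    if math.isnan(f) or abs(f) > max_abs_fis:
        return False
    return hwe_chi2_pvalue(n_hom_ref, n_het, n_hom_alt) >= hwe_alpha
```

For a single locus the outcome does not depend on the order of the three checks; the order only matters for the SNP counts reported after each step.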

Filtering steps (genome paper)

  1. Keep all pairs of loci that are closer than 5000 bp
  2. Keep pairs of loci with linkage disequilibrium (r²) > 0.8
  3. Keep SNPs with a minimum minor allele frequency (MAF) of 1%
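
As an illustration of the two pairwise steps (distance below 5000 bp, then r² above 0.8), the following Python sketch lists SNP pairs that satisfy both conditions. It rests on assumptions that are not confirmed by this README: r² is taken to be the squared Pearson correlation between genotypes coded 0, 1, 2 (None for missing), and the function names and data layout are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Snp = Tuple[str, int]  # (scaffold, position)


def genotype_r2(a: Sequence[Optional[int]], b: Sequence[Optional[int]]) -> float:
    """Squared Pearson correlation between two genotype vectors."""
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        return math.nan
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        return math.nan  # r2 is undefined when one SNP does not vary
    return cov * cov / (var_x * var_y)


def close_pairs(snps: List[Snp], max_dist: int = 5000) -> List[Tuple[int, int]]:
    """Index pairs of SNPs on the same scaffold and closer than max_dist bp."""
    order = sorted(range(len(snps)), key=lambda i: snps[i])
    pairs = []
    for k, i in enumerate(order):
        for j in order[k + 1:]:
            same_scaffold = snps[j][0] == snps[i][0]
            if not same_scaffold or snps[j][1] - snps[i][1] >= max_dist:
                break  # sorted order: every later SNP is farther away
            pairs.append((i, j))
    return pairs


def linked_pairs(snps: List[Snp], genotypes: List[Sequence[Optional[int]]],
                 max_dist: int = 5000, min_r2: float = 0.8) -> List[Tuple[int, int]]:
    """Close pairs whose r2 exceeds min_r2 (NaN never passes)."""
    return [(i, j) for i, j in close_pairs(snps, max_dist)
            if genotype_r2(genotypes[i], genotypes[j]) > min_r2]
```

Sorting by (scaffold, position) lets the inner loop stop at the first SNP that is too far away, which keeps the pair search cheap on long scaffolds.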

INPUTS:

OUTPUTS:

  • species.lmiss: table of the number of missing individuals per locus
  • species.imiss: table of the number of missing loci per individual
  • species.idepth: table of the mean depth of coverage per individual
  • species.geno.ld: table of linkage disequilibrium (r²) values
  • species.snps.fisloc_rm.vcf
  • species.fisloc_rm.ld_5000.log
  • species.fisloc_rm.ld_5000.recode.vcf
  • species.fisloc_rm.ld_5000.r2.recode.vcf
  • species.fisloc_rm.ld_5000.r2.maf001.recode.vcf
  • species.fisloc_rm.ld_5000.r2.maf001.hwe.recode.vcf: final filtered SNPs
  • speciesfiltering_count_snps_report.tsv: number of SNPs at each filtering step

cd filter_vcf
bash filter_vcf.sh

Description of SNPs on the genome

Generate tables

  1. Split the genome into windows of 400 kbp.
  2. Count the number of SNPs located in each genome window.
  3. Count the number of reads for each SNP in each individual.

INPUTS:

  • species.fasta: genome fasta file of species
  • species.vcf: SNPs from radseq data of species
  • species.gff3: coordinates and related information for the coding-region annotation of the genome of species

OUTPUTS:

  • speciescoverage.bed: a table with one row per 400,000 bp window of the genome of species, giving its genome coordinates (scaffold, start position, end position) and its coverage (number of SNPs)
  • speciesmeandepth.bed: a table with one row per SNP, giving its genome window, its coordinates (scaffold, start position, end position) and its depth of coverage (number of reads) in each individual
  • speciescoords.snps.bed: coordinates (scaffold, position) of the SNPs on the genome
  • speciescoding.snps.bed: SNPs located in coding regions

bash snpsontothegenome/command.sh
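
The windowing and counting steps can be pictured with the short Python sketch below. It is not a transcription of command.sh, whose contents are not shown here: the function names and the input layout (a dict of scaffold lengths, a list of SNP coordinates) are hypothetical. Windows are written BED-style (0-based start, end excluded), while SNP positions are taken as 1-based, as in a VCF file.

```python
from collections import Counter
from typing import Dict, List, Tuple

WINDOW = 400_000  # window size in bp, as stated in the steps above


def make_windows(scaffold_lengths: Dict[str, int],
                 window: int = WINDOW) -> List[Tuple[str, int, int]]:
    """Cut every scaffold into consecutive (scaffold, start, end) windows."""
    windows = []
    for name, length in scaffold_lengths.items():
        for start in range(0, length, window):
            # The last window of a scaffold is truncated at the scaffold end.
            windows.append((name, start, min(start + window, length)))
    return windows


def count_snps_per_window(windows: List[Tuple[str, int, int]],
                          snps: List[Tuple[str, int]],
                          window: int = WINDOW) -> List[Tuple[str, int, int, int]]:
    """Append to each window the number of SNPs it contains."""
    # Convert the 1-based SNP position to the 0-based start of its window.
    counts = Counter((scaffold, (pos - 1) // window * window)
                     for scaffold, pos in snps)
    return [(scaffold, start, end, counts.get((scaffold, start), 0))
            for scaffold, start, end in windows]
```

Windows without any SNP are kept with a count of 0, so the resulting table has one row per window rather than one row per occupied window.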

Build the figure

Rscript snpsontothegenome/figure_cover_genome.R

Average distance between SNPs loci

INPUTS:

  • speciescoords.snps.bed: coordinates (scaffold, position) of the SNPs on the genome

Rscript snpsontothegenome/average_distance_loci.R
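
For readers who want to see what such a computation involves, here is a minimal Python sketch of the distance between consecutive SNP loci and its summary statistics. It is an assumption-laden illustration, not a port of average_distance_loci.R: distances are taken only between neighbouring SNPs on the same scaffold, the standard deviation is the sample standard deviation, and the function names are hypothetical.

```python
import statistics
from collections import defaultdict
from typing import Dict, List, Tuple


def consecutive_distances(snps: List[Tuple[str, int]]) -> List[int]:
    """Distances (bp) between neighbouring SNPs, scaffold by scaffold."""
    by_scaffold = defaultdict(list)
    for scaffold, pos in snps:
        by_scaffold[scaffold].append(pos)
    distances = []
    for positions in by_scaffold.values():
        positions.sort()
        # Pair each position with the next one on the same scaffold.
        distances.extend(b - a for a, b in zip(positions, positions[1:]))
    return distances


def summarize(distances: List[int]) -> Dict[str, float]:
    """Mean, median, sample sd, max and min of the distances."""
    return {
        "mean": statistics.mean(distances),
        "median": statistics.median(distances),
        "sd": statistics.stdev(distances),  # needs at least two distances
        "max": max(distances),
        "min": min(distances),
    }
```

A scaffold carrying a single SNP contributes no distance, and no distance is ever measured across two scaffolds.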

SNPs located/not in coding regions

Count the number of lines in the file speciescoding.snps.bed (each line is a SNP located in a coding region).
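
How a SNP ends up flagged as coding can be sketched as an overlap test between SNP positions and annotated features. The Python example below is only an illustration: the repository's own procedure is not shown in this README, the function names are hypothetical, and the choice of "CDS" as the feature type is an assumption (exons or introns could be selected the same way). It relies only on the standard GFF3 layout: nine tab-separated columns, with the feature type in column 3 and 1-based inclusive start and end in columns 4 and 5.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def parse_gff3_features(lines: Iterable[str],
                        feature_type: str = "CDS") -> List[Tuple[str, int, int]]:
    """Return (scaffold, start, end) for every feature of the given type."""
    features = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != feature_type:
            continue
        features.append((fields[0], int(fields[3]), int(fields[4])))
    return features


def snps_in_features(snps: List[Tuple[str, int]],
                     features: List[Tuple[str, int, int]]) -> List[Tuple[str, int]]:
    """Keep the SNPs whose 1-based position falls inside at least one feature."""
    by_scaffold = defaultdict(list)
    for scaffold, start, end in features:
        by_scaffold[scaffold].append((start, end))
    return [(scaffold, pos) for scaffold, pos in snps
            if any(start <= pos <= end for start, end in by_scaffold[scaffold])]
```

The number of coding SNPs is then the length of the list returned by snps_in_features, which corresponds to the line count described above.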

SNPs located/not in mitochondrial regions

Count the SNPs annotated as "mitochondrial" by Augustus.

species   number of SNPs located in mitochondrial regions
diplodus  173
mullus    178
serran    226

Results

Distance between consecutive SNP loci (in bp) for each species:

species   mean              median  sd                max     min
diplodus  35388.9078430345  23751   34996.9143024498  459616  5000
mullus    30716.8684498214  20930   29189.8335674228  384550  5002
serran    28239.7585528699  19084   27013.2843728281  403508  733
  • summary_snps.csv: number of SNPs, average distance between consecutive loci (in bp), number of SNPs located in a coding region and number of SNPs located in a mitochondrial region for each species

species   number_snps  average_distance_bp  number_coding_snps  number_mitochondrial_snps
diplodus  20074        35389                11978               173
mullus    15710        30717                10304               178
serranus  21101        28240                13107               226

More details about the location of the coding SNPs (CDS, exon, intron) are given in the table count_snps_annotation.csv.