denniswiersma/SGBS

SGBS

Snake Genotyping By Sequencing. A reimplementation of Fast-GBS using snakemake.

Usage

First, create a conda environment from the environment.yml file:

conda env create -f environment.yml

Make sure you have:

  • Your paired-end data as {sample_name}_1.fq.gz and {sample_name}_2.fq.gz.
  • A reference genome.
  • An adapter fasta file.
  • A barcodes fasta file.
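Before starting a run, it can help to confirm that every sample has both mate files. The following is a minimal sketch, not part of the pipeline: the function name find_incomplete_samples is hypothetical, and it only assumes the {sample_name}_1.fq.gz / {sample_name}_2.fq.gz naming described above.

```python
from pathlib import Path


def find_incomplete_samples(data_dir):
    """Return the names of samples that have only one of their two mate files.

    Hypothetical helper: it assumes files are named {sample_name}_1.fq.gz
    and {sample_name}_2.fq.gz, as listed in the requirements above.
    """
    suffix = ".fq.gz"
    mates = {}
    for path in Path(data_dir).glob("*_[12].fq.gz"):
        # "SRR2073085_1.fq.gz" -> sample "SRR2073085", mate "1"
        sample, mate = path.name[: -len(suffix)].rsplit("_", 1)
        mates.setdefault(sample, set()).add(mate)
    # A sample is complete only when both mate 1 and mate 2 were found.
    return sorted(s for s, found in mates.items() if found != {"1", "2"})
```

Calling find_incomplete_samples("resources/data") returns an empty list when every sample found in that directory is paired; any names it returns are missing one mate.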

Then run the pipeline using the following command:

snakemake -c10 --use-conda

Divergence from Fast-GBS

Gathering test data

The original paper, which can be found in this repository at /docs/Fast-GBSPaper.pdf, says the following about the data used to test the pipeline:
"To test the performance of Fast-GBS, we used existing sequence datasets for panels of 24 unrelated accessions / clones for three species covering a range of genomic situations: soybean [22], barley [Abed et al., unpublished], and potato [Bastien et al., unpublished]."

Reference [22] links to a paper which lists the data as available under study accession SRP059747 in the NCBI SRA database. However, querying the database with this accession number returns 324 results, far more than the 24 mentioned in the paper, which leaves open which runs to use. Luckily, searching PubMed Central for the same accession number gives four results, the first of which links to an xlsx file containing the relevant accession numbers. These are:

SRR2073085
SRR2073084
SRR2073083
SRR2073082
SRR2073081
SRR2073080
SRR2073079
SRR2073078
SRR2073077
SRR2073076
SRR2073075
SRR2073074
SRR2073073
SRR2073072
SRR2073071
SRR2073070
SRR2073069
SRR2073068
SRR2073067
SRR2073066
SRR2073065
SRR2073064
SRR2073063

This data can be downloaded with the NCBI SRA Toolkit, for example by running fasterq-dump SRR2073085. Alternatively, the data can first be downloaded as SRR2073085.sra with prefetch SRR2073085, which stores it in the directory configured during the SRA Toolkit installation, and then converted to fastq format with fasterq-dump SRR2073085. The resulting files should be placed in the resources/data/ directory, although this location can be changed in the config.yaml file. They should be in fq.gz format; fasterq-dump writes uncompressed fastq files, so these need to be compressed with gzip and named as described under Usage.
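The download steps can be scripted for all listed runs at once. The sketch below is an illustration under stated assumptions rather than part of the pipeline: the helper names are hypothetical, the accession list is rebuilt from the consecutive numbers above (SRR2073063 to SRR2073085), and it assumes fasterq-dump writes paired reads as {accession}_1.fastq and {accession}_2.fastq when called with --split-files.

```python
import subprocess


def accession_range(first, last, prefix="SRR"):
    """Return consecutive run accessions, e.g. SRR2073063 ... SRR2073085."""
    return [f"{prefix}{number}" for number in range(first, last + 1)]


def download_commands(accession, outdir="resources/data"):
    """Build the commands that fetch one run and name it {sample}_N.fq.gz."""
    commands = [
        ["prefetch", accession],
        ["fasterq-dump", "--split-files", "--outdir", outdir, accession],
    ]
    for mate in ("1", "2"):
        # Assumed output name of fasterq-dump for paired-end runs.
        fastq = f"{outdir}/{accession}_{mate}.fastq"
        commands.append(["gzip", fastq])
        commands.append(["mv", f"{fastq}.gz", f"{outdir}/{accession}_{mate}.fq.gz"])
    return commands


def download_all(accessions, outdir="resources/data"):
    """Run the commands for every accession, stopping at the first failure."""
    for accession in accessions:
        for command in download_commands(accession, outdir):
            subprocess.run(command, check=True)
```

With the SRA Toolkit on the PATH, download_all(accession_range(2073063, 2073085)) would fetch the runs listed above. Building the commands separately from running them makes it possible to print and inspect them first.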

Additionally, a reference genome is needed. For this, a reference genome for Glycine max (soybean) was used, which can be downloaded from here. The reference genome should be placed in the resources directory, although this can be changed in the config.yaml file.

Last but not least, adapters and barcodes should be provided in fasta format. The path to these files can again be set in the config.yaml file.

Dependencies

DAG

DAG of pipeline
