SGBS

Snake Genotyping By Sequencing. A reimplementation of Fast-GBS using Snakemake.

Usage

First make a conda environment using the environment.yml file:
conda env create -f environment.yml

Make sure you have:

Your paired-end data as {sample_name}_1.fq.gz and {sample_name}_2.fq.gz.
A reference genome.
An adapter fasta file.
A barcodes fasta file.
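
Before starting a run, it can help to confirm that the read files follow the naming convention listed above. The following is a minimal sketch, not part of the pipeline itself: the function name find_paired_samples is hypothetical, and the resources/data path is assumed from the default location described under "Gathering test data".

```python
import re
from pathlib import Path

# Matches the naming convention listed above:
# {sample_name}_1.fq.gz and {sample_name}_2.fq.gz
READ_PATTERN = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fq\.gz$")


def find_paired_samples(data_dir):
    """Return (complete, incomplete) sample names found in data_dir.

    complete:   sorted names that have both a _1 and a _2 file
    incomplete: sorted names that are missing one of the two mates
    """
    mates = {}
    for path in Path(data_dir).iterdir():
        match = READ_PATTERN.match(path.name)
        if match:
            mates.setdefault(match["sample"], set()).add(match["mate"])
    complete = sorted(s for s, m in mates.items() if m == {"1", "2"})
    incomplete = sorted(s for s, m in mates.items() if m != {"1", "2"})
    return complete, incomplete


if __name__ == "__main__":
    # Assumed default data location; adjust to match your config.yaml.
    data_dir = Path("resources/data")
    if data_dir.is_dir():
        complete, incomplete = find_paired_samples(data_dir)
        print(f"{len(complete)} complete pairs; missing a mate: {incomplete}")
```

Files that do not match the pattern are ignored, so reference genomes or notes stored in the same directory do not trigger false warnings.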

Then run the pipeline using the following command:

snakemake -c10 --use-conda

Divergence from Fast-GBS

Uses Flexbar instead of Sabre for demultiplexing.
Uses Flexbar instead of Cutadapt for adapter trimming.
Uses GATK HaplotypeCaller instead of Platypus for variant calling.

Gathering test data

The original paper, which can be found in this repository at /docs/Fast-GBSPaper.pdf, says the following about the data used to test the pipeline:
"To test the performance of Fast-GBS, we used existing sequence datasets for panels of 24 unrelated accessions / clones for three species covering a range of genomic situations: soybean [22], barley [Abed et al., unpublished], and potato [Bastien et al., unpublished]."

Reference [22] links to a paper which lists the data as available under study accession SRP059747 in the NCBI SRA database. However, querying the database with this accession number returns 324 results, far more than the 24 mentioned in the paper, which leaves open which runs to use. Luckily, the same accession number returns four results on PubMed Central, the first of which links to an xlsx file containing the relevant accession numbers. These are:

SRR2073085
SRR2073084
SRR2073083
SRR2073082
SRR2073081
SRR2073080
SRR2073079
SRR2073078
SRR2073077
SRR2073076
SRR2073075
SRR2073074
SRR2073073
SRR2073072
SRR2073071
SRR2073070
SRR2073069
SRR2073068
SRR2073067
SRR2073066
SRR2073065
SRR2073064
SRR2073063
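
Because the accessions listed above are consecutive, they can be generated rather than typed out. The sketch below writes them to a text file, one per line; the function name and the accessions.txt file name are hypothetical. To my knowledge, the SRA Toolkit's prefetch can read such a file through its --option-file option, but check the toolkit documentation for your installed version.

```python
def soybean_accessions():
    """Return the run accessions listed above, SRR2073085 down to SRR2073063."""
    return [f"SRR{number}" for number in range(2073085, 2073062, -1)]


def write_accession_list(path):
    """Write one accession per line to path and return the list."""
    accessions = soybean_accessions()
    with open(path, "w") as handle:
        handle.write("\n".join(accessions) + "\n")
    return accessions


if __name__ == "__main__":
    # Hypothetical output file name; any path works.
    write_accession_list("accessions.txt")
```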

This data can be downloaded with the NCBI SRA Toolkit, for example by running fasterq-dump SRR2073085. Alternatively, first download the data as SRR2073085.sra using prefetch SRR2073085, which stores it in the directory configured during the SRA Toolkit installation, and then convert it to fastq format using fasterq-dump SRR2073085. Note that fasterq-dump writes uncompressed fastq files, so these still need to be gzipped and named in the fq.gz format the pipeline expects. Place the resulting files in the resources/data/ directory, although this location can be changed in the config.yaml file.
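
The download and conversion steps described above could be scripted roughly as follows. This is a sketch under a few assumptions: fasterq-dump is on the PATH, its --outdir option is used to choose the output directory, and the helper names are hypothetical. The exact file names fasterq-dump produces depend on the library layout of each run, so the glob below simply picks up every .fastq file that starts with the accession.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def fasterq_dump_command(accession, out_dir):
    """Build the fasterq-dump call for one accession (not executed here)."""
    return ["fasterq-dump", accession, "--outdir", str(out_dir)]


def compress_to_fq_gz(fastq_path):
    """Gzip NAME.fastq into NAME.fq.gz and remove the uncompressed file."""
    fastq_path = Path(fastq_path)
    target = fastq_path.with_suffix(".fq.gz")
    with open(fastq_path, "rb") as source, gzip.open(target, "wb") as sink:
        shutil.copyfileobj(source, sink)
    fastq_path.unlink()
    return target


def download(accession, out_dir="resources/data"):
    """Download one accession and convert its reads to .fq.gz files."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Raises CalledProcessError if fasterq-dump exits with an error.
    subprocess.run(fasterq_dump_command(accession, out_dir), check=True)
    fastq_files = sorted(out_dir.glob(f"{accession}*.fastq"))
    return [compress_to_fq_gz(path) for path in fastq_files]
```

Calling download("SRR2073085") would then leave the compressed reads in resources/data/, assuming the default data directory has not been changed in config.yaml.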

Additionally, a reference genome is needed. For this, a reference genome for Glycine max (soybean) was used, which can be downloaded from here. Place the reference genome in the resources directory, although this location can be changed in the config.yaml file.

Last but not least, adapters and barcodes should be provided in fasta format. The path to these files can again be set in the config.yaml file.
