
5. Configuration File

d-j-e edited this page Mar 14, 2016 · 16 revisions

The Configuration File:

At the heart of any RedDog pipeline run is the RedDog_config file.
This file contains all the inputs, options and commands for the pipeline.

The four most important inputs are the reference file, sequences to be mapped, the output folder for the run, and, if a merge run is being executed, the name of the folder with the output from the prior run. During a merge run, the results of mapping any new reads will eventually be merged into this prior output folder, updating the results with the expanded data set.

'''
Configuration file for RedDog.py V1beta.9
-------------------------------
Essential pipeline variables.
'''
reference = "/full/path/to/folder/NC_007384_with_plasmid.gbk"
sequences = "/full/path/to/reads/fastq_folder/*.fastq.gz"
output = "/full/path/to/folder/RedDog_output/<ref>_<date>/"
out_merge_target = ""

Essential pipeline variables:

"reference"

The reference can be in Genbank or fasta format and can contain one or more replicons. A Genbank reference is converted to a fasta version for mapping. If a Genbank file is not given, the gene cover and depth matrices for genes will not be generated, and neither will the SNP consequences. If you don't have the Genbank record, or don't want these outputs to be generated, use a fasta format reference instead. You don't need to tell the pipeline which type of reference you have; it will work it out.
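To check for yourself which format a reference file is in before a run, a minimal sketch is to inspect its first non-blank line: fasta records begin with '>' and Genbank flat files begin with 'LOCUS'. This is not necessarily how RedDog performs its own detection, and the function name `guess_reference_format` is hypothetical.

```python
def guess_reference_format(path):
    """Guess the reference format from the first non-blank line.

    Illustrative only: RedDog's own detection may work differently.
    """
    with open(path) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("LOCUS"):
                return "genbank"
            return "unknown"
    return "unknown"  # empty file
```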

Note: if you download a fasta file from Genbank, you will need to simplify the header for each replicon in the reference before running the pipeline. Also, each replicon is required to have a unique name.
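The header clean-up described in this note could be sketched as follows. The sketch assumes that "simplify" means keeping only the first whitespace-delimited token of each header line; this page does not define the exact form RedDog expects, so treat that as an assumption. The function name `simplify_fasta_headers` and the replicon names in the example are hypothetical.

```python
def simplify_fasta_headers(lines):
    """Reduce each FASTA header to its first token and check uniqueness.

    Assumption: a 'simplified' header is '>name' with no description text.
    Raises ValueError if a header is empty or a replicon name is repeated.
    """
    seen = set()
    simplified = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("header line has no replicon name")
            name = fields[0]
            if name in seen:
                raise ValueError("duplicate replicon name: " + name)
            seen.add(name)
            simplified.append(">" + name)
        else:
            simplified.append(line)  # sequence lines pass through unchanged
    return simplified


# Hypothetical two-replicon reference
example = [
    ">chrom_example some long description, complete genome",
    "ACGT",
    ">plasmid_example some long description, plasmid",
    "TTGA",
]
simplified = simplify_fasta_headers(example)
```

Raising an error on a repeated name, rather than silently renaming it, keeps the unique-name requirement visible before the pipeline starts.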

"sequences"

Sequence reads to be mapped; you can point to the folder and the set of files to be mapped with a wildcard qualifier:

sequences = '/full/path/to/reads/*.fastq.gz'

You can also combine sequences from different folders into the same run:

sequences = ['/full/path/to/reads/set1/*.fastq.gz', '/full/path/to/reads/set2/*.fastq.gz']

Note that the pipeline expects the reads to be in fastq format and stored as gzip files. The pipeline can take reads from Illumina platforms, as either single end or paired end reads, or Ion Torrent sequences, though Illumina paired end sequences are the default setting (see 'readType' below for more details). Each read set is required to have a unique name.
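Before starting a run, it can be useful to confirm which files a 'sequences' setting actually matches and whether any names clash across folders. The sketch below uses Python's standard `glob` module and accepts either form of the setting (a single string or a list). It checks uniqueness by file name only, which is an assumption: this page does not say how RedDog derives the read set name from a file name. Both function names are hypothetical.

```python
import glob
import os
from collections import Counter


def expand_sequences(sequences):
    """Return the files matched by a 'sequences' value (string or list)."""
    patterns = [sequences] if isinstance(sequences, str) else list(sequences)
    files = []
    for pattern in patterns:
        files.extend(sorted(glob.glob(pattern)))
    return files


def duplicate_file_names(files):
    """Return file names that occur more than once across all folders."""
    counts = Counter(os.path.basename(path) for path in files)
    return sorted(name for name, count in counts.items() if count > 1)
```

An empty result from `expand_sequences` usually points to a typo in the path or wildcard, and a non-empty result from `duplicate_file_names` flags read sets that would need renaming before being combined in one run.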

"output"

Full path name for all output folders, including a final '/' (though the pipeline will add this if missing). Note that this folder should not exist at the start of any run (new or merge).
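The two rules for 'output' (a trailing '/', and a folder that does not yet exist) can be checked before launching the pipeline. This is a minimal sketch of such a pre-flight check, not RedDog's own code; the function name `check_output_path` is hypothetical.

```python
import os


def check_output_path(output):
    """Return 'output' with a trailing '/', or raise if it already exists."""
    if not output.endswith("/"):
        output += "/"  # the pipeline would add this anyway
    if os.path.exists(output):
        raise FileExistsError("output folder already exists: " + output)
    return output
```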

"out_merge_target"

When running a 'new' analysis, set this to an empty string (""). Otherwise, set it to the folder you want to merge with. You can only merge a prior run with a new run (not two prior runs).

This 'merge target' folder must have the bams and indexes in one sub-folder (./bam) and the vcfs in another (./vcf). There also must be a 'sequence_list.txt' file, and preferably (but not necessarily) a 'run_report.txt' file - if a previous run_report.txt is available, this information will be used to ensure continuity in settings between merge runs. (See Outputs for more information)
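The layout requirements above can be verified with a short script before a merge run. This sketch checks only the items named in this section (the 'bam' and 'vcf' sub-folders, 'sequence_list.txt', and the optional 'run_report.txt'); it does not inspect the contents of any file, and the function name `check_merge_target` is hypothetical.

```python
import os


def check_merge_target(folder):
    """Check a merge target folder against the layout described above.

    Returns (problems, has_run_report): a list of missing required items,
    and whether the optional run_report.txt is present.
    """
    problems = []
    for sub in ("bam", "vcf"):
        if not os.path.isdir(os.path.join(folder, sub)):
            problems.append("missing sub-folder: " + sub)
    if not os.path.isfile(os.path.join(folder, "sequence_list.txt")):
        problems.append("missing file: sequence_list.txt")
    has_run_report = os.path.isfile(os.path.join(folder, "run_report.txt"))
    return problems, has_run_report
```

An empty `problems` list means the required items are in place; `has_run_report` being false is not an error, but the settings from the prior run will then not be available for the continuity check.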

The 'output' folder (see above) for a merge run should NOT exist prior to the run, and will be deleted at completion of the pipeline.
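For comparison with the 'new' run settings shown at the top of this page, the essential variables for a merge run might look like the fragment below. All paths are placeholders, and the folder names 'new_fastq_folder', 'merge_tmp' and 'prior_run' are hypothetical.

```python
# Essential pipeline variables for a merge run (placeholder paths)
reference = "/full/path/to/folder/NC_007384_with_plasmid.gbk"
# Only the new read sets to be mapped and merged
sequences = "/full/path/to/reads/new_fastq_folder/*.fastq.gz"
# Must not exist before the run; deleted when the pipeline completes
output = "/full/path/to/folder/RedDog_output/merge_tmp/"
# Output folder of the prior run; the merged results end up here
out_merge_target = "/full/path/to/folder/RedDog_output/prior_run/"
```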
