BINF_toolkit

This directory consists of scripts developed by Yu Wan for routine bioinformatic analysis.

A list of scripts

add_sample_name_FASTA.py
downloadSeqFromNCBI.py
extractNuclRegionFromFASTA.py
gbk2tbl.py
gbk2tsv.py
gc.py
extractSeqFromGBK.py
parse_ENA_sampleInfo_XML.py
run_CutAdapt.py
filename_generator.py

Manual

add_sample_name_FASTA.py

This script appends a sample name at the beginning of each sequence in a FASTA file. For example, the header ">g1 description" becomes ">sample1__g1 description" after running this script.

Command example: python add_sample_name_FASTA.py -i filename.txt (or filename.fna) -o output_dir -n

downloadSeqFromNCBI.py

This script takes as input a list of NCBI accession numbers (one for each line) from the STDIN and downloads corresponding entries (either GenBank files or FASTA files) under the target directory.

Examples

python downloadSeqFromNCBI.py --records "file:objects.txt" --format fasta --email xxx@xxx.com --suffix fna --outdir ./ref --skip > download.log  

python downloadSeqFromNCBI.py --records "NC_0001,NC_0002" --format genbank --email xxx@xxx.com --suffix gbk --outdir ./ref --skip > download.log

Type python downloadSeqFromNCBI.py -h or --help for help information.

Notes about options and option arguments

--records: can be either a file (must contain a suffix of ".txt") listing targets to be downloaded, or a string of accession IDs separated by commas (no space is allowed).
--format or -f: the format of files to be downloaded
--suffix or -s: the file extension, can be "fasta" (default), "fna", "gb", or "gbk". No dot preceding the extension is needed.
--outdir or -o: output directory, no backslash at the end.

An example of the input list: seq_list.txt. Note that accession IDs may not include version numbers such as ".1" (HG326223.1, CP011642).

References

This script is inspired by Mark Schultz's (dr.mark.schultz@gmail.com, GitHub: schultzm) script "downloadGenbankByAccessions.py".
A post on the BioStars forum.
Damien Farrell's blog post: Retrieving genome assemblies via Entrez with Python.

extractNuclRegionFromFASTA.py

This script extracts a region of nucleotides by positions from a fasta file.

Arguments
-i: the path of the input file
-n: the name of your selected contig
-f: feature name specified by the user
-s: the first nucleotide to be selected
-e: the last nucelotide to be selected
-o: the filename of the output

Requirements

Only one genomic region should be selected;
the start and end positions should not spill out.

gbk2tbl.py

This script converts a GenBank file (.gbk or .gb) from Stdin into a Sequin feature table (.tbl), which is an input file of tbl2asn used for creating an ASN.1 file (.sqn).

Package requirement: BioPython and argparse

Usage

python gbk2tbl.py --mincontigsize 200 --prefix any_prefix --modifiers modifier_file.txt < annotation.gbk 2> stderr.txt

Note that this script reads the GenBank file through the stdin ("< annotation.gbk") and you may want to redirect the stderr to a file via "> stderr.txt" (redirection).

Inputs
A GenBank file, which ought to be passed to the script through the standard input (stdin).

A modifier file: a plain text file containing modifiers for every FASTA definition line.

All modifiers must be written in a single line and are separated by a single space character.
No space should be placed besides the '=' sign. Check NCBI help for choosing a proper format for modifiers.
For example: a line "[organism=Serratia marcescens subsp. marcescens] [sub-species=marcescens] [strain=AH0650_Sm1] [topology=linear] [moltype=DNA] [tech=wgs] [gcode=11] [country=Australia] [isolation-source=sputum]" will be copied and printed along with the record name as the definition line of every contig sequence.

Outputs

any_prefix.tbl: the Sequin feature table
any_prefix.fsa: the corresponding fasta file
These files are inputs for tbl2asn which generates ASN.1 files (*.sqn).

Arguments

--mincontigsize: the minimum contig size, default = 200 in accordance with NCBI's regulation
--prefix: the prefix of output filenames, default = 'seq'
--modifiers: the filename of the modifier file, default = 'modifiers.txt'

Demonstration A test data set for this script is provided in the directory example. This data set is composed of a compressed GenBank file NJST258_1__CP006923.gbk.gz and a modifier file gbk2tbl_modifiers.txt. Users can run the following command line to produce a TBL file as well as a FASTA file:

zcat ./example/NJST258_1__CP006923.gbk.gz | python gbk2tbl.py --mincontigsize 200 --prefix Kp --modifiers gbk2tbl_modifiers.txt

gbk2tsv.py

This script converts one or multiple GenBank files into tab-delimited feature tables (plain text), which can be imported to Excel or R afterwards.

Relevant blog post.

gc.py

This program calculates the length, GC content, and entropy for each record in a multi-fasta file.

Input: a fasta file which contains multiple sequences from the standard input

Output: for each sequence, the script prints: the header line, total sequence length, (G+C)% and entropy of the input sequence.

Command line:

python gc.py < filename.fasta

Treatment of the extended alphabet in this script:

consider all of 15 characters
construct a weighted-count table using dictionary
for each character in the table, take the probability of being A, G, C or T as effective counts
counts for A, G, C and T is computed by adding up the vectors for every character read from the sequence.

extractSeqFromGBK.py

This script extracts gene sequences from a GenBank file, in accordance with a list of (locus_tag, feature type) tuples.

Required module: Bio, argparse, csv

Usage

python extractSeqFromGBK.py --tags locus_tag.tsv --gb demo.gbk > genes.fna

Inputs

A GenBank file.
A text file listing selected locus_tags in the following format: locus_tag"\t"feature_type. This file MUST use ASCII codes because the module csv/2.3 does not support Unicode inputs.
Allowed feature types are: CDS, tRNA, rRNA and tmRNA. For example:
SMDB11_RS00910 rRNA
SMDB11_RS21915 rRNA
SMDB11_RS00015 CDS

Warnings

Although it is unlikely in a GenBank file, but please always ensure that there is no duplication of locus_tags in the table because this script treats locus_tag"s as keys for retrieving feature types.
An "IndexError: list index out of range" will arise if the tag list uses Unicode codes.

parse_ENA_sampleInfo_XML.py

This script parses an ENA metadata file in XML format and prints a subset of information.

Usage

python parse_ENA_sampleInfo_XML.py ERP000909.xml > samples.txt

Input: an XML file exported for a list of ERS accession numbers from ENA using the REST URLs API. For example, one can download an XML file for sample ERS086023 using the link http://www.ebi.ac.uk/ena/data/view/ERS086023&display=xml.

Outputs

tab-delimited text file containing information retrieved from the XML file.
study_accession, sample_accession, secondary_sample_accession, experiment_accession, run_accession, Isolate_ID, Host, Place_of_isolation, Year_of_isolation

run_CutAdapt.py

This script runs CutAdapt for a list of paired-end readsets.

Dependency: slurm on a computational cluster (Linux OS)

filename_generator.py

This script generates a list of file names based on a list of strings. It is useful if you want to generate a list of file names for read sets from a list of bacterial strain names.

Usage

python filename_generator.py -i input_file -o output_file -p prefix -s suffix -f from -l to -pe

Input: a plain-text file consists of a list of filenames

Example input files: (inlist.txt)
sample1__genes__results.txt
sample2__genes__results.txt

Command

python filename_generator.py -i inlist.txt -o outlist.txt -p /reads/ -s .fastq.gz -f 0 -l 7 -pe

Output: a list of new file names generated on the basis of strings in inlist.txt

Example output items: (outlist.txt)
    /reads/sample1_1.fastq.gz
    /reads/sample1_2.fastq.gz
    /reads/sample2_1.fastq.gz
    /reads/sample2_2.fastq.gz

Name		Name	Last commit message	Last commit date
Latest commit History 194 Commits
example		example
other_licence		other_licence
shell		shell
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
add_sample_name_FASTA.py		add_sample_name_FASTA.py
downloadSeqFromNCBI.py		downloadSeqFromNCBI.py
extractNuclRegionFromFASTA.py		extractNuclRegionFromFASTA.py
extractSeqFromGBK.py		extractSeqFromGBK.py
extractSeqFromMultiFASTA.py		extractSeqFromMultiFASTA.py
filename_generator.py		filename_generator.py
filterSPAdesContigs.py		filterSPAdesContigs.py
gbk2tbl.py		gbk2tbl.py
gbk2tsv.py		gbk2tsv.py
gc.py		gc.py
gfa_stats.py		gfa_stats.py
linkPEreadsets.py		linkPEreadsets.py
mergeGenomicRegions.R		mergeGenomicRegions.R
parse_ENA_sampleInfo_XML.py		parse_ENA_sampleInfo_XML.py
parse_biosample.py		parse_biosample.py
rename_fasta_seqs.py		rename_fasta_seqs.py
run_CutAdapt.py		run_CutAdapt.py
screen_genes_blast.py		screen_genes_blast.py
seqlen.py		seqlen.py
tabulateMUMmerCoordinates.py		tabulateMUMmerCoordinates.py

License

wanyuac/BINF_toolkit

Folders and files

Latest commit

History

Repository files navigation

BINF_toolkit

A list of scripts

Manual

add_sample_name_FASTA.py

downloadSeqFromNCBI.py

extractNuclRegionFromFASTA.py

gbk2tbl.py

gbk2tsv.py

gc.py

extractSeqFromGBK.py

parse_ENA_sampleInfo_XML.py

run_CutAdapt.py

filename_generator.py

About

Resources

License

Stars

Watchers

Forks

Languages