barapost local

masikol edited this page May 26, 2023 · 9 revisions

Description

barapost-local.py -- This script is designed for taxonomic classification of nucleotide sequences by finding the most similar sequence in a nucleotide database stored on the local machine.

"barapost-local.py" downloads the records "discovered" by "barapost-prober.py" (and all replicons related to them: other chromosomes, plasmids) from GenBank according to the file (hits_to_download.tsv) generated by "barapost-prober.py" and creates a database on the local machine. After that, "barapost-local.py" classifies the rest of the data with the "BLAST+" toolkit.

The script processes FASTQ and FASTA files, including gzipped ones (.fastq.gz and .fasta.gz).
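As an illustration of handling both plain and gzipped input, here is a minimal Python sketch. The helper names `open_maybe_gzipped` and `count_records` are hypothetical, and detecting the format by file extension is an assumption made for this example; it is not taken from the barapost source.

```python
import gzip


def open_maybe_gzipped(path):
    """Open a plain or gzip-compressed text file for reading."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "rt")


def count_records(path):
    """Count records in a FASTA or FASTQ file (plain or gzipped)."""
    # Strip a trailing ".gz" so that the real extension can be inspected.
    name = path[:-3] if path.endswith(".gz") else path
    with open_maybe_gzipped(path) as handle:
        if name.endswith(".fastq"):
            # Assumes the common 4-lines-per-record FASTQ layout.
            return sum(1 for _ in handle) // 4
        # FASTA: every record starts with a ">" header line.
        return sum(1 for line in handle if line.startswith(">"))
```

For example, `count_records("reads.fastq.gz")` would return the number of reads in that file.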

If you have your own FASTA files that can be used as a database to blast against, you can skip the "barapost-prober.py" step and go straight to "barapost-local.py" (see the -l option).

Input files

  • Input FASTA and FASTQ files should be specified as positional arguments (see examples below).
  • Input files must have different names, i.e. files .../dir1/reads.fastq and .../dir2/reads.fastq cannot be processed together.
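The name restriction above can be checked before launching the script. The following is a small sketch, assuming that "different names" means different base names (the part of the path after the last directory separator); `find_name_collisions` is a hypothetical helper and not part of barapost.

```python
import os
from collections import defaultdict


def find_name_collisions(paths):
    """Return {basename: [paths]} for base names shared by several input files."""
    by_name = defaultdict(list)
    for path in paths:
        by_name[os.path.basename(path)].append(path)
    # Keep only the names that occur more than once.
    return {name: group for name, group in by_name.items() if len(group) > 1}
```

An empty result means that the set of input files satisfies the restriction as it is interpreted here.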

Default parameters

  • if no input files are specified, all FASTQ and FASTA files in current directory will be processed;
  • packet size (see -p option): 100 sequences;
  • algorithm (see -a option): 0 (zero, i.e. megaBlast);
  • number of threads to launch (-t option): 1 thread.

Options

    -h (--help) --- show help message.
        '-h' -- brief, '--help' -- full;

    -v (--version) --- show version;

    -r (--annot-resdir) --- result directory generated by script 'barapost-prober.py'.
        This is the directory passed to 'barapost-prober.py' with the '-o' option.
        If you skip 'barapost-prober.py' and use your own FASTA files
        to create a database, this directory does not have to exist
        before 'barapost-local.py' starts (i.e. it will be a simple output directory).
        Default value is "barapost_result".

    -d (--indir) --- directory which contains FASTQ or FASTA files meant to be processed.
        I.e. all FASTQ and FASTA files in this directory will be processed;

    -p (--packet-size) --- size of the packet, i.e. number of sequences
        to align in one blastn run.
        Value: positive integer number. Default value is 100;

    -a (--algorithm) --- BLASTn algorithm to use for aligning.
        Available values: 0 for megaBlast, 1 for discoMegablast, 2 for blastn.
        Default is 0 (megaBlast);

    -l (--local-fasta-to-db) --- your own (local) FASTA file that will be
        added to the downloaded database
        (or used instead of it if you skip the 'barapost-prober.py' step);

    -s (--accession) --- accession(s) of GenBank record(s) to download and
        include in the database. Multiple accessions should be
        separated by commas without whitespace;

    -t (--threads) --- number of CPU threads to use.
        Default value is 1.
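To make the meaning of the packet size more concrete, here is a minimal sketch of splitting a stream of sequences into packets, where each packet would be aligned in one blastn run. `iter_packets` is a hypothetical name, and this is only an illustration of the idea, not barapost's actual implementation.

```python
from itertools import islice


def iter_packets(records, packet_size=100):
    """Yield lists of at most `packet_size` records (100 is the documented default)."""
    if packet_size < 1:
        raise ValueError("packet size must be a positive integer")
    iterator = iter(records)
    while True:
        packet = list(islice(iterator, packet_size))
        if not packet:
            return
        yield packet
```

With 250 input sequences and the default packet size, this yields three packets of 100, 100 and 50 sequences.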

Explanation of output files

  • "classification.tsv" file in directory corresponding to a particular input file (for example, for input file something.fastq result directory will be named something). In this file you can find:

    • IDs of classified sequences;
    • classification (usually lineage);
    • accession(s) of best hit(s) separated by &&;
    • some alignment statistics;
    • quality information (for FASTQ files).

    It is the same file that "barapost-prober.py" generates. "barapost-local.py" appends its classification results to this file.
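The table can be post-processed with standard tools. Below is a minimal Python sketch for reading it. It assumes that the first line of "classification.tsv" is a header row, which is not stated above; the function names are hypothetical. The only detail taken from the description is that multiple best-hit accessions are separated by &&.

```python
import csv


def read_classification(path):
    """Yield one dict per row; assumes the first line is a header (an assumption)."""
    with open(path, newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")


def split_accessions(field):
    """Split a best-hit field such as 'ACC1&&ACC2' into a list of accessions."""
    return [accession for accession in field.split("&&") if accession]
```

The actual column names depend on the header written by barapost, so look them up in your own "classification.tsv" before indexing the row dictionaries.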

Notes about using your own FASTA files as database

  1. Besides using the -l option, you can specify your own FASTA files using the accession TSV file generated by "barapost-prober.py". To do this, just write your FASTA file's path to this TSV file on a new line.

  2. The "makeblastdb" utility from the "BLAST+" toolkit considers the first word (it separates words by spaces) of a sequence ID in a FASTA file to be the sequence accession. Naturally, duplicated accessions are not allowed. Therefore, in order to avoid this duplication, "barapost-local.py" uses modified sequence IDs of your own sequences while creating the database. It adds a custom accession number to the beginning of each sequence ID. These custom accessions have the following format: OWN_SEQ_<N>, where N is an integer (the ordinal number of a sequence). These modified sequence IDs are used only in the database -- your own FASTA files will be kept intact.

  3. If you include a SPAdes or a5 assembly FASTA file in the database with "barapost-local.py", sequence IDs will be modified in a specific way. If there is more than one assembly file generated by the same assembler (e.g. two files named "contigs.fasta" generated by SPAdes), absolute paths to these "contigs.fasta" files will be added to sequence IDs while creating the database. So, sequence IDs will look like this, e.g. for SPAdes:

    OWN_SEQ_4 /some/happy/path/contigs.fasta_NODE_3_length_546787_cov_102.642226

    Sequences from assembly files affect binning process in their own specific way (see "Binning details" section in "barapost-binning" description, section #1).
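The ID modification described in note 2 can be pictured with the following sketch. It prefixes each FASTA header with OWN_SEQ_<N>, separated from the original ID by a space, as in the example ID shown above. The function name is hypothetical, and starting the numbering at 1 is an assumption; this is not barapost's own code.

```python
def add_own_seq_ids(fasta_lines, start=1):
    """Yield FASTA lines with 'OWN_SEQ_<N> ' inserted at the start of each header."""
    number = start  # assumed starting ordinal; the real numbering may differ
    for line in fasta_lines:
        if line.startswith(">"):
            # Keep the original ID after the custom accession.
            yield ">OWN_SEQ_{} {}".format(number, line[1:])
            number += 1
        else:
            # Sequence lines are passed through unchanged.
            yield line
```

Because the custom accession becomes the first space-separated word of the header, it is what "makeblastdb" would treat as the accession, while the original ID stays readable after it.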

Lambda control detection

In order to detect Nanopore lambda control sequences, barapost-local.py always includes the 3.6 kb sequence of lambda phage (DNA-CS, ONT, United Kingdom) in the reference database. The sequence is derived from Koivunen, Sampo. "Evaluation of the Sequencing Pipeline for the Oxford Nanopore MinION Long-read DNA Sequencer." (2019). (Additional Data).

In Barapost, the DNA-CS sequence can be found in the file barapost/lambda_control/nanopore_lambda_DNA-CS_control.fasta.gz.

Examples

Note for Windows users: run py -3 barapost-local.py in the Windows console. Running barapost-local.py directly won't work.

  1. Process all FASTA and FASTQ files in the working directory with default settings:

barapost-local.py

  2. Process all files whose names start with "some_fasta" in the working directory with default settings:

barapost-local.py some_fasta*

  3. Process one FASTQ file with default settings. The directory that contains the taxonomic annotation is named prober_outdir:

barapost-local.py reads.fastq -r prober_outdir

  4. Process a FASTQ file and a FASTA file with discoMegablast (-a 1) and a packet size of 50 sequences. The directory that contains the taxonomic annotation is named prober_outdir:

barapost-local.py reads.fastq.gz another_sequences.fasta -a 1 -p 50 -r prober_outdir

  5. Process all FASTQ and FASTA files in the directory named some_dir. The directory that contains the taxonomic annotation is named prober_outdir:

barapost-local.py -d some_dir -r prober_outdir

  6. Process the file named some_reads.fastq. The directory that contains the taxonomic annotation is named prober_outdir. The reference sequence from the file my_own_sequence.fasta will be included in the database. Use 4 CPU threads:

barapost-local.py some_reads.fastq -l my_own_sequence.fasta -t 4 -r prober_outdir

  7. Compare two sets of contigs of the same genome. Here we have 50 sequences in 50_contigs.fasta and 70 sequences in 70_contigs.fasta. The latter contains more sequences, so it should be passed to barapost-local.py as a positional argument, and the former with the -l option. In SQL terms, this produces an "OUTER JOIN":

barapost-local.py 70_contigs.fasta -l 50_contigs.fasta -r outdir
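If you launch the script from your own Python code, the Windows note above can be taken into account when building the command. This is a minimal sketch: `build_command` is a hypothetical helper, and it assumes that barapost-local.py can be found from the current working directory or PATH.

```python
import sys


def build_command(args, platform=sys.platform):
    """Return the argument list for launching barapost-local.py."""
    if platform.startswith("win"):
        # On Windows the script is launched through 'py -3' (see the note above).
        return ["py", "-3", "barapost-local.py"] + list(args)
    return ["barapost-local.py"] + list(args)


# Possible usage (not executed here):
#   import subprocess
#   subprocess.run(build_command(["-d", "some_dir", "-r", "prober_outdir"]), check=True)
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.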