unmulti - extract individual sequences from a fasta file

unmulti is a tool for splitting a fasta file containing several sequences (a multifasta) into separate files containing one sequence each, or for extracting a list of named sequences from such a file. The tool supports input compressed in .gz, .bz2, or .xz format, and can compress its output in .gz format.

Installation

Prebuilt binaries

Prebuilt binaries for generic linux_x86-64 are available from the releases page.

Compiling from source

Requirements

C++11 compliant compiler.
cmake v2.8.2 or newer.
git.
zlib.
Internet access (the build downloads its dependencies).

How-to

Clone the repository, enter the directory and run

mkdir build
cd build
cmake ..
make

This will download the necessary dependencies and compile the unmulti executable in build/bin/.

Usage

unmulti -f <input multifasta> -o <output directory (default: working directory)>

Example

Split a multifasta

Running unmulti on an input file in.fasta with the following contents

>seq_1
AAACGT
>seq_2
GGGTAC

will produce files 0.fasta and 1.fasta in the output directory with contents

>seq_1
AAACGT

and

>seq_2
GGGTAC
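
The numbering scheme above can be sketched in a few lines of Python. This is only an illustration of the behaviour described in this README, not unmulti's own implementation; the function name `split_multifasta` and its return value are hypothetical.

```python
import os
import tempfile


def split_multifasta(path, out_dir, seq_start=">"):
    """Write each record of `path` to <index>.fasta in `out_dir`.

    Hypothetical helper mirroring the behaviour described above.
    Returns the (index, sequence name) pairs in the order written.
    """
    written = []
    out = None
    with open(path) as handle:
        for line in handle:
            if line.startswith(seq_start):
                # A header line closes the previous record and opens
                # the next numbered output file.
                if out is not None:
                    out.close()
                index = len(written)
                out = open(os.path.join(out_dir, f"{index}.fasta"), "w")
                written.append((index, line[len(seq_start):].strip()))
            if out is not None:
                # Lines before the first header are skipped.
                out.write(line)
    if out is not None:
        out.close()
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.fasta")
        with open(src, "w") as f:
            f.write(">seq_1\nAAACGT\n>seq_2\nGGGTAC\n")
        print(split_multifasta(src, tmp))
```

Numbering starts at 0 and follows the order of the records in the input, which matches the 0.fasta and 1.fasta files in the example.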

Extract specific sequence(s)

Running unmulti -f in.fasta --extract seq_2 on the example input above will extract only the sequence starting at >seq_2. Multiple sequences can be supplied by delimiting them with ,. Running unmulti -f in.fasta --extract seq_2,seq_1 will extract both sequences from the example input.
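
As a rough illustration of the selection logic, the following Python sketch keeps only the records whose names appear in a comma-delimited list. It is not unmulti's code: the helper name `extract_records` is hypothetical, and matching on the whole header text after the start character is an assumption.

```python
def extract_records(lines, names, seq_start=">"):
    """Return {name: record text} for the requested sequence names.

    `names` is a comma-delimited string, in the same form as the
    value given to --extract. Hypothetical helper; names are compared
    against the full header text after the start character.
    """
    wanted = set(names.split(","))
    records = {}
    current = None
    for line in lines:
        if line.startswith(seq_start):
            name = line[len(seq_start):].strip()
            # Track the record only if its name was requested.
            current = name if name in wanted else None
            if current is not None:
                records[current] = ""
        if current is not None:
            records[current] += line
    return records


if __name__ == "__main__":
    example = [">seq_1\n", "AAACGT\n", ">seq_2\n", "GGGTAC\n"]
    print(extract_records(example, "seq_2"))
```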

Other options

The input file can be supplied compressed with gzip (zlib), bzip2 (libbz2), or xz (liblzma), depending on which of these libraries were available on the machine that unmulti was compiled on. Adding the --compress toggle will compress the output files using zlib.
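
When inspecting compressed inputs or outputs in a script, the three formats can be told apart by file extension. The sketch below uses only the Python standard library; the helper name `open_maybe_compressed` is hypothetical, and choosing the format from the extension (rather than from the file contents) is an assumption of this example, not a statement about how unmulti detects compression.

```python
import bz2
import gzip
import lzma

# Extension -> opener for the three formats named in this README.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_maybe_compressed(path):
    """Open `path` for reading as text, decompressing if needed.

    Hypothetical helper: the format is chosen from the file extension.
    """
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt")
    return open(path, "r")
```

For example, `open_maybe_compressed("in.fasta.gz")` returns a text handle that yields the decompressed lines, while a plain `in.fasta` is opened unchanged.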

Adding the -t number_to_sequence.tsv argument will write a table linking the output filenames to their sequence names into the given file. In the example above, running unmulti -f in.fasta -t number_to_sequence.tsv would produce the following file

0	seq_1
1	seq_2
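
A table like this is easy to load back into a lookup from file number to sequence name. The Python sketch below assumes the columns are separated by a single tab, as the .tsv extension and the example suggest; the helper name `read_name_table` is hypothetical.

```python
import csv
import io


def read_name_table(handle):
    """Map output file number -> sequence name.

    Hypothetical helper; assumes two tab-separated columns per line.
    """
    table = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        number, name = row
        table[int(number)] = name
    return table


if __name__ == "__main__":
    example = io.StringIO("0\tseq_1\n1\tseq_2\n")
    print(read_name_table(example))
```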

If your sequences begin with a character other than '>', the --seq-start option can be used to change the character. For example, running unmulti -f in.fasta --seq-start @ would make unmulti compatible with a file in the following format

@read_1
CGCCTAC
+
GGFGGCD
@read_2
TGAGCCA
+
FFGFG=G

Accepted flags/parameters

unmulti accepts the following flags/parameters:

-f            Input multifasta.
-o            Output directory (default: working directory)
-t            Write a table linking the output filenames to sequence names into the given file.
--compress    Compress the output files with zlib (default: false)
--extract     Extract only the named sequence(s). Multiple sequences should be delimited by ','.
--seq-start   Sequence begin character (default: '>')

License

The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.
