informatics-tools

Some small and simple programs useful for various bioinformatics purposes.

1. extract-contigs.pl

This Perl script extracts contigs from FASTA files, either by a single contig name or by a list of contig names.

$ extract-contigs.pl single CONTIGNAME FASTAFILE

or with a list of contigs

$ extract-contigs.pl list LISTNAME FASTAFILE

To save the output to a new file, redirect it with ">":

$ extract-contigs.pl list LISTNAME FASTAFILE > NEW_FILENAME
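For readers who want to see the idea behind single-contig extraction, here is a minimal AWK sketch. It is not the code of extract-contigs.pl; the function name extract_contig is hypothetical, and it assumes the contig name is the header text between ">" and the first whitespace.

```shell
# Hypothetical helper (not the actual extract-contigs.pl logic).
# Usage: extract_contig CONTIGNAME FASTAFILE
extract_contig() {
    awk -v name="$1" '
        # On a header line, decide whether this record is the wanted one
        /^>/ { id = substr($1, 2); keep = (id == name) }
        # Print header and sequence lines while inside the wanted record
        keep
    ' "$2"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf '>c1 first\nACGT\nTTGA\n>c2\nGGCC\n' > "$tmp"
extract_contig c2 "$tmp"
# prints:
# >c2
# GGCC
rm -f "$tmp"
```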

2. ver-horizontal.sh

This Bash script prints a vertical list (one item per line) as a single horizontal line on standard output, which is handy for automation.

$ ver-horizontal.sh LIST

For instance, a file named LIST

hello
world
happy

will be converted to

hello world happy
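The same conversion can be sketched with the standard paste utility. This is an illustration only, not necessarily how ver-horizontal.sh is written; the function name to_horizontal is hypothetical.

```shell
# Hypothetical helper: join all lines of a file with single spaces.
# Usage: to_horizontal LIST
to_horizontal() {
    # -s joins the lines of the file serially; -d ' ' sets the delimiter to a space
    paste -s -d ' ' "$1"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf 'hello\nworld\nhappy\n' > "$tmp"
to_horizontal "$tmp"
# prints: hello world happy
rm -f "$tmp"
```

Note that empty lines in the input would show up as repeated spaces in this sketch.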

3. assembly-stats.pl

This Perl script reports general assembly statistics: genome size, contig count, mean and median contig size, N50, largest contig (bases), GC content, N count and gap count. It takes a FASTA assembly as input.

$ assembly-stats.pl FASTAFILE

It generates data on standard output as follows:

Sample_ID  Genome  Contigs Mean    Median  N50     Largest GC(%)   N_count N(%)    Gap_count
test.fasta 158     2       79      89      89      89      5.95    26      16.46   4

Output explanations

Genome: Genome size
Contigs: Number of contigs in the fasta file
Mean: Average size of contigs in bases
Median: Median contig size in bases
N50: Yardstick of assembly contiguity: contigs of this size or larger together contain at least 50% of the assembly (in nucleotide bases)
Largest: Size (nucleotide bases) of the largest contig
GC(%): GC (guanine and cytosine) content of the genome
N_count: Number of N characters in the genome (uncertain base calls)
N(%): N count as a percentage of the genome size
Gap_count: Number of "-" characters in the FASTA file; these usually appear in alignment files
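To make the N50 column concrete, the sketch below computes it with sort and AWK using the standard definition: sort contig lengths from largest to smallest and report the length at which the running total first reaches half of the assembly size. This is an independent illustration, not the code of assembly-stats.pl; the function name n50 is hypothetical.

```shell
# Hypothetical helper: print the N50 of a FASTA file.
# Usage: n50 FASTAFILE
n50() {
    # Step 1: one length per contig (sequence lines of a record are summed)
    awk '
        /^>/ { if (len) print len; len = 0; next }
        { len += length($0) }
        END { if (len) print len }
    ' "$1" |
    # Step 2: largest first
    sort -rn |
    # Step 3: walk down until the running total covers half of the assembly
    awk '
        { lens[NR] = $1; total += $1 }
        END {
            for (i = 1; i <= NR; i++) {
                running += lens[i]
                if (running * 2 >= total) { print lens[i]; exit }
            }
        }
    '
}

# Demo: contigs of 2, 3, 4 and 5 bases (14 in total)
tmp=$(mktemp)
printf '>a\nAC\n>b\nACG\n>c\nACGT\n>d\nACGTA\n' > "$tmp"
n50 "$tmp"
# prints: 4   (5 + 4 = 9, which is at least half of 14)
rm -f "$tmp"
```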

4. reverse-complement.sh

This Bash script generates the reverse complement of nucleotide FASTA files. That is, each sequence is reversed and its bases are complemented (A <-> T, G <-> C).

$ reverse-complement.sh -o OUTPUT_FILENAME FASTAFILE

Option -o can be omitted; the default output filename is FASTAFILE-complement.fasta
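As an illustration of the operation itself, here is a small AWK sketch that reverse-complements every record of a FASTA file and prints to standard output. It is not the code of reverse-complement.sh; the function name revcomp is hypothetical. It assumes plain A/C/G/T input (other characters such as N pass through unchanged) and writes each sequence on a single line.

```shell
# Hypothetical helper: reverse-complement each record of a FASTA file.
# Usage: revcomp FASTAFILE
revcomp() {
    awk '
        # Print the buffered sequence reversed and complemented, then clear it
        function emit(   i, out, c) {
            if (seq == "") return
            out = ""
            for (i = length(seq); i >= 1; i--) {
                c = substr(seq, i, 1)
                out = out (c in comp ? comp[c] : c)
            }
            print out
            seq = ""
        }
        BEGIN {
            comp["A"] = "T"; comp["T"] = "A"; comp["G"] = "C"; comp["C"] = "G"
            comp["a"] = "t"; comp["t"] = "a"; comp["g"] = "c"; comp["c"] = "g"
        }
        /^>/ { emit(); print; next }   # header: flush previous record first
        { seq = seq $0 }               # join wrapped sequence lines
        END { emit() }
    ' "$1"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf '>s1\nAAC\nGG\n' > "$tmp"
revcomp "$tmp"
# prints:
# >s1
# CCGTT
rm -f "$tmp"
```

Building the output one character at a time keeps the sketch readable but is slow on whole genomes.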

5. basename_dir.sh

This one-line Bash script extracts the basenames of files that share a suffix. For example, if a directory contains three files with the suffix .fasta, namely ABC.fasta, CDE.fasta and EFG.fasta, the usage is:

$ basename_dir.sh .fasta

It will print on standard output:

ABC CDE EFG
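The behaviour described above can be sketched as a short shell loop around the basename utility. This is an assumption about the approach, not the script's actual one-liner; the function name list_basenames and its optional directory argument are hypothetical.

```shell
# Hypothetical helper: print the basenames of all files ending in SUFFIX.
# Usage: list_basenames SUFFIX [DIRECTORY]   (DIRECTORY defaults to ".")
list_basenames() {
    suffix=$1
    dir=${2:-.}
    out=""
    for f in "$dir"/*"$suffix"; do
        [ -e "$f" ] || continue                 # no match: the glob stays literal
        # basename strips the directory part and the given suffix
        out="$out${out:+ }$(basename "$f" "$suffix")"
    done
    printf '%s\n' "$out"
}

# Demo in a throwaway directory
d=$(mktemp -d)
: > "$d/ABC.fasta"; : > "$d/CDE.fasta"; : > "$d/EFG.fasta"
list_basenames .fasta "$d"
# prints: ABC CDE EFG
rm -rf "$d"
```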

6. extract-sequences-ids.sh

This Bash script quickly extracts sequences (or contigs, if you like) from multifasta files using an external file containing a list of ids, and prints them to standard output.

$ extract-sequences-ids.sh ids multifasta

To see the usage options, run:

$ extract-sequences-ids.sh -h

This bash script can extract sequences from multifasta files using sequence ids

Usage: extract-sequences-ids.sh [options] ids.file multifasta.file
Options:
 -h print usage and exit
 -a print author and exit
 -v print version and exit

When run with an ids file and a multifasta file, the script prints the matching sequences to standard output, for example:

>ABC123
ATGATAAGATTTAAGAAAACAAAATTAATAGCAAGTATTGCAATGGCTTTATGTCTGTTT
TCTCAACCAGTAATCAGTTTCTCAAAGGATATAACAGATAAAAATCAAAGTATTGATTCT
GGAATATCAAGCTTAAGTTACAATAGAAATGAAGTTTTAGCTAGTAATGGAGATAAAATT
GAAAGTTTTGTTCCAAAGGAAGGTAAAAAGACTGGTAATAAATTTATAGTTGTAGAACGT
CAAAAAAGATCCCTTACAACATCACCAGTAGATATATCAATAATTGATTCTGTAAATGAC

To generate the ids file, create a file with any name (e.g. ids) in a text editor such as vim and enter the sequence ids one per line:

ABC123
DMF123
dlfppt
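One common way to do this kind of lookup is the two-file AWK idiom: read the ids into an array first, then print only the records whose header id is in that array. The sketch below shows that idiom; it is not the code of extract-sequences-ids.sh, and the function name extract_by_ids is hypothetical. It assumes a non-empty ids file with one id per line, and that a sequence id is the header text between ">" and the first whitespace.

```shell
# Hypothetical helper: print FASTA records whose id is listed in IDS_FILE.
# Usage: extract_by_ids IDS_FILE MULTIFASTA_FILE
extract_by_ids() {
    awk '
        # First file (ids): remember every non-empty id
        NR == FNR { if ($1 != "") want[$1] = 1; next }
        # Second file (FASTA): switch printing on or off at each header
        /^>/ { keep = (substr($1, 2) in want) }
        keep
    ' "$1" "$2"
}

# Demo with throwaway files
ids=$(mktemp); fa=$(mktemp)
printf 'ABC123\ndlfppt\n' > "$ids"
printf '>ABC123 demo\nATGA\nTAAG\n>XYZ999\nGGGG\n>dlfppt\nCCAT\n' > "$fa"
extract_by_ids "$ids" "$fa"
# prints:
# >ABC123 demo
# ATGA
# TAAG
# >dlfppt
# CCAT
rm -f "$ids" "$fa"
```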

7. contigs-ids-length.sh

This script estimates the length of each contig in a multi-fasta file.
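As a rough illustration of what such a calculation involves, the AWK sketch below sums the sequence-line lengths of each record and prints one id and length per line. The tab-separated output format and the function name contig_lengths are assumptions for this example, not taken from contigs-ids-length.sh.

```shell
# Hypothetical helper: print "id<TAB>length" for every record in a FASTA file.
# Usage: contig_lengths FASTAFILE
contig_lengths() {
    awk '
        /^>/ {
            if (id != "") print id "\t" len   # finish the previous record
            id = substr($1, 2); len = 0
            next
        }
        { len += length($0) }                 # wrapped lines are summed
        END { if (id != "") print id "\t" len }
    ' "$1"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf '>a\nACGT\nAC\n>b\nGG\n' > "$tmp"
contig_lengths "$tmp"
# prints (tab-separated):
# a	6
# b	2
rm -f "$tmp"
```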

8. filter-contig.pl

This script filters a genome assembly by a specified minimum contig length. For example, to filter out contigs shorter than 500 bp:

$ filter-contig.pl 500 fasta
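The filtering step can be sketched in AWK by buffering each record and printing it only if its total length reaches the threshold. This is an illustration, not the code of filter-contig.pl; the function name filter_min_length is hypothetical, and it assumes that contigs exactly at the minimum length are kept.

```shell
# Hypothetical helper: keep only records with at least MIN bases.
# Usage: filter_min_length MIN FASTAFILE
filter_min_length() {
    awk -v min="$1" '
        # Print the buffered record if it is long enough
        function emit() {
            if (hdr != "" && len >= min) printf "%s\n%s", hdr, body
        }
        /^>/ { emit(); hdr = $0; body = ""; len = 0; next }
        { body = body $0 "\n"; len += length($0) }
        END { emit() }
    ' "$2"
}

# Demo: keep contigs of at least 4 bases
tmp=$(mktemp)
printf '>a\nACGT\nAC\n>b\nGG\n>c\nACGT\n' > "$tmp"
filter_min_length 4 "$tmp"
# prints:
# >a
# ACGT
# AC
# >c
# ACGT
rm -f "$tmp"
```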

9. atgc2ATGC.sh

This Bash script converts lowercase bases (atgc) to uppercase (ATGC) using AWK. It is fast: a FASTA file larger than 300 MB is converted in under 15 seconds.
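A conversion of this kind fits in one AWK rule using the built-in toupper function. The sketch below is an assumption about the approach, not the script's actual code; the function name to_upper_fasta is hypothetical, and leaving header lines unchanged is a choice made for this example.

```shell
# Hypothetical helper: uppercase sequence lines, leave header lines untouched.
# Usage: to_upper_fasta FASTAFILE
to_upper_fasta() {
    awk '
        /^>/ { print; next }     # header lines pass through as-is
        { print toupper($0) }    # sequence lines are uppercased
    ' "$1"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf '>seq_a desc\nacgtn\nAcGt\n' > "$tmp"
to_upper_fasta "$tmp"
# prints:
# >seq_a desc
# ACGTN
# ACGT
rm -f "$tmp"
```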

10. rename_contigs.sh

This script renames contigs in multi-fasta files.

$ rename_contigs.sh fasta PREFIX

For example, where 123.fasta is your FASTA file and ABC is the prefix to use for the renamed contigs:

$ rename_contigs.sh 123.fasta ABC
>ABC.1
ATGCATGC
>ABC.2
AGGTCTCT
>ABC.3
AGGGCCGT

To save the result into a new FASTA file, redirect the output with ">", e.g.

$ rename_contigs.sh 123.fasta ABC > ABC.fna
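The renaming shown above (PREFIX.1, PREFIX.2, ...) can be reproduced with a counter in AWK. This sketch only mirrors the documented output format and is not the code of rename_contigs.sh; the function name rename_fasta is hypothetical.

```shell
# Hypothetical helper: replace each header with ">PREFIX.<running number>".
# Usage: rename_fasta FASTAFILE PREFIX
rename_fasta() {
    awk -v prefix="$2" '
        /^>/ { n++; print ">" prefix "." n; next }   # new header, original one is dropped
        { print }                                    # sequence lines are unchanged
    ' "$1"
}

# Demo with a throwaway file
tmp=$(mktemp)
printf '>contig_x\nATGCATGC\n>contig_y\nAGGTCTCT\n' > "$tmp"
rename_fasta "$tmp" ABC
# prints:
# >ABC.1
# ATGCATGC
# >ABC.2
# AGGTCTCT
rm -f "$tmp"
```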

11. basename-one-liner.sh

This Bash script uses the basename utility to print filename prefixes on one line. For example, suppose you have three fastq files in the same directory: BA1_R1.fastq, BA2_R1.fastq and BA3_R1.fastq. Running the script with the shared suffix (in this case _R1.fastq) prints the prefixes:

$ basename-one-liner.sh _R1.fastq
BA1 BA2 BA3

raymondkiu/bioinformatics-tools