Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

Sequence Manipulation Wiki

Elizabeth Tseng edited this page Oct 2, 2019 · 21 revisions

Last Updated: 10/02/2019

Python Requirement

How to Use this Repo

To use this repo (just the sequence/ scripts, at least), simply add the directory to your PYTHONPATH and PATH. No compilation is required.

export PYTHONPATH=$PYTHONPATH:<path_to_Cupcake>/sequence
export PATH=$PATH:<path_to_Cupcake>/sequence

List of Scripts

Detailed Documentation

Summarize sequence lengths in fasta/fastq files

Usage:

get_seq_stats.py <fasta_or_fastq_filename>

Example:

$ get_seq_stats.py test.fa
file type is: fasta
50 sequences
min: 387
max: 7394
avg: 1906.34
Length Breakdown by kb range:
0-1 kb: 17
1-2 kb: 10
2-3 kb: 18
3-4 kb: 3
4-5 kb: 0
5-6 kb: 0
6-7 kb: 0
7-8 kb: 2

Convert between fasta and fastq format

Usage:

fa2fq.py <fasta_filename>
fq2fa.py <fastq_filename>

Example:

fa2fq.py test.fasta
fq2fa.py test.fastq

Reverse complement sequence(s)

Usage:

revcomp.py <seq1> [seq2] [seq3] [seq4]

Example:

$ revcomp.py AAACCT
AGGTTT
$ revcomp.py AAACCT GGCCGAA
AGGTTT
TTCGGCC

Sort fasta file by length (increasing or decreasing)

Usage:

sort_fasta_by_len.py [-r] <fasta_filename>

Example:

$ sort_fasta_by_len.py test.fa   (sort by increasing length)
$ sort_fasta_by_len.py -r test.fa  (sort by decreasing length)

Extract list of sequences given a fasta file and a list of IDs

Usage:

get_seqs_from_list.py <fasta_filename> <list_filename>

list_filename should be a text file where each line is a sequence ID to be found in the fasta file.

Example:

get_seqs_from_list.py test.fasta list.fasta > extracted.fasta

Generate fasta sequences given genome and SAM file

NOTE: for this script to properly work (since it depends on other scripts in the same directory), you must add to your PATH variable the sequence/ subdirectory of the repo. For example:

export PATH=$PATH:<path_to_Cupcake>/sequence/

Usage:

err_correct_w_genome.py <genome_fasta> <SAM_file> <output_fasta_filename>

genome_fasta should be the genomic fasta file (ex: hg38.fa), but note that the program reads the entire genome so it may take a while and you must have enough memory!

SAM_file is the SAM alignment file from which to reconstitute your genomic sequences. A common example would be the GMAP SAM output of a PacBio transcriptome dataset. The SAM file does not need to be sorted.

Example (including an example GMAP command to first generate the SAM file):

gmap -D /home/gmap_db -d hg38 -f samse -n 0 -t 12 input.fasta > input.fasta.sam 2> input.fasta.sam.log

err_correct_w_genome.py hg38.fa input.fasta.sam input.hg38_corrected.fasta

Calculate expected accuracy from FASTQ file.

This is for calculating expected accuracy based on FASTQ QV. Useful for looking at expected accuracy in low-quality (LQ) isoforms.

usage: calc_expected_accuracy_from_fastq.py [-h] [--qv_trim_5 QV_TRIM_5]
                                            [--qv_trim_3 QV_TRIM_3]
                                            fastq_filename output_filename

An example usage would be:

python calc_expected_accuracy_from_fastq.py lq_isoforms.fastq lq_isoforms.with_exp_acc.fastq

This would add an extra field at the end of the sequence header, for example:

i0_LQ_sample32f89e|c21/f13p0/373 isoform=c21; \
full_length_coverage=13; \
non_full_length_coverage=0; \
isoform_length=373; \
expected_accuracy=0.365

Filter LQ isoforms using customized FL count and expected accuracy cutoff

IMPORTANT: The FASTQ sequence IDs must already have expected_accuracy=. If not, please run calc_expected_accuracy_from_fastq.py first to modify the sequence IDs.

Usage:

usage: filter_lq_isoforms.py [-h] [--min_fl_count MIN_FL_COUNT]
                             [--min_exp_acc MIN_EXP_ACC]
                             fastq_filename output_filename

An example usage is:

python filter_lq_isoforms.py lq_isoforms.with_exp_acc.fastq lq_filtered.fastq --min_fl_count=1 --min_exp_acc=0.95

While running, you will see messages like:

Including i10_LQ_sample32f89e|c526/f1p0/10181 into output file (FL: 1, acc: 0.968).
Including i10_LQ_sample32f89e|c545/f1p35/10190 into output file (FL: 1, acc: 0.993).
Including i10_LQ_sample32f89e|c550/f1p0/10108 into output file (FL: 1, acc: 0.97).
Including i11_LQ_sample32f89e|c204/f1p0/11961 into output file (FL: 1, acc: 0.994).
Including i11_LQ_sample32f89e|c313/f1p0/11313 into output file (FL: 1, acc: 0.989).

Convert SAM to BAM

usage: sam_to_bam.py sam_filename

Uses samtools to convert SAM to BAM file.

Convert SAM to GFF3

Convert SAM file to GFF3 format using BCBio GFF library. Requires installing BCBio first.

usage: sam_to_gff3.py  [-i INPUT_FASTA]  sam_filename

Group identical ORF sequences

You can predict ORF sequences using tools such as GMST and ANGEL. Different isoforms may encode for the same protein sequences. This script will group fully identical ORF sequences and de-duplicate them.

usage: group_ORF_sequences.py FAA_FILENAME OUTPUT_PREFIX [--is_pbid]

Use the optional parameter --is_pbid only if your .faa file have sequence IDs that have the format PB.X.Y.

For example, if you run:

group_ORF_sequences.py test.faa test.dedup --is_pbid

This will generate two output files, test.dedup.faa and test.dedup.group.txt. The .group.txt file will tell you which sequences from the input file have identical ORFs.

Clone this wiki locally