This guide will give a general overview of how to use best
.
The alignment bam
file must contain the read sequences and quality scores.
The alignment CIGAR string can either use M
or =
/X
for matches/mismatches.
Let's say we have aligned reads in aln.bam
and the reference assembly
ref.fasta.gz
. We want to collect statistics on the types of
errors that occur in the alignments. Then, we can run best
like
best -t 4 aln.bam ref.fasta.gz output
This will use 4 threads to collect stats and generate the following files:
output.per_aln_stats.csv.gz
output.summary_cigar_stats.csv
output.summary_identity_stats.csv
output.summary_qual_score_stats.csv
output.summary_yield_stats.csv
The per-alignment stats file is gzipped to save space. The
output.summary_cigar_stats.csv
file contains the distribution of lengths of
consecutive insertions and deletions. The output.summary_identity_stats.csv
file contains some general stats on the error rates across all alignments.
The output.summary_qual_score_stats.csv
file contains the empirical Q-value
calculated from matches and mismatches for each corresponding quality score.
The all_alignments
feature indicates the quality score stats across all
alignments. If intervals are specified, then this is computed per interval feature.
The output.summary_yield_stats.csv
contains the yield (number of reads/bases)
above certain quality thresholds.
We can collect even more data in one run of best
. If we run
best -t 4 --intervals-hp --bin-types q_len:1000 gc_contant:0.05 -- aln.bam ref.fasta.gz output
then we will get two more files:
output.summary_feature_stats.csv
output.summary_bin_stats.csv
The output.summary_feature_stats.csv
file contains stats stratified by
intervals. In this case, we use --intervals-hp
to indicate that the intervals
are homopolymer regions, so this will produce the error types at homopolymer
regions of different lengths. The output.summary_bin_stats.csv
file will
contain the error types for reads binned by both the read (query) length (q_len
,
bin by increments of 1000bp) and the GC content (gc_content
, bin by increments
of 0.05).
It is also possible to use bed files as custom intervals. These files can have three columns (all intervals will have the same feature) or four columns, where the last column indicates the feature. The feature stats are aggregated across all bed intervals with the same feature.
best 0.1.0
Daniel Liu, Daniel E. Cook
Bam Error Stats Tool (best): analysis of error types in aligned reads.
USAGE:
best [OPTIONS] <INPUT> <REFERENCE> <STATS_PREFIX>
ARGS:
<INPUT>
Input BAM file
<REFERENCE>
Input reference FASTA file. Can be gzipped
<STATS_PREFIX>
Prefix for output files that contain statistics
OPTIONS:
-b, --bin-types <BIN_TYPES>...
Types of bins to use for per alignment stats.
Each bin should be of the format <bin_type>:<step_size>.
Supported bin types: q_len (read sequence length), subread_passes, mapq, mean_qual,
gc_content, concordance_qv (phred scale Q-value)
-h, --help
Print help information
--intervals-bed <INTERVALS_BED>...
Use intervals from a BED file.
The BED file should have the columns chrom, start, stop, and feature. The feature column
is optional.
This allows stats to be gathered separately for different types of intervals. Note that
all intervals are on the reference, not the reads.
--intervals-border <INTERVALS_BORDER>...
Use fixed-width nonoverlapping window border regions as intervals.
This is used to specify the window widths.
--intervals-hp
Use homopolymer regions in the reference as intervals
--intervals-match <INTERVALS_MATCH>...
Use regions in the reference that match any of the specified subsequences as intervals
--intervals-window <INTERVALS_WINDOW>...
Use fixed-width nonoverlapping windows as intervals.
This is used to specify the window widths.
--intervals-window-pos <INTERVALS_WINDOW_POS>...
Use fixed-width nonoverlapping windows with positions as intervals.
This is used to specify the window widths.
-n, --name-column <NAME_COLUMN>
Add column with a specific name in CSV outputs
--no-per-aln-stats
Turn off outputting per alignment stats
-t, --threads <THREADS>
Number of threads. Will be automatically determined if this is set to 0
[default: 0]
-V, --version
Print version information