ddRaptor is a tool for ddRAD enzyme selection. It uses Aho–Corasick pattern matching to find cut sites for enzyme pairs, counts the fragments that fall within a size range, and generates summary tables and a heatmap of fragment counts per chromosome.
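The counting step described above can be illustrated with a small sketch. It is not ddRaptor's actual code: it swaps in a naive `str.find` scan in place of Aho–Corasick, handles only unambiguous palindromic motifs (so a forward scan covers both strands), and assumes a fragment is counted when it lies between two adjacent cut sites made by different enzymes and its length is within the inclusive range. The function names `cut_positions` and `count_ddrad_fragments` are hypothetical.

```python
from typing import List, Tuple


def cut_positions(seq: str, motif: str, cut_offset: int) -> List[int]:
    """Return the cut coordinate for every (possibly overlapping) motif hit."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + cut_offset)
        start = seq.find(motif, start + 1)
    return positions


def count_ddrad_fragments(
    seq: str,
    enzyme_a: Tuple[str, int],  # (motif, cut_offset), e.g. ("GAATTC", 1)
    enzyme_b: Tuple[str, int],
    min_len: int,
    max_len: int,
) -> int:
    """Count fragments between adjacent cuts made by different enzymes.

    Assumption: this adjacency rule is one common ddRAD convention,
    not necessarily the rule ddRaptor applies.
    """
    cuts = sorted(
        [(p, "A") for p in cut_positions(seq, *enzyme_a)]
        + [(p, "B") for p in cut_positions(seq, *enzyme_b)]
    )
    count = 0
    for (left, left_enzyme), (right, right_enzyme) in zip(cuts, cuts[1:]):
        length = right - left
        if left_enzyme != right_enzyme and min_len <= length <= max_len:
            count += 1
    return count


# Toy sequence: EcoRI site, 20 bases, BamHI site, 50 bases, EcoRI site.
toy = "GAATTC" + "A" * 20 + "GGATCC" + "A" * 50 + "GAATTC"
ecori = ("GAATTC", 1)
bamhi = ("GGATCC", 1)
n_fragments = count_ddrad_fragments(toy, ecori, bamhi, 20, 60)
```

On the toy sequence the two fragments are 26 and 56 bases long, so both fall inside a 20 to 60 window.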
- Python 3.6+
- Biopython
- pyahocorasick
- tqdm
- pandas
- matplotlib
Install them via:
pip install biopython pyahocorasick tqdm pandas matplotlib
- Clone or download this repository.
- Make sure you have the dependencies installed (see above).
- Ensure the script is executable:
  chmod +x ddraptor.py
- (Optional) Copy it into your PATH:
  cp ddraptor.py /usr/local/bin/ddraptor
Tab-delimited list of enzymes and their IUPAC motifs, with a caret (^) marking the cut position:
# name motif_with_caret
EcoRI G^AATTC
BamHI G^GATCC
HindIII A^AGCTT
SphI GCATG^C
PstI CTGCA^G
CviQI G^TAC
NsiI ATGCA^T
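As an illustration of how a line in this format can be interpreted, the sketch below splits the caret-marked motif into the bare recognition sequence and a cut offset. The `Enzyme` tuple and `parse_enzyme_line` are hypothetical names, and accepting any whitespace as the separator is an assumption made for tolerance; ddRaptor's own parser may differ.

```python
from typing import NamedTuple, Optional


class Enzyme(NamedTuple):
    name: str
    motif: str       # recognition sequence with the caret removed
    cut_offset: int  # number of bases to the left of the cut


def parse_enzyme_line(line: str) -> Optional[Enzyme]:
    """Parse 'name<TAB>motif_with_caret'.

    Returns None for comments, blank lines, and malformed lines.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields = line.split()  # assumption: tabs or runs of spaces both accepted
    if len(fields) != 2 or fields[1].count("^") != 1:
        return None
    name, raw = fields
    cut_offset = raw.index("^")
    motif = raw.replace("^", "").upper()
    return Enzyme(name, motif, cut_offset)


ecori = parse_enzyme_line("EcoRI\tG^AATTC")
sphi = parse_enzyme_line("SphI\tGCATG^C")
```

Here `G^AATTC` gives offset 1 (the cut falls after the first base), while `GCATG^C` gives offset 5.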
Tab-delimited list of ddRAD enzyme pairs (combos):
# combo_name enzymeA,enzymeB
Combo1 EcoRI,BamHI
Combo2 EcoRI,HindIII
Combo3 BamHI,HindIII
Combo4 SphI,PstI
Combo5 EcoRI,PstI
Combo6 CviQI,NsiI
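A minimal sketch of reading this file, assuming each line holds a combo name followed by exactly two comma-separated enzyme names. `parse_combos` and `undefined_enzymes` are hypothetical helpers, not part of ddRaptor; the second one shows a simple cross-check against the names defined in the enzymes file.

```python
from typing import Dict, Iterable, List, Set, Tuple


def parse_combos(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map combo name -> (enzymeA, enzymeB), skipping comments and bad lines."""
    combos = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue  # malformed: expected a name and one pair
        pair = fields[1].split(",")
        if len(pair) != 2 or not all(pair):
            continue  # malformed: expected exactly two enzyme names
        combos[fields[0]] = (pair[0], pair[1])
    return combos


def undefined_enzymes(
    combos: Dict[str, Tuple[str, str]], known: Set[str]
) -> List[str]:
    """Return enzyme names used in combos but missing from the enzyme list."""
    used = {name for pair in combos.values() for name in pair}
    return sorted(used - known)


combos = parse_combos([
    "# combo_name enzymeA,enzymeB",
    "Combo1\tEcoRI,BamHI",
    "Combo6\tCviQI,NsiI",
])
missing = undefined_enzymes(combos, {"EcoRI", "BamHI", "CviQI"})
```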
Any multi-FASTA file with chromosome or contig sequences:
>chr1
ACGTACGT...
>chr2
TTGACGTA...
...
python ddraptor.py \
<enzymes.tsv> <combos.tsv> <reference.fasta> \
--min <MIN_LENGTH> --max <MAX_LENGTH> [options]
- enzymes.tsv: Path to enzyme definitions.
- combos.tsv: Path to combo definitions.
- reference.fasta: Path to your multi-FASTA reference.
- --min: Minimum fragment length (inclusive).
- --max: Maximum fragment length (inclusive).
- --processes: Number of parallel worker processes (default: number of CPU cores).
- --totals-out: Path for combo totals TSV (default: ddrad_totals.tsv).
- --summary-out: Path for per-chromosome summary TSV (default: ddrad_summary.tsv).
- --heatmap-out: Path for heatmap PNG (default: ddrad_heatmap.png).
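For readers who want to see the interface in one place, the arguments above could be declared with `argparse` roughly as follows. This is a sketch reconstructed from the documented flags and defaults, not the script's actual source; `build_parser` and the `min_len`/`max_len` attribute names are assumptions.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Declare the documented positional arguments and options."""
    parser = argparse.ArgumentParser(prog="ddraptor.py")
    parser.add_argument("enzymes", help="Path to enzyme definitions")
    parser.add_argument("combos", help="Path to combo definitions")
    parser.add_argument("reference", help="Path to multi-FASTA reference")
    # dest names are hypothetical; they avoid shadowing min()/max() later on
    parser.add_argument("--min", dest="min_len", type=int, required=True,
                        help="Minimum fragment length (inclusive)")
    parser.add_argument("--max", dest="max_len", type=int, required=True,
                        help="Maximum fragment length (inclusive)")
    parser.add_argument("--processes", type=int, default=os.cpu_count(),
                        help="Number of parallel worker processes")
    parser.add_argument("--totals-out", default="ddrad_totals.tsv")
    parser.add_argument("--summary-out", default="ddrad_summary.tsv")
    parser.add_argument("--heatmap-out", default="ddrad_heatmap.png")
    return parser


args = build_parser().parse_args(
    ["enzymes.tsv", "combos.tsv", "genome.fasta", "--min", "200", "--max", "600"]
)
```

Both `--totals-out my_totals.tsv` and `--totals-out=my_totals.tsv` are accepted by `argparse`, which matches the two invocation styles used in this README.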
- Combo totals TSV (--totals-out). Columns: combo | total_count. Sorted by total_count descending, so the enzyme combination producing the most fragments, and thus the best candidate for ddRAD, appears first.
  combo total_count
  Combo3 12345
  Combo1 9876
  ...
- Per-chromosome summary TSV (--summary-out). Columns: combo | chromosome | count.
  combo chromosome count
  Combo3 chr1 2345
  Combo3 chr2 1987
  ...
  Combo1 chr1 1234
  ...
- Heatmap PNG (--heatmap-out). A matrix plot of fragment counts (count) with:
  - Rows = enzyme combinations, sorted by total_count descending
  - Columns = chromosomes/contigs in the FASTA
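The relationship between the two tables can be sketched as below, assuming each combo's total_count is the sum of its per-chromosome counts (the README does not state this explicitly). `combo_totals` is a hypothetical helper written with the standard library only; the same descending order is what the heatmap rows would follow.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def combo_totals(rows: Iterable[Tuple[str, str, int]]) -> List[Tuple[str, int]]:
    """Sum (combo, chromosome, count) rows per combo, largest total first."""
    totals: Dict[str, int] = defaultdict(int)
    for combo, _chromosome, count in rows:
        totals[combo] += count
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Illustrative rows only, shaped like the per-chromosome summary table.
summary_rows = [
    ("Combo1", "chr1", 1234),
    ("Combo3", "chr1", 2345),
    ("Combo3", "chr2", 1987),
]
ranked = combo_totals(summary_rows)
```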
python ddraptor.py enzymes.tsv combos.tsv genome.fasta \
--min 200 --max 600
Produces:
ddrad_totals.tsv
ddrad_summary.tsv
ddrad_heatmap.png
python ddraptor.py enzymes.tsv combos.tsv genome.fasta \
--min 150 --max 600 --processes 8 \
--totals-out=my_totals.tsv \
--summary-out=my_summary.tsv \
--heatmap-out=my_heatmap.png
- IUPAC & reverse strands: The script auto-expands ambiguous IUPAC motifs and searches both forward and reverse complements.
- Performance: Uses Aho–Corasick plus multiprocessing; scales linearly with CPU cores and number of contigs.
- Error handling: Malformed lines in TSVs are skipped with a warning to stderr.
- Future goals: Rust implementation (ddRustor)
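To make the IUPAC and reverse-strand note concrete, the sketch below expands an ambiguous motif into every concrete sequence it can match and adds the reverse complements. It uses the standard IUPAC nucleotide codes; the function names are hypothetical, and ddRaptor itself may build its pattern set differently before handing it to Aho–Corasick.

```python
from itertools import product
from typing import List, Set

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def expand_iupac(motif: str) -> List[str]:
    """Return every unambiguous sequence matched by an IUPAC motif."""
    choices = (IUPAC[base] for base in motif.upper())
    return ["".join(combo) for combo in product(*choices)]


def reverse_complement(seq: str) -> str:
    """Reverse-complement an unambiguous DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def search_patterns(motif: str) -> Set[str]:
    """Forward expansions plus their reverse complements, de-duplicated."""
    forward = expand_iupac(motif)
    return set(forward) | {reverse_complement(s) for s in forward}


expanded = expand_iupac("GTYRAC")
palindrome_patterns = search_patterns("GAATTC")
```

For a palindromic site such as GAATTC the reverse complement equals the forward sequence, so the pattern set stays at a single entry; non-palindromic motifs double in size.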
License: MIT
Author: Georgios Kousis Tsampazis
Contact: georgekousis6@gmail.com