Skip to content
/ sedef Public

Identification of segmental duplications in the genome

License

Notifications You must be signed in to change notification settings

vpc-ccg/sedef

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

🔴 ⚠️ SEDEF has been deprecated. Please use BISER (SEDEF's successor) instead. ⚠️ 🔴

SEDEF: Segmental Duplication Evaluation Framework

SEDEF is a tool for quick detection of segmental duplications in a genome.

Paper

SEDEF has been presented at ECCB 2018 (DOI 10.1093/bioinformatics/bty586). Preprint is available here. Get the final paper here.

Results

👨🎨 Human (hg38) 👨‍🎨 Human (hg19) 🐭 Mouse (mm8)
Final calls Final calls Final calls

The experiment pipeline from the paper is described in this Jupyter notebook.

How to compile

Simple! Do this:

git clone https://github.com/vpc-ccg/sedef
cd sedef
make -j release

By default, SEDEF uses Intel C++ compiler. If you are using g++, build with:

make -j release CXX=g++

If you are using Clang on macOS, compile as

brew install libomp
make -j release OPENMP="-Xpreprocessor -fopenmp" CXX=clang++

You need at least g++ 5.1.0 (C++14) to compile SEDEF. Clang should work fine as well.

SEDEF requires Boost libraries in order to compile. In case you installed Boost in a non-standard directory, you can still compile as follows:

CPATH={path_to_boost} make -j release

How to run

The genome assembly must be soft-masked (i.e. all common and tandem repeats should be converted to lower-case letters) and indexed. Suppose that our genome is hg19.fa (we use UCSC hg19 genome with 24 standard chromosomes that does not contain patches (unGl) or random strains (chrXX_random)).

Automatic transmission

Just go to sedef directory and run

./sedef.sh -o <output> -j <jobs> <genome> 

For example, to run hg19.fa on 80 cores type:

./sedef.sh -o sedef_hg19 -j 80 hg19.fa

You can add -f if sedef_hg19 already exists (it will overwrite the existing content though). The final results will be located in sedef_hg19/final.bed.

Please note that sedef.sh depends on Samtools and GNU Parallel. If you want to experiment with different parameters, run sedef help for parameter documentation.

Output

Output will be located in <out_dir>/final.bed.

The fields of BEDPE file are as follows:

First 6 fields are standard BEDPE fields describing the coordinates of SD mates:

  • chr1, start1 and end1
  • chr2, start2 and end2

Other fields are (in the order of appearance):

Field Description
name SD name
score Total alignment error
strand1 1st SD mate strand
strand2 2nd SD mate strand
max_len Length of longer mate
aln_len Alignment length (length with gaps)
cigar Empty string
comment Comment: currently shows mismatch base error (m) and gap base error (g)
indel_a Number of gap bases in the 1st mate
indel_b Number of gap bases in the 2nd mate
alnB Aligned base count (matches and mismatches without gaps)
matchB Match base count
mismatchB Mismatch base count
transitionsB Transition count (A <-> G and C <-> T)
transversions Transversion count (all mismatches that are not transitions)
fracMatch matchB / alnB
fracMatchIndel matchB / aln_len
jck Jaccard score: where w = mismatchB / alnB
k2K Kimura score: where p = transitionsB / alnB and q = transitionsB / alnB
aln_gaps Number of gaps in the alignment
uppercaseA Number of non-masked (uppercase) bases in the 1st mate
uppercaseB Number of non-masked (uppercase) bases in the 2nd mate
uppercaseMatches Non-masked match count
aln_matches Match base count
aln_mismatches Mismatch base count
aln_gaps Number of gaps in the alignment
aln_gap_bases Number of gap bases (indel_a + indel_b)
cigar CIGAR string of the SD mate alignment
filter_score (aln_gaps + aln_mismatches) / aln_len (should be ≥ 0.5)

All errors are expressed as ratios (0.0--1.0) of the alignment length unless otherwise noted.

Warning: as per WGAC, when calculating the similarity and error rates (fields score, fracMatch, fracMatchIndel and filter_score) SEDEF counts a gap as a single error (so the hypothetical alignment of A-----GC and AT-----C will have error 4 and NOT 8). This might lead to SDs with rather large gap contents. For more filtering, consult comment field that provides the percentage of match/mismatch and gap bases.

Manual transmission

First make sure to index the genome:

samtools faidx hg19.fa

Then run the sedef-search in parallel (in this example, we will use GNU parallel) to get the initial seeds:

mkdir -p out # For the output
mkdir -p out/log # For the logs

for i in `seq 1 22` X Y; do 
for j in `seq 1 22` X Y; do  
	SI=`awk '$1=="chr'$i'" {print $2}' hg19.fa.fai`; 
	SJ=`awk '$1=="chr'$j'" {print $2}' hg19.fa.fai`; 
	if [ "$SI" -le "$SJ" ] ; 
	then 
		for m in y n ; do
		[ "$m" == "y" ] && rc="-r" || rc="";
		echo "sedef search $rc hg19.fa chr$i chr$j >out/${i}_${j}_${m}.bed 2>out/log/${i}_${j}_${m}.log"
		done; 
	fi
done
done | time parallel --will-cite -j 80 --eta

# Now make sure that all runs completed successfully 
grep Total out/log/*.log | wc -l
# You should see here 600 (or n(n+1) if you have n chromosomes in your file)

# Get the single-core running time
grep Wall out/log/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# Get the maximum meory usage as well
grep Memory out/log/*.log | awk '{if($3>m)m=$3}END{print m}'

Then use sedef-align to bucket the files for the optimal parallel alignment. Afterwards, start the whole alignment:

# First bucket the reads into 1000 bins
mkdir -p out/bins
mkdir -p out/log/bins
time sedef align bucket -n 1000 out out/bins

# Now run the alignment
for j in out/bins/bucket_???? ; do
	k=$(basename $j);
	echo "sedef align generate -k 11 hg19.fa $j >${j}.bed 2>out/log/bins/${k}.log"
done | time parallel --will-cite -j 80 --eta

# Make sure that all runs finished nicely
grep Finished out/log/bins/*.log | wc -l
# Should be number of bins (in our case, 1000)

# Get again the total running time
grep Wall out/log/bins/*.log | tr -d '(' | awk '{s+=$4}END{print s}'

# And the memory
grep Memory out/log/bins/*.log | awk '{if($3>m)m=$3}END{print m}'

Finally, run sedef-stats to produce the final output:

# Concatenate the files 
cat out/*.bed > out.bed # seed SDs
cat out/bins/bucket_???? > out.init.bed # potential SD regions
cat out/bins/*.bed | sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
	uniq > out.final.bed # final chains

# Now get the final calls
sedef stats generate hg19.fa out.final.bed |\
		sort -k1,1V -k9,9r -k10,10r -k4,4V -k2,2n -k3,3n -k5,5n -k6,6n |\
		uniq > out.hg19.bed

Acknowledgments

SEDEF uses {fmt}, argh and the modified version of Heng Li's ksw2.

Support

Questions, bugs? Open a GitHub issue or drop me an e-mail at inumanag at mit dot edu.