local-rearrangements

Here are methods to find local rearrangements in DNA reads relative to a genome sequence. "Local" means intra-chromosomal rearrangements that are encompassed by one read.

An example is shown on the right. The top panel shows an alignment (diagonal line) between an 8.1 kb chunk of human chromosome 5, and chimp (panTro5) chromosome 5. The lower panels show alignments of three human DNA reads. Each read base is aligned to at most one genome base (this is guaranteed by last-split).

The DNA reads come randomly from either DNA strand: here, readA is from the opposite strand to readB and readC.

readC's alignment seems to have missing parts: this might be due to sequencing errors causing the aligner to drop those parts. Perhaps they could be rescued by re-aligning in a more slow-and-sensitive mode.

The vertical stripes show human genome annotations: green = exon, pink = forward-strand transposon, blue = reverse-strand transposon, purple = low-complexity sequence.

Limitations

There are no methods here to characterize rearrangements automatically. The approach is to make pictures of them, which a human has to interpret.

Simple usage

First, align your sequences as described here. Then, find rearrangements:

local-rearrangements myseq.maf > out.maf

This also works with myseq.maf.gz. The output begins with lines like this:

# Rearrangement: chr1 16088760 16089609 2 read217 read696
# Rearrangement: chr4 32069345 32072990 1 read101 read333 read777

Each line has:

The start and end coordinates of a rearranged region, in BED3 format, e.g. chr1 16088760 16089609.
The number of DNA reads that have a "complex" local rearrangement in this region (anything except tandem duplication. This can be useful to know, because tandem duplications are overwhelmingly common.)
The names of DNA reads that have a local rearrangement in this region.

Next, the output has the alignments of the rearranged reads.

Finally, rearrangement counts are shown on the screen:

# Rearrangements: 3187
# Complex query sequences:  1   Rearrangements: 2471
# Complex query sequences:  2   Rearrangements: 146
# Complex query sequences:  3   Rearrangements: 113
...

This means there are 3187 rearranged regions, of which 2471 have one DNA read with complex rearrangement, 146 have two, etc.

Recommended usage

LAST's rearranged alignments work best when aligning a derived query sequence to an ancestral reference sequence. This is because last-split seeks a unique best alignment for each part of the query: so it allows for arbitrary deletions and duplications in the query relative to the reference, but not vice-versa.

Thus, it is recommended to use an outgroup to check the ancestral condition of each rearrangement. If we are aligning human DNA reads to a human genome, then a suitable outgroup would be chimp or gorilla.

Specifically, we need alignments between outgroup sequences and our reference genome. The outgroup sequences can be either assembled genomes or unassembled reads. If any outgroup sequence has an un-rearranged alignment covering a rearrangement, then the reference genome's arrangement is ancestral.

Alignments of chimp and gorilla to human genome hg38 are available here. We can simply concatenate them:

cat hg38-panTro5.tab hg38-gorGor5.tab > apes.tab

And then find rearrangements where the reference genome is ancestral:

local-rearrangements --outgroup apes.tab myseq.maf > out.maf

Re-aligning

Rearrangements are likely to be enriched for alignment errors. So it's wise to re-align potentially-rearranged reads more carefully.

First, let's extract all reads, whether rearranged or not, that are near rearranged regions (so these reads might be rearranged if we align them more carefully):

local-rearrangements --rearrangements out.maf --min-complex=2 myseq.maf > out2.maf

Here, we only consider rearranged regions that have at least two reads with "complex" rearrangements (anything except tandem duplications). This is because: in some datasets, many reads are rearranged due to (non-alignment) artifacts; and tandem duplications are overwhelmingly common (and sometimes overlap artifactual rearrangements by chance). --min-complex=2 omits a lot of artifactual and boring cases, so we can afford to re-align more slowly-and-sensitively.

Next, get the FASTA sequences of these reads:

fasta-from-maf out2.maf myseq.fa > redo.fa

Now, re-align these reads more carefully:

lastal -P8 -m20 -p myseq.par mydb redo.fa | last-split -m1 > redo.maf

It's recommended to use a non-repeat masked mydb here. You can expect these alignments to be pretty accurate, but not 100% perfect (e.g. small rearranged fragments might be missing). If you're patient, you can be even more slow-and-sensitive:

lastal -P8 -m50 -d90 -p myseq.par mydb redo.fa | last-split -m1 > redo.maf

Finally, get rearrangements from our careful alignments:

local-rearrangements --outgroup apes.tab redo.maf > redone.maf
local-rearrangements --rearrangements redone.maf redo.maf > redone2.maf

local-rearrangement-pics

This draws pictures of the rearrangements:

local-rearrangement-pics redone-pics redone.maf

This draws fancier pictures with genes and repeats:

local-rearrangement-pics -g refGene.txt -R rmsk.txt redone-pics redone.maf

This shows all reads (including non-rearranged ones) that overlap the rearranged regions:

local-rearrangement-pics -g refGene.txt -R rmsk.txt -a redone2-pics redone2.maf

Showing all reads is more informative, but it can be more cluttered and confusing.