How to visualize reads containing expansions #20

Open
gspirito opened this issue Dec 20, 2023 · 5 comments

@gspirito

Hello, here's my issue:

I ran tandem-genotypes on Oxford Nanopore long reads at a RepeatMasker-annotated locus and obtained this result:
chr11 70487135 70487173 TGC SHANK2 coding 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,2,2,3 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,2,3

Therefore there should be 13 reads with additional copies of the sequence 'TGC' compared to the reference genome.
However, if I extract all reads mapping to the locus 'chr11:70487135-70487173' from the MAF file and convert it to BAM (with LAST), I cannot see any insertion with IGV, in any read mapped to that locus.
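
The count of 13 can be checked by tallying the nonzero entries in the last two columns of the output line. The following is a minimal sketch, assuming that columns 7 and 8 hold the per-read copy-number changes for forward-strand and reverse-strand reads. The function name `count_changed_reads` is hypothetical, and most of the zero entries are trimmed from the example line for brevity.

```python
def count_changed_reads(output_line):
    """Return (forward, reverse) counts of reads with a nonzero copy-number change."""
    fields = output_line.split()
    counts = []
    # Assumption: columns 7 and 8 are comma-separated per-read changes.
    for column in fields[6:8]:
        # Ignore anything that is not an integer.
        changes = [int(x) for x in column.split(",") if x.lstrip("-").isdigit()]
        counts.append(sum(1 for c in changes if c != 0))
    return tuple(counts)

# Shortened version of the line in the post: most zeros are trimmed,
# the nonzero entries are unchanged.
line = ("chr11 70487135 70487173 TGC SHANK2 coding "
        "0,0,0,1,1,2,2,2,3 0,0,1,1,1,1,1,2,3")

forward, reverse = count_changed_reads(line)
print(forward, reverse, forward + reverse)  # 6 7 13
```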

How can I visualize the STR expansions? Is there a way to know which specific reads support the expansions?

Thanks in advance,

Giovanni

@mcfrith
Owner

mcfrith commented Dec 20, 2023

Many thanks for your interest in tandem-genotypes. What you're doing seems correct: I don't know why it doesn't work. Maybe if you could share your intermediate files...

To know which reads support the expansions, you can use tandem-genotypes option -v.

@gspirito
Author

gspirito commented Jan 8, 2024

Thank you very much for the answer. I attach the locus I used for the analysis, the result I got from tandem-genotypes, and the MAF file containing the reads mapping to that locus:

SHANK2_locus_rpmsk.txt
SAMPLE_tg_SHANK2.txt
SAMPLE_MAF.txt

@mcfrith
Owner

mcfrith commented Jan 8, 2024

Thanks for this interesting example!
In short, tandem-genotypes is "working as designed", but the design isn't looking good in this case.

It's faithfully following the "tandem-genotypes method" described here: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1667-6

This dotplot shows the alignment (red) of one read that supposedly has 3 additional copies of TGC:
[Image: dotplot "zoomin"]

To the left of the repeat (purple), there's an insertion and deletion almost adjacent to each other. tandem-genotypes is counting the insertion as a repeat expansion. It counts insertions that are slightly outside the repeat: we found it necessary to do that in general, because the precise boundaries of repeats can be fuzzy and ambiguous (for non-exact repeats).

You could use tandem-genotypes option -n20 (to only count insertions <= 20 bp outside the repeat, instead of 60).
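
To illustrate the effect of that distance limit, here is a rough sketch of a "near the repeat" test. It is not tandem-genotypes' actual code, and the exact rule the tool applies may differ. The function `is_near_repeat` and the insertion position are hypothetical; only the repeat coordinates and the limits 60 and 20 come from this thread.

```python
def is_near_repeat(ins_pos, repeat_start, repeat_end, max_dist):
    """True if the insertion lies inside the repeat or at most max_dist bp outside it."""
    return repeat_start - max_dist <= ins_pos <= repeat_end + max_dist

repeat_start, repeat_end = 70487135, 70487173  # repeat coordinates from the post
ins_pos = 70487100  # hypothetical insertion, 35 bp to the left of the repeat

print(is_near_repeat(ins_pos, repeat_start, repeat_end, 60))  # True
print(is_near_repeat(ins_pos, repeat_start, repeat_end, 20))  # False
```

With a limit of 60 the hypothetical insertion would be counted, and with 20 it would not.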

Maybe tandem-genotypes should be changed like this: when an insertion and deletion are so close to each other, merge them into one "in-del".
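
One way to picture that idea is the sketch below, which is only an illustration and not an implemented feature of tandem-genotypes. The function `merge_close_indels`, the merge distance, and the event positions and sizes are all hypothetical. An insertion is a positive length change and a deletion is a negative one.

```python
def merge_close_indels(events, merge_dist):
    """Merge events that start within merge_dist bp of a group's first event.

    events: list of (reference_position, length_change), sorted by position.
    length_change is positive for an insertion, negative for a deletion.
    """
    merged = []
    for pos, change in events:
        if merged and pos - merged[-1][0] <= merge_dist:
            # Close to the current group: add up the net length change.
            prev_pos, prev_change = merged[-1]
            merged[-1] = (prev_pos, prev_change + change)
        else:
            merged.append((pos, change))
    return merged

# Hypothetical example: a 9 bp insertion next to a 9 bp deletion.
events = [(70487110, 9), (70487114, -9)]
print(merge_close_indels(events, 10))  # [(70487110, 0)]
```

In this made-up case the merged "in-del" has a net length change of 0, so it would no longer look like 3 extra copies of a 3 bp unit.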

@gspirito
Author

Hi, thank you for the response. Could you please provide the command you used to make the plot you showed? Thank you very much.

@mcfrith
Owner

mcfrith commented Jun 3, 2024

Amazingly, it's still in my shell's history:
grep -B3 6f8e3f3a SAMPLE_MAF.txt | last-dotplot -a SHANK2_locus_rpmsk.txt -1 chr11:70487085-70487223 - myfig.png
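
The grep step picks one read's alignment out of the MAF file: `-B3` prints the 3 lines before each line matching the read name. A rough Python equivalent is sketched below, assuming alignment blocks are separated by blank lines and that sequences are on lines starting with "s". The function `maf_blocks_with_read` is hypothetical, and the MAF content, including the read names, is made up for illustration.

```python
import re

def maf_blocks_with_read(maf_text, read_id):
    """Return the MAF alignment blocks that have an "s" line mentioning read_id."""
    # Assumption: alignment blocks are separated by blank lines.
    blocks = re.split(r"\n\s*\n", maf_text.strip())
    hits = []
    for block in blocks:
        s_lines = [ln for ln in block.splitlines() if ln.startswith("s ")]
        if any(read_id in ln for ln in s_lines):
            hits.append(block)
    return hits

# Made-up MAF text with two alignment blocks.
SAMPLE_MAF = """\
a score=100
s chr11 70487100 10 + 135086622 ACGTACGTAC
s 6f8e3f3a-made-up-read 5 10 + 5000 ACGTACGTAC

a score=80
s chr11 70487150 8 + 135086622 TGCTGCTG
s 0a1b2c3d-made-up-read 40 8 + 6000 TGCTGCTG
"""

hits = maf_blocks_with_read(SAMPLE_MAF, "6f8e3f3a")
print(len(hits))  # 1
```

Unlike `grep -B3`, this keeps every line of a matching block, so it does not depend on how many lines precede the read's line.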
