Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Options to favor detection of several small deletions instead of single large ones #8

Open
elcortegano opened this issue Jun 1, 2021 · 6 comments

Comments

@elcortegano
Copy link

unimap does a great job aligning complex regions (e.g. satellites) even using reads. We are using it to detect some variants at telomeric regions, using PacBio HiFi sequencing data.

We have evidence from one of these regions that there should be 2 separate deletions, corresponding to two separate groups of monomers from a satellite, each of them of different size. However, unimap finds one large deletion instead, close together to several variants that suggest mismapping. I assume this behavior comes from the default settings, that may favour large deletions instead of many small deletions. The program was run with unimap -a -x asm5 -x hifi --cs. I attach snapshot below for this region.

telomere_5_unimap2

I have been trying to tune settings to penalize nucleotide mismatchings and favour more than 1 deletion if convenient. However, I find some problems including:

  • -B 500 -O 1 -E 2: increased mismatch penalty and lowered cost for gaps. This results in a "Segmentation fault" error.
  • -B 500: Does not seem to solve the problem, the alignment remains the same.
  • -O 1 -E 2: This makes some improvements, and for some reads we observe that mismatches disappear in favour of deletions (snapshot below), but not for the majority of reads. Cannot make progress from here however, since using "-B > 6" will cause the segfault, lower values of "-B" won't change anything, and -O and -E are near their minimum values (cannot take values of zero).

telomere_5_unimap3

Wonder if there are other options that are worth exploring here. Any hints on how to deal with this situation? Thank you

@lh3
Copy link
Owner

lh3 commented Jun 2, 2021

Minimap2/unimap won't work with a very large -B because internally every score must be within [-128,127]. I should have added a test for that. Nonetheless, the failure of -O1 -E2 -B7 is caused by something else that I have overlooked. Note that under this scoring, the optimal alignment will have no mismatches. You can turn any mismatch to two gaps for a better score:

ATCATCA     ATC-ATCA
ATCtTCA     ATCt-TCA

This is probably violating the SW formulation unimap/minimap2 is using. You need to make sure ({-O}+{-E})*2>{-B} at least.

In your case, you can try -A1 -B4 -O6 -E2, or -A1 -B4 -O4 -E1.

@elcortegano
Copy link
Author

Unfortunately, these options does not solve the issue, and still the majority of reads show the large deletion + mismatches.

From what I understand, it may be possible to try something like -B100 -O1 -E50, only that this would need to be in a future release to allow B > 7? We are aware that this might return unrealistic alignments for most of the genome, but for this particular region allowing multiple deletions could be more appropriate.

Thank you for the hints and feedback,

@lh3
Copy link
Owner

lh3 commented Jun 2, 2021

The requirement is ({-O}+{-E})*2>{-B}; otherwise the alignment largely doesn't make sense. B<=7 is not a requirement. asm5 uses -A1 -B19 -O39,81 -E2,1. If you have the sequences, I can give a try.

@elcortegano
Copy link
Author

Thank you for the clarifications, and for offering giving it a look. I have been trying different combinations of parameters following the above rule, but with no success.

Shall I send you the files to an email address? apparently I cannot attach them here.

@ilikvlad
Copy link

Hello,

@elcortegano I would like to ask, how did you obtain such a nice view of the alignment? I guess using some software (?) and fetching it bam/sam format coming out of the unimap? Thank you for your response.

@elcortegano
Copy link
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants