
Further speedup in iVar Trim primer finding #162

Open
niemasd opened this issue Apr 5, 2023 · 3 comments

niemasd commented Apr 5, 2023

Is your feature request related to a problem? Please describe.
I noticed that iVar Trim was recently optimized quite a bit (awesome stuff!), and, if I'm understanding the release descriptions correctly, one of the optimizations was a faster sorting implementation for the primers. If I recall correctly, the primers need to be sorted so that a binary search can find the end of the primer spanning a given position (for primer trimming). However, you can actually avoid binary search entirely, which could dramatically speed things up.

Describe the solution you'd like
Because viral genomes are so short, at the beginning of iVar Trim's execution, you can precompute an array where index i contains the end position of the primer covering position i of the reference genome. Then, while trimming a given read that mapped to position i, you can perform a constant-time lookup (rather than binary search) to find the primer end position. A general summary of the preprocessing algorithm is as follows:

primer_end = array of length genome size
primers = sorted list of (start, end) tuples
primers_ind = 0
curr_primers = empty queue
for i from 0 to genome size - 1:
    while curr_primers is not empty and curr_primers.front()[1] < i:
        curr_primers.dequeue()
    while primers_ind < number of primers and primers[primers_ind][0] <= i:
        curr_primers.enqueue(primers[primers_ind])
        primers_ind = primers_ind + 1
    if curr_primers is not empty:
        primer_end[i] = curr_primers.back()[1]
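For concreteness, here's a runnable Python sketch of the sweep described above. The function name is mine, and where primers overlap I take the furthest-reaching end among the active primers (an assumption about which primer should win; iVar's actual behavior may differ):

```python
from collections import deque

def build_primer_end(primers, genome_size):
    """For each reference position i, precompute the end position of a
    primer covering i (-1 if no primer covers i), via one left-to-right sweep.

    primers: iterable of (start, end) tuples, 0-based inclusive coordinates.
    """
    primer_end = [-1] * genome_size
    primers = sorted(primers)          # sort (start, end) tuples by start
    ind = 0
    active = deque()                   # primers whose start <= current position
    for i in range(genome_size):
        # retire primers from the front that end before position i
        while active and active[0][1] < i:
            active.popleft()
        # admit primers that start at or before position i
        while ind < len(primers) and primers[ind][0] <= i:
            active.append(primers[ind])
            ind += 1
        if active:
            # when primers overlap, pick the furthest-reaching end
            primer_end[i] = max(end for _, end in active)
    return primer_end
```

The sweep is O(genome size + number of primers) up to the cost of the `max` over active primers, which is tiny in practice since few primers overlap any one position.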

This slightly more involved approach handles overlapping primers; if you can assume primers never overlap, it simplifies to:

primer_end = array of length genome size
primers = sorted list of (start, end) tuples
for start, end in primers:
    for i from start to end:
        primer_end[i] = end
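Under the non-overlapping assumption, the fill above is only a few lines of Python (the function name and the -1 sentinel are my choices):

```python
def build_primer_end_simple(primers, genome_size):
    """Non-overlapping primers: mark every covered position with its
    primer's end position; -1 marks positions no primer covers.

    primers: iterable of (start, end) tuples, 0-based inclusive coordinates.
    """
    primer_end = [-1] * genome_size
    for start, end in primers:
        for i in range(start, end + 1):
            primer_end[i] = end
    return primer_end
```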

Then, in your code, if a read maps to position i, you primer trim up to the base that mapped to position primer_end[i].

Describe alternatives you've considered
N/A

Additional context
@kga1978 told me to make a GitHub Issue; happy to chat with you folks about this approach if you're interested 😄


cmaceves commented Apr 6, 2023

Hi @niemasd! Thank you for the suggestion; it seems pretty reasonable for future development.

@cmaceves cmaceves self-assigned this Apr 6, 2023
@cmaceves cmaceves added this to the v1.4.3 milestone Apr 6, 2023
@cmaceves cmaceves added the enhancement New feature or request label Apr 6, 2023

gkarthik commented Apr 6, 2023

Hey @niemasd, this will certainly reduce the lookup time, thanks! A few additional things Chrissy and I discussed yesterday:

  • Using a hashmap instead of an array since the array will be pretty sparse.
  • We'll need to preprocess this for forward and reverse reads (if paired).
  • We'll need to take the offset into account while creating the hashmap.
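One possible shape for the hashmap variant sketched in these bullets. This is purely illustrative: the key choices, the `max_offset` parameter, and the forward/reverse split are my guesses at what the bullets mean, not iVar's implementation:

```python
def build_primer_end_maps(fwd_primers, rev_primers, max_offset=0):
    """Hashmap sketch: separate lookup tables for forward and reverse reads.

    Forward reads are trimmed from their 5' start, so the forward map is
    keyed near primer *start* positions and returns the primer end; reverse
    reads are trimmed from their 3' end, so the reverse map is keyed near
    primer *end* positions and returns the primer start. max_offset widens
    each key by +/- max_offset so a read mapping near (but not exactly on)
    a primer boundary still hits (an assumed reading of the offset bullet).
    Coordinates are (start, end) tuples, 0-based inclusive.
    """
    fwd, rev = {}, {}
    for start, end in fwd_primers:
        for pos in range(start - max_offset, start + max_offset + 1):
            fwd[pos] = end      # trim forward read through this primer end
    for start, end in rev_primers:
        for pos in range(end - max_offset, end + max_offset + 1):
            rev[pos] = start    # trim reverse read back to this primer start
    return fwd, rev
```

As discussed below, a dict and a preallocated array should perform similarly at viral-genome scale, so this is a design choice rather than a performance one.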


niemasd commented Apr 7, 2023

Agreed re: the 2nd and 3rd bullets. Re: the first bullet, I think the memory overhead of a hash map (maintaining a low load factor for constant average-case lookups, plus the buckets themselves) will end up similar to an array's, and you might lose the benefits of memory locality. Either way, a hash map and a basic preallocated array should both be extremely fast and low-memory here, so whichever is easier to implement is probably the better choice; the difference should be negligible compared to the rest of the tool's runtime and memory.

But yeah, glad to hear you'll look into it! Feel free to ping me if you ever want to chat about it

@cmaceves cmaceves mentioned this issue Jun 27, 2023