Decentralise duplicate detection #225
Open
michaelweiser wants to merge 1 commit into scVENUS:master from michaelweiser:in-flight-opt
Conversation
michaelweiser force-pushed the in-flight-opt branch from cc30b29 to bb5b7cc on March 21, 2023 14:53
michaelweiser force-pushed the in-flight-opt branch from bb5b7cc to fc3d077 on May 15, 2023 10:42
Previously we detected duplicate samples upon job submission. This was a very intricate process covering two stages of detection: local duplicates and other Peekaboo instances in a cluster analysing the same sample concurrently. Apart from being hard to understand and maintain, this was inefficient for analyses which didn't involve any expensive operations such as offloading a job to Cuckoo. It degraded into a downright throughput bottleneck for analyses of large numbers (> 10000) of non-identical samples which are eventually ignored.

This change moves duplicate handling out of the queueing into a new duplicate toolbox module. Duplicate detection is moved into individual rules. Resubmission of withheld samples is done in the worker at the end of ruleset processing, after the processing result is saved to the database.

Handling of local and cluster duplicates is strictly separated. While that does not make the actual code much easier to understand and maintain, the underlying concepts are at least somewhat untangled.

The cluster duplicate handler stays mostly the same, primarily consisting of a coroutine which periodically tries to lock samples from its backlog and then submit them to the local queue.

The local duplicate handler is now a distinct module very similar to the cluster duplicate handler but without any repeated polling. Instead, potential duplicates are resubmitted once a sample finishes processing.

The cluster duplicate handler no longer interacts directly with the local duplicate handler by putting samples from its own backlog into the latter's backlog. Instead, cluster duplicates are submitted to the local queue in bulk, and the local duplicate handler is expected to either never come into play again (because of the known rule and its cached previous analysis result) or to detect the local duplicates and automatically put all but one of them into its own backlog.
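The withhold-and-resubmit mechanism described above can be sketched as follows. This is a hypothetical illustration, not Peekaboo's actual module: the `Sample` record, the identity attribute, and the handler's method names are all assumptions made for the example.

```python
from collections import defaultdict, namedtuple
from queue import SimpleQueue

# Hypothetical sample record; Peekaboo's real Sample class differs.
Sample = namedtuple("Sample", ["identity", "name"])


class LocalDuplicateHandler:
    """Withhold samples whose identical sibling is already in flight."""

    def __init__(self, queue):
        self.queue = queue                 # the local job queue
        self.in_flight = set()             # identities currently processing
        self.backlog = defaultdict(list)   # identity -> withheld siblings

    def submit(self, sample):
        """Queue the sample or withhold it as a local duplicate."""
        if sample.identity in self.in_flight:
            self.backlog[sample.identity].append(sample)
            return False
        self.in_flight.add(sample.identity)
        self.queue.put_nowait(sample)
        return True

    def done(self, sample):
        """Worker hook: runs after the result is saved to the database.

        Resubmits withheld siblings; the first one is queued again and
        any further ones are withheld automatically - no polling needed."""
        self.in_flight.discard(sample.identity)
        for sibling in self.backlog.pop(sample.identity, []):
            self.submit(sibling)
```

Because `done()` fires exactly when a sample finishes processing, the local handler gets by without the periodic polling the cluster handler still needs.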
This new design highlighted an additional point for optimisation: if a sample can be locked by the cluster duplicate handler (i.e. is not currently being processed by another instance) but we find siblings of it in our own cluster duplicate backlog, then this sample was evidently a cluster duplicate at an earlier point in time, and withheld samples are waiting for the next polling run to be resubmitted. In this case we short-circuit that process: the cluster duplicate handler submits them to the job queue immediately.
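The short-circuit can be sketched as below. Again a hypothetical illustration: the shared lock set, the backlog layout and the function name are assumptions for the example, not Peekaboo's actual interfaces.

```python
from collections import namedtuple

# Hypothetical sample record; Peekaboo's real Sample class differs.
Sample = namedtuple("Sample", ["identity", "name"])


def try_claim(sample, locks, cluster_backlog, job_queue):
    """Try to lock a sample cluster-wide and submit it to the local queue.

    If the lock is taken, another instance is analysing the sample and we
    withhold it in our cluster duplicate backlog. If the lock succeeds but
    siblings are already waiting in that backlog, the sample used to be a
    cluster duplicate: instead of leaving the siblings for the next
    polling run, submit the whole group immediately."""
    if sample.identity in locks:
        cluster_backlog.setdefault(sample.identity, []).append(sample)
        return False
    locks.add(sample.identity)
    # short-circuit: flush waiting siblings together with this sample
    for member in [sample] + cluster_backlog.pop(sample.identity, []):
        job_queue.append(member)
    return True
```

The known rule's cached result then makes the flushed siblings cheap to process, so submitting them in bulk does not reintroduce the old bottleneck.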
michaelweiser force-pushed the in-flight-opt branch from fc3d077 to 346f9bc on May 23, 2023 12:19
With #219 merged this should be good to go as well.
BTW: It appears the container CI pipeline error would be fixed by scVENUS/PeekabooAV-Installer#79, as alpine edge/testing (3.18.something) has apparently moved too far away from 3.15, so dependencies no longer match.
This depends on #219 to get back to an uncached Known rule.
Therefore the first three commits here are exactly the same.