New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reactivate appx DBSCAN impl with new disjoint set dependency #304
base: master
Are you sure you want to change the base?
Conversation
ae130d3
to
8091015
Compare
Any chance you're willing to run/post some benchmarks? I believe part of the rationale for removing Appx. DBSCAN, beyond the dependency issue, was that it didn't really add any new capability since the implementation itself was actually slower than just standard DBSCAN, leaving little practical reason why someone would choose to use it. @YuhanLiin, any thoughts? |
So, I benchmarked my disjoint set implementation itself, as I wanted it to be not slower than existing alternatives, and I wanted to choose implementation details to optimize for speed. For that, I looked at how fast Kruskal's algorithm runs on an example graph with 1 million vertices and ~8.8 million edges, already sorted by weight, with measures including creating the disjoint set data structure and allocating memory for the result Vec on demand. The imo most fair comparison to the partition crate is to use With these restrictions, on my local AMD Ryzen 5 3600, I get the following results:
That being said, of course this is a very specific benchmark, and I assume Appx. DBSCAN is a very different algorithm, so this doesn't necessarily mean anything for the performance of that. I don't know anything about what Appx. DBSCAN even is, to be honest (I just drop-in replaced the old dependency with a new one), but my gut feeling would be that the performance of the disjoint set data structure doesn't even matter that much here, as there is so much potentially slow other stuff happening. But I really don't know and probably someone with more knowledge about clusting algorithms should do benchmarks for that, if you are interested in that. |
For a quick comparison can you run the normal DBSCAN and Approx DBSCAN benchmarks and post the results? If Approx is slower than we should just get rid of it. |
Codecov ReportPatch coverage:
📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more Additional details and impacted files@@ Coverage Diff @@
## master #304 +/- ##
==========================================
+ Coverage 37.08% 38.50% +1.41%
==========================================
Files 94 94
Lines 6196 6218 +22
==========================================
+ Hits 2298 2394 +96
+ Misses 3898 3824 -74
... and 50 files with indirect coverage changes Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report in Codecov by Sentry. |
Hi everyone!
I saw the issues with the partitions dependency described in #295 and #299, and finally the removal of the appx DBSCAN in #298.
Coincidentally, I recently was looking for a disjoint set crate to use for myself, and when I wasn't satisfied with what I found on crates.io (all existing ones seem to be not maintained and/or not documented/not tested), I decided to publish my own: https://crates.io/crates/disjoint.
So it turned out there were just some small additional features necessary in my crate to enable it as a possible replacement for partitions for you. I implemented these now, and therefore can now provide this pull request to reactivate your appx DBSCAN.
I split the implementation in two commits:
If you just look at the second commit, you can see that it's really almost a drop-in replacement. Only a few lines change due to the slightly different interface.