Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reactivate appx DBSCAN impl with new disjoint set dependency #304

Open
wants to merge 3 commits into
base: master
Choose a base branch
from

Conversation

jogru0
Copy link

@jogru0 jogru0 commented Apr 19, 2023

Hi everyone!

I saw the issues with the partitions dependency described in #295 and #299, and finally the removal of the appx DBSCAN in #298.

Coincidentally, I recently was looking for a disjoint set crate to use for myself, and when I wasn't satisfied with what I found on crates.io (all existing ones seem to be not maintained and/or not documented/not tested), I decided to publish my own: https://crates.io/crates/disjoint.

So it turned out there were just some small additional features necessary in my crate to enable it as a possible replacement for partitions for you. I implemented these now, and therefore can now provide this pull request to reactivate your appx DBSCAN.

I split the implementation in two commits:

  1. Reverted Remove appx DBSCAN impl and replace with type aliases #298. This just reintroduces the appx DBSCAN impl without any further changes.
  2. Replaced dependencies: partitions -> disjoint.

If you just look at the second commit, you can see that it's really almost a drop-in replacement. Only a few lines change due to the slightly different interface.

@jogru0 jogru0 changed the title Reactivate apppx DBSCAN impl with new disjoint set dependency Reactivate appx DBSCAN impl with new disjoint set dependency Apr 23, 2023
@jogru0 jogru0 force-pushed the from_partitons_to_disjoint branch from ae130d3 to 8091015 Compare April 23, 2023 18:58
@quietlychris
Copy link
Member

Any chance you're willing to run/post some benchmarks? I believe part of the rationale for removing Appx. DBSCAN, beyond the dependency issue, was that it didn't really add any new capability since the implementation itself was actually slower than just standard DBSCAN, leaving little practical reason why someone would choose to use it. @YuhanLiin, any thoughts?

@jogru0
Copy link
Author

jogru0 commented Apr 25, 2023

So, I benchmarked my disjoint set implementation itself, as I wanted it to be not slower than existing alternatives, and I wanted to choose implementation details to optimize for speed.

For that, I looked at how fast Kruskal's algorithm runs on an example graph with 1 million vertices and ~8.8 million edges, already sorted by weight, with measures including creating the disjoint set data structure and allocating memory for the result Vec on demand.

The imo most fair comparison to the partition crate is to use DisjointSetVec<()> for that, even though you could use DisjointSet directly (which turns out to not even be faster as Rust handles Vec<T> with size_of<T>() == 0 very efficiently), and to not use the Quick-Union variant of the algorithm, as that's not possible with partition.

With these restrictions, on my local AMD Ryzen 5 3600, I get the following results:

  • partition: 290ms +/- 10ms
  • disjoint: 57ms +/- 1ms

That being said, of course this is a very specific benchmark, and I assume Appx. DBSCAN is a very different algorithm, so this doesn't necessarily mean anything for the performance of that. I don't know anything about what Appx. DBSCAN even is, to be honest (I just drop-in replaced the old dependency with a new one), but my gut feeling would be that the performance of the disjoint set data structure doesn't even matter that much here, as there is so much potentially slow other stuff happening. But I really don't know and probably someone with more knowledge about clusting algorithms should do benchmarks for that, if you are interested in that.

@YuhanLiin
Copy link
Collaborator

For a quick comparison can you run the normal DBSCAN and Approx DBSCAN benchmarks and post the results? If Approx is slower than we should just get rid of it.

@codecov-commenter
Copy link

Codecov Report

Patch coverage: 72.72% and project coverage change: +1.41 🎉

Comparison is base (4adece5) 37.08% compared to head (8091015) 38.50%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #304      +/-   ##
==========================================
+ Coverage   37.08%   38.50%   +1.41%     
==========================================
  Files          94       94              
  Lines        6196     6218      +22     
==========================================
+ Hits         2298     2394      +96     
+ Misses       3898     3824      -74     
Impacted Files Coverage Δ
...infa-clustering/src/appx_dbscan/cells_grid/cell.rs 45.90% <ø> (+45.90%) ⬆️
...linfa-clustering/src/appx_dbscan/cells_grid/mod.rs 55.35% <60.00%> (+55.35%) ⬆️
...linfa-clustering/src/appx_dbscan/clustering/mod.rs 48.48% <83.33%> (+48.48%) ⬆️

... and 50 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants