Reactivate appx DBSCAN impl with new disjoint set dependency #304

jogru0 · 2023-04-19T07:56:26Z

Hi everyone!

I saw the issues with the partitions dependency described in #295 and #299, and finally the removal of the appx DBSCAN in #298.

Coincidentally, I recently was looking for a disjoint set crate to use for myself, and when I wasn't satisfied with what I found on crates.io (all existing ones seem to be not maintained and/or not documented/not tested), I decided to publish my own: https://crates.io/crates/disjoint.

So it turned out there were just some small additional features necessary in my crate to enable it as a possible replacement for partitions for you. I implemented these now, and therefore can now provide this pull request to reactivate your appx DBSCAN.

I split the implementation in two commits:

Reverted Remove appx DBSCAN impl and replace with type aliases #298. This just reintroduces the appx DBSCAN impl without any further changes.
Replaced dependencies: partitions -> disjoint.

If you just look at the second commit, you can see that it's really almost a drop-in replacement. Only a few lines change due to the slightly different interface.

…liases)

quietlychris · 2023-04-24T12:41:26Z

Any chance you're willing to run/post some benchmarks? I believe part of the rationale for removing Appx. DBSCAN, beyond the dependency issue, was that it didn't really add any new capability since the implementation itself was actually slower than just standard DBSCAN, leaving little practical reason why someone would choose to use it. @YuhanLiin, any thoughts?

jogru0 · 2023-04-25T07:53:25Z

So, I benchmarked my disjoint set implementation itself, as I wanted it to be not slower than existing alternatives, and I wanted to choose implementation details to optimize for speed.

For that, I looked at how fast Kruskal's algorithm runs on an example graph with 1 million vertices and ~8.8 million edges, already sorted by weight, with measures including creating the disjoint set data structure and allocating memory for the result Vec on demand.

The imo most fair comparison to the partition crate is to use DisjointSetVec<()> for that, even though you could use DisjointSet directly (which turns out to not even be faster as Rust handles Vec<T> with size_of<T>() == 0 very efficiently), and to not use the Quick-Union variant of the algorithm, as that's not possible with partition.

With these restrictions, on my local AMD Ryzen 5 3600, I get the following results:

partition: 290ms +/- 10ms
disjoint: 57ms +/- 1ms

That being said, of course this is a very specific benchmark, and I assume Appx. DBSCAN is a very different algorithm, so this doesn't necessarily mean anything for the performance of that. I don't know anything about what Appx. DBSCAN even is, to be honest (I just drop-in replaced the old dependency with a new one), but my gut feeling would be that the performance of the disjoint set data structure doesn't even matter that much here, as there is so much potentially slow other stuff happening. But I really don't know and probably someone with more knowledge about clusting algorithms should do benchmarks for that, if you are interested in that.

YuhanLiin · 2023-04-26T01:46:12Z

For a quick comparison can you run the normal DBSCAN and Approx DBSCAN benchmarks and post the results? If Approx is slower than we should just get rid of it.

codecov-commenter · 2023-04-26T02:02:09Z

Codecov Report

Patch coverage: 72.72% and project coverage change: +1.41 🎉

Comparison is base (4adece5) 37.08% compared to head (8091015) 38.50%.

📣 This organization is not using Codecov’s GitHub App Integration. We recommend you install it so Codecov can continue to function properly for your repositories. Learn more

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #304      +/-   ##
==========================================
+ Coverage   37.08%   38.50%   +1.41%     
==========================================
  Files          94       94              
  Lines        6196     6218      +22     
==========================================
+ Hits         2298     2394      +96     
+ Misses       3898     3824      -74

Impacted Files	Coverage Δ
...infa-clustering/src/appx_dbscan/cells_grid/cell.rs	`45.90% <ø> (+45.90%)`	⬆️
...linfa-clustering/src/appx_dbscan/cells_grid/mod.rs	`55.35% <60.00%> (+55.35%)`	⬆️
...linfa-clustering/src/appx_dbscan/clustering/mod.rs	`48.48% <83.33%> (+48.48%)`	⬆️

... and 50 files with indirect coverage changes

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

jogru0 added 2 commits April 13, 2023 02:06

Reverted rust-ml#298 (Remove appx DBSCAN impl and replace with type a…

879205f

…liases)

Replace partitions::PartitionVec with disjoint::DisjointSetVec.

bd7a892

jogru0 changed the title ~~Reactivate apppx DBSCAN impl with new disjoint set dependency~~ Reactivate appx DBSCAN impl with new disjoint set dependency Apr 23, 2023

Bumped disjoint version from 0.5.0 to 0.6.0.

8091015

jogru0 force-pushed the from_partitons_to_disjoint branch from ae130d3 to 8091015 Compare April 23, 2023 18:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reactivate appx DBSCAN impl with new disjoint set dependency #304

Reactivate appx DBSCAN impl with new disjoint set dependency #304

jogru0 commented Apr 19, 2023 •

edited

quietlychris commented Apr 24, 2023

jogru0 commented Apr 25, 2023 •

edited

YuhanLiin commented Apr 26, 2023

codecov-commenter commented Apr 26, 2023

Reactivate appx DBSCAN impl with new disjoint set dependency #304

Are you sure you want to change the base?

Reactivate appx DBSCAN impl with new disjoint set dependency #304

Conversation

jogru0 commented Apr 19, 2023 • edited

quietlychris commented Apr 24, 2023

jogru0 commented Apr 25, 2023 • edited

YuhanLiin commented Apr 26, 2023

codecov-commenter commented Apr 26, 2023

Codecov Report

jogru0 commented Apr 19, 2023 •

edited

jogru0 commented Apr 25, 2023 •

edited