Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Contraction Clustering (RASTER): A very fast and parallelizable clustering algorithm #170

Open
gregorulm opened this issue Dec 26, 2023 · 4 comments

Comments

@gregorulm
Copy link

Copied from: scikit-learn/scikit-learn#27848
In the comments, it was suggested that scikit-learn-extras may be a better fit for this algorithm.

Describe the workflow you want to enable

RASTER is a very fast clustering algorithm that runs in linear time, uses constant memory, and only requires a single pass. The relevant package is cluster.

Describe your proposed solution

RASTER has been shown to be faster than all other clustering algorithms that are part of the cluster package (see comparative results in the "alternatives" field). A detailed description of the algorithm is in this paper. The key idea is that data points are projected onto a grid. This helper data structure that allows us to cluster data points at the desired level of precision and at a speed much faster than any other clustering algorithm we encountered in the literature. The closest comparison we were made aware of was CLIQUE, but RASTER is more efficient and, in fact, many orders of magnitude faster, which we have also shown experimentally, see Appendix B in the paper above.

Plots with comparisons:
Screen Shot 2023-11-26 at 16 10 16

Example of adjusting the precision parameter:
Screen Shot 2023-11-26 at 16 12 47

Pseudo-code:
Screen Shot 2023-11-26 at 16 06 36

Implementation:
https://github.com/FraunhoferChalmersCentre/raster/tree/master

The algorithm is furthermore parallelizable.

Describe alternatives you've considered, if relevant

We compare RASTER to 10 other clustering algorithms, and have found that it outperforms them. RASTER is not only faster, it is also able to process greater amounts of data, ceteris paribus. Here is a summary of the results of our research:
Screen Shot 2023-11-26 at 16 04 57

Additional context

RASTER was discovered in the context of an industrial research project "FUMA - Fleet telematics big data analytics for vehicle Usage Modeling and Analysis" (description, final report), which was administered by the Swedish research agency VINNOVA. It was a collaboration between Scania, a Volkswagen subsidiary, and the Fraunhofer-Chalmers Center for Industrial Mathematics. This research project ran from 2016 to 2019.

A practical result of RASTER was that it made it possible to process TBs of real-world geo-spatial data on a local workstation instead of a data center, leading to significant cost and time-savings. This also implies that we could eliminate security risks as we could keep this highly confidential dataset in-house. In fact, due to the single-pass nature of RASTER, we could generate results locally much faster than we could have gotten them via a data center.

@TimotheeMathieu
Copy link
Contributor

Hello @gregorulm , thanks this seem interesting and could definitely be included into scikit-learn-extra.
Could you do a Pull Request from a fork, with a link to this issue.
For inclusion, you will need to add tests as well as examples and docstrings for your code, if you are upt for it you can do a PR and we can discuss there.

@smruthi-sumanth
Copy link

HI! New to OSS, and really think your proposal is worth adding to sklearn-extra. Can I get the go ahead to start working on this?

@gregorulm
Copy link
Author

@smruthi-sumanth I intend to look at the code soon. If I really do not find the time, I will let you know.

@gregorulm
Copy link
Author

@smruthi-sumanth Feel free to take over the implementation. You may find these resources helpful:

  1. Reference implementations
  2. Code artifacts including an integration with the scikit-learn clustering benchmark

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants