Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export nearest centroids for all data points #338

Open
ylwu-amzn opened this issue Jul 1, 2022 · 3 comments
Open

Export nearest centroids for all data points #338

ylwu-amzn opened this issue Jul 1, 2022 · 3 comments

Comments

@ylwu-amzn
Copy link
Contributor

Refer to opensearch-project/ml-commons#355 (comment)

Seems current RCF summarizer can't export nearest centroids for all data points. That means client side needs to recalculate nearest centroids with the same distance algorithm which looks like duplicate effort. As RCF summarizer has already calculated nearest centroids for all data points, how about we add some option to return all data points' centroids to avoid recalculation from client side?

@sudiptoguha
Copy link
Contributor

This is a premised question. First, RCF summarizer does not compute nearest centroids for all data points. In fact, no advanced clustering algorithm (say circa late 1990's or later) does so either. I would recommend that new implementations post 2020 take a hard look at the two issues below:

a. Syntactic -- use of clustering is most successful in data reduction/summarization. That reduction could be for (i) computational efficiency (same reasons one uses histograms as in 1d data) or denoising, where 1M points are reduced to say a 100 weighted points (almost 10000x improvement) or (ii) conceptual efficiency (or insights) where one takes a "10000 feet" view of the data -- these are the 10 largest clusters in the data (Look! how are they changing over time, etc. etc.) .

b. Semantic -- assignment to nearest centroid is limiting. It has been shown many times over that soft clustering has more desirable properties in information retrieval. Outputting the centroids leaves open the possibility of soft clustering. "Clustering assignment" applications (as in logistics) often comes with multiple constraints and assign-to-nearest is often not the end solution. Now one could use an intermediate solution of cluster-and-assign-to-nearest as a "first cut" for some problem -- but

  • (1) first cuts tend not to scale and get bogged down in non-germane restrictions. In every such use of clustering, an approach that optimizes the end-to-end usage will be more impactful.
  • (2) having "assign-to-nearest" (or any assignment) as an implicit operation saves significant time and space. For example, using 2d integer data, if a assignment column (also an integer) is always added automatically, the output data just increased by 50%. Imagine this occurring at every internal step of an algorithm.

Now, there are use cases of more information from clustering. For example, Dendogram corresponds to the hierarchy of cluster where users can zoom in/out, otherwise navigate the clusters without any fixed number of clusters. But these are application dependent. They consume significant computational resources to build and should not be built as a default for people who are interested in data reduction/summarization as in (a).

I understand that some (justifiably well regarded) common ML platforms have the API that outputs hard cluster membership similar to the output of a classification task. But clustering is not classification -- the distinction between unsupervised and supervised algorithms are more fundamental than APIs. For example, the "nearest-centroid" output is a batch concept and limits the use and benefits of streaming. If the specific output of batch oriented ML platform is desirable, why not ask people to use that batch oriented platform?

RCF will continue to focus on streaming algorithms, or at a minimum, not take a step that precludes streaming algorithms. The use of summarize in RCF is to look at the predictions derived from the individual trees in the ensemble and surface them as a "10000 feet" observability insight as in (a)(ii) above.

@ylwu-amzn
Copy link
Contributor Author

Thanks @sudiptoguha for the detailed explanation. Seems it's not reasonable to calculate every point's nearest centroid inside RCF summary. Is it possible to provide some utility API in RCF to recalculate nearest centroids? Asking this to avoid any wrong calculation and duplicate effort from client side.

@sudiptoguha
Copy link
Contributor

sudiptoguha commented Jul 27, 2022

" Is it possible to provide some utility API in RCF to recalculate nearest centroids?" -> No. I think that bloats the RCF library and is scientifically less desirable as I suggested here opensearch-project/ml-commons#358

This is a decision the "client" of any ML should take seriously; trying to avoid making that decision has serious and potentially unnecessary performance costs.

I strongly recommend the reference "The on-line textbook: Information Theory, Inference, and Learning Algorithms" http://www.inference.phy.cam.ac.uk/mackay/itila/

Please check Chapter 20, "An Example Inference Task: Clustering".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants