Describe the bug
UMAP spectral initialization in cuML yields unexpected initial layout results. Consequently, the global structure of the input data is often not preserved, even when a very high n_epochs parameter is used. If UMAP is used as a pre-processing step for clustering, this behavior can impact results significantly, depending on the geometry of the input data.
Steps/Code to reproduce bug
The code below provides a minimal example which exhibits the issue on scikit-learn's make_circles dataset. Since the results of the cuML UMAP implementation may vary despite setting a random_state seed, one may have to repeatedly run the code below in order to get a result which demonstrates the problematic behavior.
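For reference, a comparison along these lines can be sketched as follows. This is a stand-in under assumptions rather than the exact script behind the plots: it assumes umap-learn and cuML are both installed, and the helper names (cpu_umap, gpu_umap, run_all, demo) as well as the dataset parameters are placeholders of mine.

```python
import numpy as np


def cpu_umap(n_epochs=None, seed=42):
    """Return a fit function backed by umap-learn (imported lazily)."""
    import umap  # umap-learn package

    model = umap.UMAP(init="spectral", n_epochs=n_epochs, random_state=seed)
    return model.fit_transform


def gpu_umap(n_epochs=None, seed=42):
    """Return a fit function backed by cuML (imported lazily, needs a GPU)."""
    from cuml.manifold import UMAP

    model = UMAP(init="spectral", n_epochs=n_epochs, random_state=seed)
    return model.fit_transform


def run_all(X, embedders):
    """Apply every embedder to the same data and collect the 2D layouts."""
    X = np.asarray(X, dtype=np.float32)
    return {name: np.asarray(fit(X)) for name, fit in embedders.items()}


def demo():
    """Plot CPU and GPU embeddings of make_circles side by side."""
    import matplotlib.pyplot as plt
    from sklearn.datasets import make_circles

    # Dataset parameters are assumptions, not the values from the issue.
    X, y = make_circles(n_samples=2000, factor=0.5, noise=0.05, random_state=0)
    results = run_all(X, {"CPU (umap-learn)": cpu_umap(), "GPU (cuML)": gpu_umap()})

    fig, axes = plt.subplots(1, len(results), figsize=(10, 4))
    for ax, (name, emb) in zip(axes, results.items()):
        ax.scatter(emb[:, 0], emb[:, 1], c=y, s=2)
        ax.set_title(name)
    plt.show()
```

Calling demo() a few times, or with different seeds passed to gpu_umap, is one way to account for the run-to-run variation mentioned above.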
The resulting plot should look something like this:
Expected behavior
Spectral initialization should yield global structure similar to that of the CPU version of UMAP, both in the initial layout and consequently after layout optimization. For the most part at least, spatially distinct connected components of the input data should stay separated after embedding; UMAP with spectral initialization should not make clustering such cases more difficult (see rings example below).
The results of CPU and GPU UMAP need not be identical, of course, as there are understandably some implementation differences, particularly with regard to spectral embedding. However, the general behavior with respect to global structure preservation under spectral initialization should be the same.
The following figure demonstrates the extreme difference in spectral initialization behavior between CPU and GPU UMAP on scikit-learn's make_blobs. Note that it seems impossible to set n_epochs to precisely 0 in the cuML implementation without invoking default values, so a minimal value of 1 is used below. (Apologies for the small text. CPU results are in the top row and GPU results are in the bottom row.)
For a dataset like make_blobs, the initialization is very different, but the CPU and GPU results can generally be brought into agreement by running a high number of epochs. Here is a similar figure demonstrating the difference on a dataset composed of three non-concentric rings. In this case the difference cannot be removed by setting a high n_epochs (see further below):
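A three-rings dataset of this kind can be generated along the following lines. This is only a sketch: three_rings is a hypothetical helper, and the centers, radius, and noise level are assumptions rather than the values used for the figure.

```python
import numpy as np


def three_rings(n_per_ring=500, radius=1.0, noise=0.03,
                centers=((0.0, 0.0), (3.0, 0.0), (1.5, 2.6)), seed=0):
    """Sample noisy points on three separate, non-concentric rings.

    Returns (X, y): X has shape (3 * n_per_ring, 2) and y holds the ring
    index of each point.
    """
    rng = np.random.default_rng(seed)
    points, labels = [], []
    for label, (cx, cy) in enumerate(centers):
        theta = rng.uniform(0.0, 2.0 * np.pi, n_per_ring)
        # Jitter the radius so each ring has a small thickness.
        r = radius + rng.normal(0.0, noise, n_per_ring)
        ring = np.column_stack((cx + r * np.cos(theta), cy + r * np.sin(theta)))
        points.append(ring)
        labels.append(np.full(n_per_ring, label))
    return np.vstack(points).astype(np.float32), np.concatenate(labels)
```

With the default centers the rings are a full unit apart, so with a small neighborhood size each ring should form its own connected component in the kNN graph.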
I doubt that the difference in spectral initialization results can always be resolved by raising the n_epochs parameter, as has been suggested as a potential resolution to similar issues reported by other users (e.g. #5474). Since the low-dimensional layout optimization only acts on KNN-localized edge weights, I don't expect any number of epochs to guarantee the recovery of global structure. To be sure, we can check that at n_epochs values of 500 and 2000 we observe the same difference in behavior between CPU and GPU UMAP for the three rings dataset above.
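To illustrate why the initialization carries the global structure, here is a simplified sketch of a spectral layout computed from a kNN graph. It is not the umap-learn or cuML/RAFT code: it uses binary kNN edges instead of UMAP's fuzzy weights, a dense numpy.linalg.eigh instead of a sparse solver, and it assumes the graph has a single connected component. The names knn_adjacency and spectral_layout are hypothetical.

```python
import numpy as np


def knn_adjacency(X, k=10):
    """Symmetric binary kNN adjacency matrix (dense, small inputs only)."""
    X = np.asarray(X, dtype=np.float64)
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    idx = np.argsort(d2, axis=1)[:, :k]
    n = X.shape[0]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    return np.maximum(A, A.T)  # keep an edge if either endpoint selected it


def spectral_layout(A, dim=2):
    """Eigenvectors of the normalized Laplacian I - D^-1/2 A D^-1/2.

    Returns (eigenvalues, layout). The first eigenvector is skipped, which
    is only appropriate when the graph is connected.
    """
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    L = np.eye(A.shape[0]) - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvals, eigvecs[:, 1:dim + 1]
```

The layout is determined by the whole graph at once, whereas the subsequent optimization only moves points according to their kNN edges. A poor spectral solution is therefore not something extra epochs are designed to repair.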
Environment details (please complete the following information):
Environment location: Bare metal
Linux Distro/Architecture: Pop!_OS 22.04 LTS x86_64
Additional context
I initially felt this could be related to the known issue with the Laplacian eigenmaps solver mentioned in the comments of #5474. However, the differences in results compared to the CPU solvers seem somewhat extreme.
I am also aware that CPU UMAP handles spectral layout of networks with multiple components somewhat differently than single-component networks. However, datasets which yield single-component graphs may exhibit the same behavior as above, e.g. when embedding a single ring.
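To check which of the two cases a dataset falls into, one can count the connected components of its neighbor graph. Below is a minimal sketch using a plain union-find; count_components is a hypothetical helper, and the edge list is assumed to hold the symmetrized kNN edges (building them is left out here). scipy.sparse.csgraph.connected_components does the same job on a sparse matrix.

```python
def count_components(n_nodes, edges):
    """Count connected components of an undirected graph with union-find."""
    parent = list(range(n_nodes))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    return len({find(i) for i in range(n_nodes)})


# A single 6-node ring is one component.
single_ring = [(i, (i + 1) % 6) for i in range(6)]
print(count_components(6, single_ring))  # 1

# Two disjoint triangles are two components.
two_triangles = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)]
print(count_components(6, two_triangles))  # 2
```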
I am happy to provide any additional code, examples, or environment details upon request.
Additionally, thank you all for the incredible work you do on this repository, and in particular for bringing UMAP to the GPU. You guys are phenomenal, and your efforts here are so deeply appreciated!
Thanks for the issue @kc-howe! We have identified a few misbehaviors and issues with spectral clustering from RAFT that affect UMAP in particular. We will be working on solving them. We don't have an ETA yet, but it is on our roadmap as we work on RAFT, cuML and cuVS over the next few releases.