
CLN Cleaned cluster/_hdbscan/_reachability.pyx #24701

Merged
merged 29 commits into from
Feb 14, 2023

Conversation

Micky774
Contributor

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Removes unnecessary imports and provides clarifying TODO comment regarding future implementation of algorithm.

Any other comments?

Please feel free to suggest stylistic changes to sklearn/cluster/_hdbscan/_reachability.pyx in this PR; however, the main work to be done in that file is rewriting the LIL-based sparse mutual reachability algorithm as a CSR-based algorithm, which is currently slated for a follow-up PR after release.
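For context, the CSR rewrite essentially amounts to computing, for every stored edge (i, j) with distance d, max(core_i, core_j, d) directly over the CSR buffers. A minimal NumPy/SciPy sketch of that idea (illustrative only — the function name `sparse_mutual_reachability` and the core-distance indexing are assumptions here, and the actual follow-up work targets a Cython implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_mutual_reachability(graph, min_samples=5):
    """Hypothetical sketch: mutual reachability over a CSR distance graph.

    core[i] is taken as the min_samples-th smallest stored distance in
    row i; each stored edge (i, j, d) becomes max(core[i], core[j], d).
    """
    graph = csr_matrix(graph)
    n = graph.shape[0]
    core = np.full(n, np.inf)
    for i in range(n):
        row = graph.data[graph.indptr[i]:graph.indptr[i + 1]]
        if len(row) >= min_samples:
            core[i] = np.partition(row, min_samples - 1)[min_samples - 1]
    out = graph.copy()
    for i in range(n):
        start, stop = graph.indptr[i], graph.indptr[i + 1]
        neighbors = graph.indices[start:stop]
        out.data[start:stop] = np.maximum(
            np.maximum(core[i], core[neighbors]), graph.data[start:stop]
        )
    return out
```

Iterating over `indptr`/`indices`/`data` directly is what makes the CSR layout attractive compared to LIL's per-row Python lists.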

@Micky774 Micky774 added module:cluster Quick Review For PRs that are quick to review labels Oct 19, 2022
@Micky774 Micky774 changed the title CLN Cleaned imports and improved inline documentation CLN Cleaned cluster/_hdbscan/_reachability.pyx Oct 19, 2022
@Micky774 Micky774 mentioned this pull request Oct 19, 2022
13 tasks
@glemaitre glemaitre self-requested a review October 19, 2022 09:55
@glemaitre
Member

I am not sure of the reason, but the branch name does not allow me to check out your branch.
It would be best to use a branch name like hdbscan_clean_reachability instead of HDBSCAN/. I assume that your previous branch, which was named hdbscan, creates some kind of conflict.

@glemaitre
Member

I opened Micky774#5 to address some naming issues and a couple of Cython improvements.

@glemaitre glemaitre removed their request for review November 3, 2022 10:55
@Micky774
Contributor Author

Micky774 commented Nov 5, 2022

From the other PR:

Something that I was wondering and seems quite important, it seems that we assume that the distance used is symmetric. Otherwise, the way we build the core_distances would not work.

Yes, this is a critical assumption. I've updated the code to include some validation on the precomputed distance matrix to ensure symmetry. As an aside, in other parts of the codebase that assume symmetric matrices, I know we check at least for squareness, but I'm not sure whether we check for actual symmetry. It's not expensive: np.allclose(X, X.T) takes ~650 ms for a (10_000, 10_000) symmetric matrix (symmetric so as to avoid potential early stops). If the check is not present in other areas, I wonder whether it is worth adding.
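A minimal sketch of the kind of validation being described (the function name `check_symmetric` is illustrative, not the actual helper added to the codebase):

```python
import numpy as np

def check_symmetric(X, atol=1e-10):
    # Illustrative validation for a precomputed distance matrix:
    # squareness first, then actual (approximate) symmetry.
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        raise ValueError("Precomputed distance matrix must be square.")
    if not np.allclose(X, X.T, atol=atol):
        raise ValueError("Precomputed distance matrix must be symmetric.")

rng = np.random.default_rng(0)
A = rng.random((100, 100))
check_symmetric(A + A.T)  # symmetric by construction: passes
```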

@glemaitre
Member

np.allclose will not work on sparse matrices, I assume.

It might be worth using sklearn.utils._testing.assert_allclose_dense_sparse, so we do not have to handle the dense/sparse distinction ourselves, and re-raise an informative error message:

try:
    assert_allclose_dense_sparse(X, X.T)
except AssertionError as exc:
    raise ValueError(...) from exc

@thomasjpfan
Member

In sklearn/utils/validation.py, there is a _allclose_dense_sparse which raises ValueError already:

def _allclose_dense_sparse(x, y, rtol=1e-7, atol=1e-9):
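For illustration, a standalone version of a dense/sparse-agnostic comparison with the semantics discussed here (this is a sketch, not sklearn's private `_allclose_dense_sparse` helper):

```python
import numpy as np
from scipy import sparse

def allclose_dense_sparse(x, y, rtol=1e-7, atol=1e-9):
    # Compare two matrices that are both dense or both sparse.
    if sparse.issparse(x) and sparse.issparse(y):
        # Working on the sparse difference avoids densifying;
        # implicit zeros cancel out in x - y.
        diff = abs(x - y) - rtol * abs(y)
        return diff.max() <= atol
    if not sparse.issparse(x) and not sparse.issparse(y):
        return np.allclose(x, y, rtol=rtol, atol=atol)
    raise ValueError("Cannot compare a dense array with a sparse matrix.")
```

A symmetry check for a sparse precomputed matrix would then be `allclose_dense_sparse(X, X.T)`.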

Member

@glemaitre glemaitre left a comment


+1 on this one.

Member

@thomasjpfan thomasjpfan left a comment


Overall, this looks good.

Comment on lines 113 to 114
distance_matrix, further_neighbor_idx, axis=1
)[:, further_neighbor_idx]
Member


Using [:, further_neighbor_idx] here, means that core_distances is not contiguous anymore. Does this lead to a performance regression here?

I'm okay with partitioning in the original code, to keep core_distances contiguous.

Also, I think that slicing the original 2d array from np.partition is another instance of #17299. The workaround is to use further_neighbor_idx to index core_distances when computing the mutual_reachibility_distance.
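The contiguity point can be seen directly in NumPy (a small illustration, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
distance_matrix = rng.random((6, 6))
further_neighbor_idx = 2

# Taking one column of the partitioned 2-D array yields a strided
# 1-D view whose stride is the row stride, so it is not C-contiguous.
core_distances = np.partition(
    distance_matrix, further_neighbor_idx, axis=1
)[:, further_neighbor_idx]
print(core_distances.flags["C_CONTIGUOUS"])  # False

# An explicit copy restores contiguity if desired.
core_contig = np.ascontiguousarray(core_distances)
print(core_contig.flags["C_CONTIGUOUS"])  # True
```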

Contributor Author


Turns out there is no significant performance or memory difference between the current non-contiguous setup and one where c-contiguity is enforced:

from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs
from sklearn.cluster._hdbscan._reachability import _dense_mutual_reachability_graph

X, _ = make_blobs(n_samples=20_000, random_state=10)
D = euclidean_distances(X)
%timeit -n 5 _dense_mutual_reachability_graph(D, 5)
non-contig: 2.66 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
contig: 2.74 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
non-contiguous / contiguous
Total number of allocations: 92508/93065
Total number of frames seen: 7109/7115
Peak memory usage: 6.6 GB/6.6 GB

Currently we do not require that distance_matrix is actually contiguous; however, that would need to be the case if we enforce that core_distances is contiguous. Given that there are no performance or memory gains, I'm not sure it's worth changing.

Member


Thanks for the benchmark! The delta probably needs to be assessed based on what takes the longest to run in _dense_mutual_reachability_graph. This can be inspected with profilers such as py-spy, for instance.

Given the scope of this PR, I am OK proceeding as is and exploring potential performance improvements in subsequent PRs. What do you think?

Contributor Author

@Micky774 Micky774 Nov 30, 2022


Thanks for the tip to use py-spy, it's quite nice for this.

Looking at it again, it seems that swapping the prange to the outer loop and enforcing a contiguous 1-D core_distances array is significantly faster. Before, the dense loop occupied ~30% of the runtime of _dense_mutual_reachability_graph, whereas with the changes the new proportion is ~1%. I tested whether indexing into the 2-D core_distances or creating a 1-D core_distances is preferable, and the latter is notably faster, most likely due to caching.
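In NumPy terms, the dense kernel under discussion computes max(core_i, core_j, D_ij) for every pair of points. A vectorized sketch (illustrative only — the real kernel is a Cython prange loop, and the exact min_samples indexing here is an assumption, relying on D[i, i] == 0 putting each point itself at index 0 of its partitioned row):

```python
import numpy as np

def dense_mutual_reachability(D, min_samples):
    # core[i]: distance to the (min_samples)-th nearest neighbor,
    # taken from a contiguous 1-D array as discussed above.
    core = np.ascontiguousarray(
        np.partition(D, min_samples, axis=1)[:, min_samples]
    )
    # Broadcast the pairwise maximum of core distances against D.
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)
```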

The new forced contiguity system is still faster even when dealing with non-contiguous inputs. Flame graphs below (not sure what the best way to share them would be).

- Current
- Swap Parallelization Axis
- Swap Parallelization Axis and Enforce C-Contig
- Swap Parallelization Axis and Enforce C-Contig (non-contiguous data)

Member


Thanks for the report.

I think posting py-spy profiles in a Speedscope-readable format, recorded with:

py-spy record --native -o py-spy-profile.svg -f speedscope -- python ./perf.py

might be the most convenient for others to inspect.

Member

@jjerphan jjerphan left a comment


Thanks @Micky774.

Here are a few comments.

(Two review threads on sklearn/cluster/_hdbscan/_reachability.pyx, since resolved.)
Member

@jjerphan jjerphan left a comment


This LGTM up to a few suggestions. Thank you, @Micky774!

Also the linting failure on the CI looks unrelated (the Azure worker seems to have failed).

Edit: since @thomasjpfan has reviewed and since there is one unresolved thread, I would wait for @thomasjpfan to approve this PR before merging it.

(Review threads on sklearn/cluster/_hdbscan/_reachability.pyx and sklearn/cluster/_hdbscan/hdbscan.py, since resolved.)

@jjerphan
Member

@thomasjpfan: should we merge?

Member

@thomasjpfan thomasjpfan left a comment


Overall looks good now. One last concern about using prange.

(Review thread on sklearn/cluster/_hdbscan/_reachability.pyx, since resolved.)
Member

@jjerphan jjerphan left a comment


Just one last comment.

(Review thread on sklearn/cluster/tests/test_hdbscan.py, since resolved.)
@jjerphan jjerphan merged commit 429e93c into scikit-learn:hdbscan Feb 14, 2023
@Micky774 Micky774 deleted the HDBSCAN/clean_reachability branch February 14, 2023 17:16
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@Micky774 Micky774 mentioned this pull request Jul 7, 2023
13 tasks