
CLN Cleaned cluster/_hdbscan/_reachability.pyx #24701

Merged
merged 29 commits into from
Feb 14, 2023

Conversation

Micky774
Contributor

Reference Issues/PRs

Towards #24686

What does this implement/fix? Explain your changes.

Removes unnecessary imports and provides clarifying TODO comment regarding future implementation of algorithm.

Any other comments?

Please feel free to suggest stylistic changes to sklearn/cluster/_hdbscan/_reachability.pyx in this PR; however, the main work to be done in that file is rewriting the LIL-based sparse mutual reachability algorithm as a CSR-based algorithm, which is currently slated for a follow-up PR after release.
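For context, the CSR rewrite essentially amounts to computing, for every stored edge (i, j) with distance d, max(core_i, core_j, d) directly over the CSR buffers. A minimal NumPy/SciPy sketch of that idea (illustrative only — the function name `sparse_mutual_reachability` and the core-distance indexing are assumptions here, and the actual follow-up work targets a Cython implementation):

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_mutual_reachability(graph, min_samples=5):
    """Hypothetical sketch: mutual reachability over a CSR distance graph.

    core[i] is taken as the min_samples-th smallest stored distance in
    row i; each stored edge (i, j, d) becomes max(core[i], core[j], d).
    """
    graph = csr_matrix(graph)
    n = graph.shape[0]
    core = np.full(n, np.inf)
    for i in range(n):
        row = graph.data[graph.indptr[i]:graph.indptr[i + 1]]
        if len(row) >= min_samples:
            core[i] = np.partition(row, min_samples - 1)[min_samples - 1]
    out = graph.copy()
    for i in range(n):
        start, stop = graph.indptr[i], graph.indptr[i + 1]
        neighbors = graph.indices[start:stop]
        out.data[start:stop] = np.maximum(
            np.maximum(core[i], core[neighbors]), graph.data[start:stop]
        )
    return out
```

Iterating over `indptr`/`indices`/`data` directly is what makes the CSR layout attractive compared to LIL's per-row Python lists.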

@Micky774 Micky774 added module:cluster Quick Review For PRs that are quick to review labels Oct 19, 2022
@Micky774 Micky774 changed the title CLN Cleaned imports and improved inline documentation CLN Cleaned cluster/_hdbscan/_reachability.pyx Oct 19, 2022
@Micky774 Micky774 mentioned this pull request Oct 19, 2022
13 tasks
@glemaitre glemaitre self-requested a review October 19, 2022 09:55
@glemaitre
Member

I am not sure of the reason, but the branch name does not allow me to check out your branch.
It would be best to use a branch name like hdbscan_clean_reachability instead of HDBSCAN/. I assume that your previous branch, which was named hdbscan, creates some kind of conflict.

@glemaitre
Member

I opened Micky774#5 to address some naming issues and a couple of Cython improvements.

@glemaitre glemaitre removed their request for review November 3, 2022 10:55
@Micky774
Contributor Author

Micky774 commented Nov 5, 2022

From the other PR:

Something that I was wondering and seems quite important, it seems that we assume that the distance used is symmetric. Otherwise, the way we build the core_distances would not work.

Yes, this is a critical assumption. I've updated the code to include some validation on the precomputed distance matrix to ensure symmetry. As an aside, in other parts of the codebase that assume symmetric matrices, I know we check at least for squareness, but I'm not sure whether we check for actual symmetry. It's not expensive: np.allclose(X, X.T) takes ~650 ms for a (10_000, 10_000) symmetric matrix (symmetric so as to avoid potential early stops). If the check is not present in other areas, I wonder whether it is worth adding.
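A minimal sketch of the kind of validation being described (the function name `check_symmetric` is illustrative, not the actual helper added to the codebase):

```python
import numpy as np

def check_symmetric(X, atol=1e-10):
    # Illustrative validation for a precomputed distance matrix:
    # squareness first, then actual (approximate) symmetry.
    X = np.asarray(X)
    if X.ndim != 2 or X.shape[0] != X.shape[1]:
        raise ValueError("Precomputed distance matrix must be square.")
    if not np.allclose(X, X.T, atol=atol):
        raise ValueError("Precomputed distance matrix must be symmetric.")

rng = np.random.default_rng(0)
A = rng.random((100, 100))
check_symmetric(A + A.T)  # symmetric by construction: passes
```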

@glemaitre
Member

np.allclose will not work on sparse matrices, I assume.

It might be worth using sklearn.utils._testing.assert_allclose_dense_sparse, so we do not have to handle the dense/sparse distinction ourselves, and re-raise an informative error message:

try:
    assert_allclose_dense_sparse(X, X.T)
except AssertionError as exc:
    raise ValueError(...) from exc

@thomasjpfan
Member

In sklearn/utils/validation.py, there is a _allclose_dense_sparse which raises ValueError already:

def _allclose_dense_sparse(x, y, rtol=1e-7, atol=1e-9):
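For illustration, a standalone version of a dense/sparse-agnostic comparison with the semantics discussed here (this is a sketch, not sklearn's private `_allclose_dense_sparse` helper):

```python
import numpy as np
from scipy import sparse

def allclose_dense_sparse(x, y, rtol=1e-7, atol=1e-9):
    # Compare two matrices that are both dense or both sparse.
    if sparse.issparse(x) and sparse.issparse(y):
        # Working on the sparse difference avoids densifying;
        # implicit zeros cancel out in x - y.
        diff = abs(x - y) - rtol * abs(y)
        return diff.max() <= atol
    if not sparse.issparse(x) and not sparse.issparse(y):
        return np.allclose(x, y, rtol=rtol, atol=atol)
    raise ValueError("Cannot compare a dense array with a sparse matrix.")
```

A symmetry check for a sparse precomputed matrix would then be `allclose_dense_sparse(X, X.T)`.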

Member

@glemaitre glemaitre left a comment


+1 on this one.

Member

@thomasjpfan thomasjpfan left a comment


Overall, this looks good.

Comment on lines 113 to 114
distance_matrix, further_neighbor_idx, axis=1
)[:, further_neighbor_idx]
Member


Using [:, further_neighbor_idx] here, means that core_distances is not contiguous anymore. Does this lead to a performance regression here?

I'm okay with partitioning in the original code, to keep core_distances contiguous.

Also, I think that slicing the original 2d array from np.partition is another instance of #17299. The workaround is to use further_neighbor_idx to index core_distances when computing the mutual_reachibility_distance.
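The contiguity point can be seen directly in NumPy (a small illustration, not the PR's code):

```python
import numpy as np

rng = np.random.default_rng(0)
distance_matrix = rng.random((6, 6))
further_neighbor_idx = 2

# Taking one column of the partitioned 2-D array yields a strided
# 1-D view whose stride is the row stride, so it is not C-contiguous.
core_distances = np.partition(
    distance_matrix, further_neighbor_idx, axis=1
)[:, further_neighbor_idx]
print(core_distances.flags["C_CONTIGUOUS"])  # False

# An explicit copy restores contiguity if desired.
core_contig = np.ascontiguousarray(core_distances)
print(core_contig.flags["C_CONTIGUOUS"])  # True
```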

Contributor Author


Turns out there is no significant performance or memory difference between the current non-contiguous setup and one where c-contiguity is enforced:

from sklearn.metrics.pairwise import euclidean_distances
from sklearn.datasets import make_blobs
from sklearn.cluster._hdbscan._reachability import _dense_mutual_reachability_graph

X, _ = make_blobs(n_samples=20_000, random_state=10)
D = euclidean_distances(X)
%timeit -n 5 _dense_mutual_reachability_graph(D, 5)
non-contig: 2.66 s ± 73.3 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
contig: 2.74 s ± 197 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)
non-contiguous / contiguous
Total number of allocations: 92508/93065
Total number of frames seen: 7109/7115
Peak memory usage: 6.6 GB/6.6 GB

Currently we do not require that distance_matrix is actually contiguous; however, that would need to be the case if we enforce that core_distances is contiguous. Given that there are no performance or memory gains, I'm not sure it's worth changing.

Member


Thanks for the benchmark! The delta probably needs to be assessed based on what takes the longest to run in _dense_mutual_reachability_graph. This can be inspected with profilers such as py-spy, for instance.

Given the scope of this PR, I am OK proceeding as is and exploring potential performance improvements in subsequent PRs. What do you think?

Contributor Author

@Micky774 Micky774 Nov 30, 2022


Thanks for the tip to use py-spy, it's quite nice for this.

Looking at it again, it seems that swapping the prange to the outer loop and enforcing a contiguous 1-D core_distances array is significantly faster. Before, the dense loop occupied ~30% of the runtime of _dense_mutual_reachability_graph, whereas with the changes the new proportion is ~1%. I tested whether indexing into the 2-D core_distances or creating a 1-D core_distances is preferable, and the latter is notably faster, most likely due to caching.
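In NumPy terms, the dense kernel under discussion computes max(core_i, core_j, D_ij) for every pair of points. A vectorized sketch (illustrative only — the real kernel is a Cython prange loop, and the exact min_samples indexing here is an assumption, relying on D[i, i] == 0 putting each point itself at index 0 of its partitioned row):

```python
import numpy as np

def dense_mutual_reachability(D, min_samples):
    # core[i]: distance to the (min_samples)-th nearest neighbor,
    # taken from a contiguous 1-D array as discussed above.
    core = np.ascontiguousarray(
        np.partition(D, min_samples, axis=1)[:, min_samples]
    )
    # Broadcast the pairwise maximum of core distances against D.
    return np.maximum(np.maximum(core[:, None], core[None, :]), D)
```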

The new forced contiguity system is still faster even when dealing with non-contiguous inputs. Flame graphs below (not sure what the best way to share them would be).

- Current
- Swap Parallelization Axis
- Swap Parallelization Axis and Enforce C-Contig
- Swap Parallelization Axis and Enforce C-Contig (non-contiguous data)

Member


Thanks for the report.

I think posting py-spy profiles in a Speedscope-readable format, recorded with:

py-spy record --native -o py-spy-profile.svg -f speedscope -- python ./perf.py

might be the most convenient for others to inspect.

Member

@jjerphan jjerphan left a comment


Thanks @Micky774.

Here are a few comments.

(Two review threads on sklearn/cluster/_hdbscan/_reachability.pyx, since resolved.)
Member

@jjerphan jjerphan left a comment


This LGTM up to a few suggestions. Thank you, @Micky774!

Also the linting failure on the CI looks unrelated (the Azure worker seems to have failed).

Edit: since @thomasjpfan has reviewed and since there is one unresolved thread, I would wait for @thomasjpfan to approve this PR before merging it.

(Review threads on sklearn/cluster/_hdbscan/_reachability.pyx and sklearn/cluster/_hdbscan/hdbscan.py, since resolved.)

@jjerphan
Member

@thomasjpfan: should we merge?

Member

@thomasjpfan thomasjpfan left a comment


Overall looks good now. One last concern about using prange.

(Review thread on sklearn/cluster/_hdbscan/_reachability.pyx, since resolved.)
Member

@jjerphan jjerphan left a comment


Just one last comment.

(Review thread on sklearn/cluster/tests/test_hdbscan.py, since resolved.)
@jjerphan jjerphan merged commit 429e93c into scikit-learn:hdbscan Feb 14, 2023
@Micky774 Micky774 deleted the HDBSCAN/clean_reachability branch February 14, 2023 17:16
Micky774 added a commit to Micky774/scikit-learn that referenced this pull request May 16, 2023
Co-authored-by: Julien Jerphanion <git@jjerphan.xyz>
Co-authored-by: Thomas J. Fan <thomasjpfan@gmail.com>
Co-authored-by: Guillaume Lemaitre <g.lemaitre58@gmail.com>
@Micky774 Micky774 mentioned this pull request Jul 7, 2023
13 tasks