Add HDBSCAN from HorseML.jl #273

Open · MommaWatasu wants to merge 47 commits into master

Conversation

@MommaWatasu

Add HDBSCAN

I found this issue and wanted my code to be useful, so I compiled it into hdbscan.jl so that it works as is.
I changed some of my original code to bring it closer to the style of this repo.

Overview of functions and structures

  • HDBSCANGraph: used to build the minimum spanning tree
  • HDBSCANCluster: used to build clusters based on the minimum spanning tree
  • HDBSCANResult: used to return the result
  • hdbscan!: the main function that performs HDBSCAN

As I wrote in the comments, many of the utility functions were converted from numpy by myself, so I don't know much about them.

Usage

This is the signature of the main function:

hdbscan!(points::AbstractMatrix, k::Int64, min_cluster_size::Int64; gen_mst::Bool=true, mst=nothing)

Parameters

  • points: the d×n matrix, where each column is the d-dimensional coordinate of a point
  • k: the "core distance" of point A is defined as the distance between point A and the k-th nearest neighbor of point A (see the sketch below)
  • min_cluster_size: the minimum number of points in a cluster
  • gen_mst: whether to generate the minimum spanning tree or not
  • mst: when this is specified and gen_mst is false, a new MST won't be generated
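
To illustrate the role of k, here is a minimal sketch of the core-distance definition above. It assumes Euclidean distances and that a point is excluded from its own neighbors; it is not the code in hdbscan.jl:

using Distances

# Core distance of each point: the distance to its k-th nearest neighbor.
function core_distances(points::AbstractMatrix, k::Int)
    D = pairwise(Euclidean(), points, dims=2)  # columns are points, as in the signature
    # k + 1 because each sorted column includes the point's zero self-distance
    return [partialsort(D[:, j], k + 1) for j in axes(D, 2)]
end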

Example

I checked that the following code works:

# include hdbscan.jl before running this code
using CSV
using DataFrames
using Plots

data = CSV.read("/home/watasu/Documents/code/HorseML.jl/test/data/clustering2.csv", DataFrame) |> Matrix
result = hdbscan!(data, 5, 3)
labels = result.labels

plot(title = "Clustering by HDBSCAN")
for i in -1:maximum(labels)  # label -1 marks noise points
    X = data[findall(labels .== i), :]
    plot!(X[:, 1], X[:, 2], st = :scatter)
end
plot!()

@codecov-commenter commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 97.22222%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (b4df21a) to head (dc5dd40).
Report is 9 commits behind head on master.

Files              Patch %   Lines
src/hdbscan.jl     97.39%    3 Missing ⚠️
src/unionfind.jl   96.55%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #273      +/-   ##
==========================================
+ Coverage   95.40%   95.56%   +0.15%     
==========================================
  Files          20       22       +2     
  Lines        1503     1647     +144     
==========================================
+ Hits         1434     1574     +140     
- Misses         69       73       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alyst (Member) commented Mar 22, 2024

@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some?

there was no test for it
test failed on ubuntu-latest-x86
tests still fail on ubuntu-latest-x86
@MommaWatasu (Author)

I added some tests for hdbscan. But I'm sorry I wasn't able to write unit tests for the utility functions.
If you want to write tests for them, you should check scipy for their specification.
They come from:

  • erf: scipy.special.erf
  • logpdf: I think the original code is this

@alyst (Member) commented Mar 23, 2024

@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test the public API, but we aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. a single data point).
As for the utility functions -- erf/erfc are available in Distributions.jl, which Clustering.jl already depends upon, so you should rather use those. pdf/logpdf are also declared there.
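
A minimal sketch of the kind of public-API test meant here, assuming the hdbscan! signature from this PR, points stored as rows as in the example above, and -1 as the noise label; the exact expected labels depend on the implementation:

using Test

@testset "hdbscan! public API" begin
    # two well-separated Gaussian blobs, 100 points as rows
    data = [randn(50, 2) .- 5; randn(50, 2) .+ 5]
    result = hdbscan!(data, 5, 3)
    labels = result.labels
    @test length(labels) == 100
    # two real clusters are expected besides the noise label -1
    @test length(unique(labels[labels .!= -1])) == 2
end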

there were functions for Xmeans
deleted some utility functions
Documentation workflow failed without docs for it
I forgot to export it
@MommaWatasu (Author)

I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests. I also added simple docs, since the Documentation workflow failed without them.

@alyst (Member) commented Mar 23, 2024

I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests.

Great. erf is provided by SpecialFunctions.jl, but since we want to be conservative about the number of package dependencies, it is convenient that we don't need it.

@alyst (Member) left a comment

Thanks again for the PR!
The first iteration looks good -- we don't need much refactoring, mostly some method renames and object field tweaks for clarity.

Of the bigger things:

  • check the dimensions of the points matrix
  • use the Distances.jl API to calculate the point distances (see the other methods, e.g. clustering_quality(), for how we do it); it is worth precalculating all pairwise distances and passing them to the methods that calculate core distances and build the graph. We can also allow the user to specify the metric as a kwarg to hdbscan, which would be a useful generalization (see the sketch after this list).
  • add more tests to check that the point assignments to the clusters are correct
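
One possible shape of the suggested generalization: the positional arguments follow the PR's hdbscan! above, while the metric kwarg, its default, and the body are assumptions, not the final design:

using Distances

function hdbscan(points::AbstractMatrix, k::Int, min_cluster_size::Int;
                 metric::SemiMetric = Euclidean())
    dists = pairwise(metric, points, dims=2)  # d×n input, columns are points
    # core distance of each point: distance to its k-th nearest neighbor
    core_dists = [partialsort(dists[:, j], k + 1) for j in axes(dists, 2)]
    # ... build the mutual-reachability graph / MST from dists and core_dists ...
    return dists, core_dists
end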

(Inline review comments on docs/source/hdbscan.md, src/hdbscan.jl, and test/hdbscan.jl -- now resolved.)
there is no need to create a new file
add comment and alias
hdbscan.md remains
the progress is temporary
add comments and improve performance, etc.
changes suggested by alyst
error occurred with Julia 1.10
forgot to remove debugging code
add description for detail
fix links
make isnoise available to the user
@MommaWatasu (Author)

@alyst
I applied all your suggestions. Is there anything else I have to fix?

(More inline review comments on src/hdbscan.jl -- now resolved.)
@alyst (Member) commented Apr 27, 2024

I applied all your suggestions. Is there anything else I have to fix?

@MommaWatasu Thank you for adjusting the PR! We are getting close to being able to merge, but we need one more iteration.

Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first.

TODO items:

  • at the moment the tests that you have do not really check whether the clusters are correct. In fact, at the moment all the points are assigned to the noise cluster (I think it was also the case before my changes).
    Please add test(s) that the assignments/clusters are correct. Ideally, we need to test a more complex clustering (more than 2 clusters); it would also be nice to test that changing ncore or min_cluster_size affects the result.
  • Now that I understand the algorithm a bit more, it looks like HdbscanCluster is an internal structure: it is rather a node of the HDBSCAN tree than the cluster you return to the user. It contains fields like stability, children, etc.: some of them we should not expose to the user, and the others you don't really set when preparing the resulting clusters. I suggest that you rename this structure into HdbscanNode (a non-exported structure for the algorithm) and create a new one, HdbscanCluster (the exported one returned to the user). If there are any properties of the cluster, such as stability or being noise -- please make sure to add the relevant fields to HdbscanCluster and properly initialize them when generating the result.
  • HdbscanResult should inherit from ClusteringResult and support its API (in particular, counts is missing)
  • move UnionFind to a separate source file unionfind.jl. I'm not 100% sure it belongs in Clustering.jl. We could potentially use DataStructures.jl, but this code looks rather compact, so I prefer keeping it over depending on another big package.
  • clean up the UnionFind terminology. I think set_id/issameset/items would be an optimal choice (see the sketch after this list).
  • add unit tests for UnionFind, e.g. that finding the root, issameset, and unite! work as expected
  • add newlines in the docstrings to separate the declaration from the description
  • add newlines to separate struct fields from the inner constructor
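
For illustration, a minimal sketch of what unionfind.jl could look like with the suggested terminology (set_id/issameset/items/unite!); the implementation details (path halving, union by size) are assumptions, not the PR's actual code:

mutable struct UnionFind
    parent::Vector{Int}          # parent[i] == i marks a set root
    items::Vector{Vector{Int}}   # members of each set, stored at its root index
end

UnionFind(n::Int) = UnionFind(collect(1:n), [[i] for i in 1:n])

# id (root) of the set containing x, with path halving
function set_id(uf::UnionFind, x::Int)
    while uf.parent[x] != x
        uf.parent[x] = uf.parent[uf.parent[x]]  # shortcut every other link
        x = uf.parent[x]
    end
    return x
end

issameset(uf::UnionFind, x::Int, y::Int) = set_id(uf, x) == set_id(uf, y)

items(uf::UnionFind, x::Int) = uf.items[set_id(uf, x)]

# merge the sets containing x and y (union by size); returns the merged set's id
function unite!(uf::UnionFind, x::Int, y::Int)
    rx, ry = set_id(uf, x), set_id(uf, y)
    rx == ry && return rx
    if length(uf.items[rx]) < length(uf.items[ry])
        rx, ry = ry, rx
    end
    uf.parent[ry] = rx
    append!(uf.items[rx], uf.items[ry])
    empty!(uf.items[ry])
    return rx
end

The unit tests asked for above could then be as simple as:

using Test
uf = UnionFind(4)
unite!(uf, 1, 2)
@test issameset(uf, 1, 2)
@test !issameset(uf, 1, 3)
@test sort(items(uf, 1)) == [1, 2]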

alyst and others added 12 commits April 27, 2024 15:08
replace eachcol with alternatives
add newline
clean up UnionFind terminology and move it to the other file
there was no test for it
rename HdbscanCluster into HdbscanNode and create a new one to expose to the user
the algorithm went wrong
ensure that `min_size` takes effect properly
add counts field into HdbscanResult
@MommaWatasu (Author)

@alyst I have one thing to apologize for: I found that the reason the clustering result went wrong was a serious mistake of mine in the algorithm. I fixed the algorithm and checked that the result is correct. In addition, all the TODO items have been done (but I couldn't add a unit test for ncore, because I don't know how it affects the result).

I would appreciate it if you could check for any performance issues in the fixed algorithm.
