Add HDBSCAN from HorseML.jl #273

Open · MommaWatasu wants to merge 47 commits into master

Conversation

@MommaWatasu

Add HDBSCAN

I found this issue and wanted my code to be useful, so I compiled it into hdbscan.jl so that it works as is.
I changed some of my original code to bring it closer to the style of this repo.

Overview of functions and structures

  • HDBSCANGraph: used to build the minimum spanning tree
  • HDBSCANCluster: used to build clusters based on the minimum spanning tree
  • HDBSCANResult: used to return the result
  • hdbscan!: the main function that performs HDBSCAN

As I wrote in the comments, many of the utility functions were converted from numpy by myself, so I don't know much about them.

Usage

This is the signature of the main function:

hdbscan!(points::AbstractMatrix, k::Int64, min_cluster_size::Int64; gen_mst::Bool=true, mst=nothing)

Parameters

  • points: the d×n matrix, where each column is the d-dimensional coordinate of a point
  • k: the "core distance" of point A is defined as the distance between point A and the k-th nearest neighbor of point A (see the sketch below)
  • min_cluster_size: the minimum number of points in a cluster
  • gen_mst: whether to generate the minimum spanning tree or not
  • mst: when this is specified and gen_mst is false, a new MST won't be generated
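
To illustrate the role of k, here is a minimal sketch of the core-distance definition above. It assumes Euclidean distances and that a point is excluded from its own neighbors; it is not the code in hdbscan.jl:

using Distances

# Core distance of each point: the distance to its k-th nearest neighbor.
function core_distances(points::AbstractMatrix, k::Int)
    D = pairwise(Euclidean(), points, dims=2)  # columns are points, as in the signature
    # k + 1 because each sorted column includes the point's zero self-distance
    return [partialsort(D[:, j], k + 1) for j in axes(D, 2)]
end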

Example

I checked that the following code works:

# include hdbscan.jl before running this code
using CSV
using DataFrames
using Plots

data = CSV.read("/home/watasu/Documents/code/HorseML.jl/test/data/clustering2.csv", DataFrame) |> Matrix
result = hdbscan!(data, 5, 3)
labels = result.labels

plot(title = "Clustering by HDBSCAN")
for i in -1:maximum(labels)  # label -1 marks noise points
    X = data[findall(labels .== i), :]
    plot!(X[:, 1], X[:, 2], st = :scatter)
end
plot!()

@codecov-commenter commented Mar 22, 2024

Codecov Report

Attention: Patch coverage is 97.22222%, with 4 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (b4df21a) to head (dc5dd40).
Report is 9 commits behind head on master.

Files              Patch %   Lines
src/hdbscan.jl     97.39%    3 Missing ⚠️
src/unionfind.jl   96.55%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #273      +/-   ##
==========================================
+ Coverage   95.40%   95.56%   +0.15%     
==========================================
  Files          20       22       +2     
  Lines        1503     1647     +144     
==========================================
+ Hits         1434     1574     +140     
- Misses         69       73       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@alyst (Member) commented Mar 22, 2024

@MommaWatasu Thanks for the PR! I think it would be a useful addition to the package. I will try to review it soon. Meanwhile -- it looks like there are no unit tests for the method. Could you please add some?

there was no test for it
test failed on ubuntu-latest-x86
tests still fail on ubuntu-latest-x86
@MommaWatasu (Author)

I added some tests for hdbscan. But I'm sorry I wasn't able to write unit tests for the utility functions.
If you want to write tests for them, you should check scipy for their specification.
They come from:

  • erf: scipy.special.erf
  • logpdf: I think the original code is this

@alyst (Member) commented Mar 23, 2024

@MommaWatasu Thanks for adding unit tests. Generally we don't test internal utility functions, we only test the public API, but we aim to cover both meaningful examples (small datasets with nontrivial clusterings) and corner cases (e.g. a single data point).
As for the utility functions -- erf/erfc are available in Distributions.jl, which Clustering.jl already depends upon, so you should rather use those. pdf/logpdf are also declared there.
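
A minimal sketch of the kind of public-API test meant here, assuming the hdbscan! signature from this PR, points stored as rows as in the example above, and -1 as the noise label; the exact expected labels depend on the implementation:

using Test

@testset "hdbscan! public API" begin
    # two well-separated Gaussian blobs, 100 points as rows
    data = [randn(50, 2) .- 5; randn(50, 2) .+ 5]
    result = hdbscan!(data, 5, 3)
    labels = result.labels
    @test length(labels) == 100
    # two real clusters are expected besides the noise label -1
    @test length(unique(labels[labels .!= -1])) == 2
end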

there were functions for Xmeans
deleted some utility functions
Documentation workflow failed without docs for it
I forgot to export it
@MommaWatasu (Author)

I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests. I also added simple docs, since the Documentation workflow failed without them.

@alyst (Member) commented Mar 23, 2024

I noticed that erf and logpdf aren't for HDBSCAN (but for Xmeans). I deleted them and updated the tests.

Great. erf is provided by SpecialFunctions.jl, but since we want to be conservative about the number of package dependencies, it is convenient that we don't need it.

@alyst (Member) left a comment

Thanks again for the PR!
The first iteration looks good -- we don't need much refactoring, mostly some method renames and object field tweaks for clarity.

Of the bigger things:

  • check the dimensions of the points matrix
  • use the Distances.jl API to calculate the point distances (see the other methods, e.g. clustering_quality(), for how we do it); it is worth precalculating all pairwise distances and passing them to the methods that calculate core distances and build the graph. We can also allow the user to specify the metric as a kwarg to hdbscan, which would be a useful generalization (see the sketch after this list).
  • add more tests to check that the point assignments to the clusters are correct
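
One possible shape of the suggested generalization: the positional arguments follow the PR's hdbscan! above, while the metric kwarg, its default, and the body are assumptions, not the final design:

using Distances

function hdbscan(points::AbstractMatrix, k::Int, min_cluster_size::Int;
                 metric::SemiMetric = Euclidean())
    dists = pairwise(metric, points, dims=2)  # d×n input, columns are points
    # core distance of each point: distance to its k-th nearest neighbor
    core_dists = [partialsort(dists[:, j], k + 1) for j in axes(dists, 2)]
    # ... build the mutual-reachability graph / MST from dists and core_dists ...
    return dists, core_dists
end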

(Inline review comments on docs/source/hdbscan.md, src/hdbscan.jl, and test/hdbscan.jl -- now resolved.)
there is no need to create a new file
add comment and alias
hdbscan.md remains
the progress is temporary
add comments and improve performance, etc.
changes suggested by alyst
error occurred with Julia 1.10
forgot to remove debugging code
add description for detail
fix links
make isnoise available to the user
@MommaWatasu (Author)

@alyst
I applied all your suggestions. Is there anything else I have to fix?

(More inline review comments on src/hdbscan.jl -- now resolved.)
@alyst (Member) commented Apr 27, 2024

I applied all your suggestions. Is there anything else I have to fix?

@MommaWatasu Thank you for adjusting the PR! We are getting close to being able to merge, but we need one more iteration.

Note that I have pushed some adjustments to the code directly to your branch, so please make sure to pull them first.

TODO items:

  • at the moment the tests that you have do not really check whether the clusters are correct. In fact, at the moment all the points are assigned to the noise cluster (I think it was also the case before my changes).
    Please add test(s) that the assignments/clusters are correct. Ideally, we need to test a more complex clustering (more than 2 clusters); it would also be nice to test that changing ncore or min_cluster_size affects the result.
  • Now that I understand the algorithm a bit more, it looks like HdbscanCluster is an internal structure: it is rather a node of the HDBSCAN tree than the cluster you return to the user. It contains fields like stability, children, etc.: some of them we should not expose to the user, and the others you don't really set when preparing the resulting clusters. I suggest that you rename this structure into HdbscanNode (a non-exported structure for the algorithm) and create a new one, HdbscanCluster (the exported one returned to the user). If there are any properties of the cluster, such as stability or being noise -- please make sure to add the relevant fields to HdbscanCluster and properly initialize them when generating the result.
  • HdbscanResult should inherit from ClusteringResult and support its API (in particular, counts is missing)
  • move UnionFind to a separate source file unionfind.jl. I'm not 100% sure it belongs in Clustering.jl. We could potentially use DataStructures.jl, but this code looks rather compact, so I prefer keeping it over depending on another big package.
  • clean up the UnionFind terminology. I think set_id/issameset/items would be an optimal choice (see the sketch after this list).
  • add unit tests for UnionFind, e.g. that finding the root, issameset, and unite! work as expected
  • add newlines in the docstrings to separate the declaration from the description
  • add newlines to separate struct fields from the inner constructor
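
For illustration, a minimal sketch of what unionfind.jl could look like with the suggested terminology (set_id/issameset/items/unite!); the implementation details (path halving, union by size) are assumptions, not the PR's actual code:

mutable struct UnionFind
    parent::Vector{Int}          # parent[i] == i marks a set root
    items::Vector{Vector{Int}}   # members of each set, stored at its root index
end

UnionFind(n::Int) = UnionFind(collect(1:n), [[i] for i in 1:n])

# id (root) of the set containing x, with path halving
function set_id(uf::UnionFind, x::Int)
    while uf.parent[x] != x
        uf.parent[x] = uf.parent[uf.parent[x]]  # shortcut every other link
        x = uf.parent[x]
    end
    return x
end

issameset(uf::UnionFind, x::Int, y::Int) = set_id(uf, x) == set_id(uf, y)

items(uf::UnionFind, x::Int) = uf.items[set_id(uf, x)]

# merge the sets containing x and y (union by size); returns the merged set's id
function unite!(uf::UnionFind, x::Int, y::Int)
    rx, ry = set_id(uf, x), set_id(uf, y)
    rx == ry && return rx
    if length(uf.items[rx]) < length(uf.items[ry])
        rx, ry = ry, rx
    end
    uf.parent[ry] = rx
    append!(uf.items[rx], uf.items[ry])
    empty!(uf.items[ry])
    return rx
end

The unit tests asked for above could then be as simple as:

using Test
uf = UnionFind(4)
unite!(uf, 1, 2)
@test issameset(uf, 1, 2)
@test !issameset(uf, 1, 3)
@test sort(items(uf, 1)) == [1, 2]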

alyst and others added 12 commits April 27, 2024 15:08
replace eachcol with alternatives
add newline
clean up UnionFind terminology and move it to the other file
there was no test for it
rename HdbscanCluster into HdbscanNode and create a new one to expose to the user
the algorithm went wrong
ensure that `min_size` takes effect properly
add counts field into HdbscanResult
@MommaWatasu (Author)

@alyst I have one thing to apologize for: I found that the reason the clustering result went wrong was a serious mistake of mine in the algorithm. I fixed the algorithm and checked that the result is correct. In addition, all the TODO items have been done (but I couldn't add a unit test for ncore, because I don't know how it affects the result).

I would appreciate it if you could check for any performance issues in the fixed algorithm.
