Skip to content
/ krakow Public

Well balanced hierarchical graph clustering

Notifications You must be signed in to change notification settings

filyp/krakow

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

12 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Krakow

Well balanced hierarchical graph clustering.

It's based on the Paris algorithm, with just a tiny modification to make the trees more balanced. It's useful for example when you try to choose some cluster by traversing the tree top-down. It operates on networkx graphs.

Installation

pip install krakow

Usage

from krakow import krakow

dendrogram = krakow(Graph)

Graph must be a networkx graph, and dendrogram is in a linkage matrix format. If you want to traverse the tree from top to down, it's useful to convert this linkage matrix into a tree with scipy.cluster.hierarchy.to_tree.

You can easily visualize dendrograms:

from krakow.utils import create_dendrogram

create_dendrogram(dendrogram)

Img

Balance comparison

Balance can be adjusted with the alpha parameter. On default it's set to 2 (can be seen in the image above). When set to 1, the algorithm is identical to the Paris algorithm. The tree is clearly less balanced then:

dendrogram = krakow(Graph, alpha=1)
create_dendrogram(dendrogram)

Img

For alpha=1.5: Img

Any value >= 1 can be set, but enforcing a very high balance can harm the clustering quality.

Math

The only difference from the Paris algorithm, is the definition of distance between clusters. In Paris (see equation 2 in their paper), it's:

formula

Here, it's defined as:

formula

(with both alpha and beta >= 1)

In effect, big clusters are treated as having more distance between them, so small clusters will be merged first. This leads to a more balanced tree.

TODO: prove that for alpha, beta >= 1 algorithm is still reducible (which is necessary for correctness).

We can also experiment with other distance definitions (as long as they are reducible), and check which give the best balance while preserving clustering quality (clustering quality measured by Dasgupta's cost).