Skip to content
This repository has been archived by the owner on Dec 19, 2018. It is now read-only.

culturegraph/culturegraph-clustering

Repository files navigation

culturegraph-clustering

A implementation of the graph clustering method used in culturegraph.

The input problem is encoded as a bipartite input graph. A cluster is defined as a connected component in this input graph.

A bipartite graph G(U,V) is represented by two node sets (namely U and V). We represent the bipartite graph with an adjacency list, that lists all children for each parent (assuming a digraph).

Although the used bipartite graph is undirected, we refer to the node set U with "Parent Node Set" and to the node set V with "Child Node Set".

Installation

Requirements

  • Java 8 or later

Build

culturegraph clustering

Run the following command to construct a jar with dependencies.

gradlew clean test fatJar

Command Line Client

Example

The input file describes a adjacency list, where each line starts with a parent node, followed by its child nodes. Parent and child nodes belong to distinct sets of nodes that only share connections between each other.

The file graph.txt represent a graph as a adjacency list has the schema PARENT CHILD […​ CHILD] for each line:

A a b c
B a
C f

Compute the connected components:

java -Xmx1G -jar clustering-cli-VERSION.jar -o output.graphml graph.txt
Tip
For large problem instances, you need to tweak the JVM Memory option -Xmx.

The resulting GraphML file:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
        http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="d0" for="node" attr.name="label" attr.type="string"/>
<key id="d1" for="all" attr.name="component" attr.type="int"/>
<graph id="G" edgedefault="undirected" parse.order="nodesfirst">
<node id="n0"><data key="d0">A</data><data key="d1">1</data></node>
<node id="n1"><data key="d0">B</data><data key="d1">1</data></node>
<node id="n2"><data key="d0">C</data><data key="d1">2</data></node>
<edge id="e0" source="m0" target="n1"><data key="d1">1</data></edge>
<edge id="e1" source="m0" target="n0"><data key="d1">1</data></edge>
<edge id="e2" source="m1" target="n0"><data key="d1">1</data></edge>
<edge id="e3" source="m2" target="n0"><data key="d1">1</data></edge>
<edge id="e4" source="m3" target="n2"><data key="d1">2</data></edge>
</graph>
</graphml>

Graph visualization (made with Gephi):

Graph Visualization

The result GraphML file contains a component number for each node v in V . All nodes that share the same component number form a cluster.

You may extract the node to component mapping with:

cat output.graphml | grep "^<node" | sed -n 's:.*<data.*>\(.*\)</data><data.*>\(.*\)</data>.*:\1 \2:p'
A 1
B 1
C 2

Usage

Usage of the command line client.

Usage:

<main class> [-chV] -o=FILE [-s=NUM] INPUT

Description:

Partitions a bipartite-graph into connected components.

Parameters:

      INPUT                   Input adjacency list (plain text or gzip).

Options:

  -o, --output=FILE           GraphML file.
  -c, --compress              Compress output with gzip.
  -s, --size=NUM              Minimum component size.
                                Default: 0
  -h, --help                  Display this help message.
  -V, --version               Display version info.

Logging

This project uses the Slf4j SimpleLogger implementation for logging.

Any change in the logging behaviour needs to be passed by System Properties before running the program.

Example:

java -Dorg.slf4j.simpleLogger.logFile=/var/log/messages\
 -Dorg.slf4j.simpleLogger.defaultLogLevel=error\
 -jar <JAR> -c -o output.graphml.gz input.csv

SimpleLogger Properties:

  • Use org.slf4j.simpleLogger.logFile for setting a output file (Default: System.err).

  • Use org.slf4j.simpleLogger.defaultLogLevel for setting a log level (Default: info).

Good To Know

  • The input adjacency list should only contain unique lines.

  • If a connected component does not reaches the minimum component size, each parent node in this component is assigned to the component "-1".

Temporary Files

The procedure creates the following temporary files during a run.

Table 1. Table Temporary Files
Name Description

childNodeHashes.tmp

A list of uniques hashes for each unique child node label.

encodedInput.tmp

A encoded representation of the input adjacency list, where each node label is mapped to a unique numerical identifier.

encodedParentNodes.tmp

A label to number mapping for each parent node.