culturegraph-clustering

Table of Contents

Installation
- Requirements
- Build
Command Line Client
- Example
- Usage
Good To Know
- Temporary Files

A implementation of the graph clustering method used in culturegraph.

The input problem is encoded as a bipartite input graph. A cluster is defined as a connected component in this input graph.

A bipartite graph G(U,V) is represented by two node sets (namely U and V). We represent the bipartite graph with an adjacency list, that lists all children for each parent (assuming a digraph).

Although the used bipartite graph is undirected, we refer to the node set U with "Parent Node Set" and to the node set V with "Child Node Set".

Installation

Requirements

Java 8 or later

Build

Run the following command to construct a jar with dependencies.

gradlew clean test fatJar

Command Line Client

Example

The input file describes a adjacency list, where each line starts with a parent node, followed by its child nodes. Parent and child nodes belong to distinct sets of nodes that only share connections between each other.

The file graph.txt represent a graph as a adjacency list has the schema PARENT CHILD [… CHILD] for each line:

A a b c
B a
C f

Compute the connected components:

java -Xmx1G -jar clustering-cli-VERSION.jar -o output.graphml graph.txt

Tip	For large problem instances, you need to tweak the JVM Memory option `-Xmx`.

The resulting GraphML file:

<?xml version="1.0" encoding="UTF-8"?>
<graphml xmlns="http://graphml.graphdrawing.org/xmlns"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://graphml.graphdrawing.org/xmlns
        http://graphml.graphdrawing.org/xmlns/1.0/graphml.xsd">
<key id="d0" for="node" attr.name="label" attr.type="string"/>
<key id="d1" for="all" attr.name="component" attr.type="int"/>
<graph id="G" edgedefault="undirected" parse.order="nodesfirst">
<node id="n0"><data key="d0">A</data><data key="d1">1</data></node>
<node id="n1"><data key="d0">B</data><data key="d1">1</data></node>
<node id="n2"><data key="d0">C</data><data key="d1">2</data></node>
<edge id="e0" source="m0" target="n1"><data key="d1">1</data></edge>
<edge id="e1" source="m0" target="n0"><data key="d1">1</data></edge>
<edge id="e2" source="m1" target="n0"><data key="d1">1</data></edge>
<edge id="e3" source="m2" target="n0"><data key="d1">1</data></edge>
<edge id="e4" source="m3" target="n2"><data key="d1">2</data></edge>
</graph>
</graphml>

Graph visualization (made with Gephi):

The result GraphML file contains a component number for each node v in V . All nodes that share the same component number form a cluster.

You may extract the node to component mapping with:

cat output.graphml | grep "^<node" | sed -n 's:.*<data.*>\(.*\)</data><data.*>\(.*\)</data>.*:\1 \2:p'

A 1
B 1
C 2

Usage

Usage of the command line client.

Usage:

<main class> [-chV] -o=FILE [-s=NUM] INPUT

Description:

Partitions a bipartite-graph into connected components.

Parameters:

      INPUT                   Input adjacency list (plain text or gzip).

Options:

  -o, --output=FILE           GraphML file.
  -c, --compress              Compress output with gzip.
  -s, --size=NUM              Minimum component size.
                                Default: 0
  -h, --help                  Display this help message.
  -V, --version               Display version info.

Logging

This project uses the Slf4j SimpleLogger implementation for logging.

Any change in the logging behaviour needs to be passed by System Properties before running the program.

Example:

java -Dorg.slf4j.simpleLogger.logFile=/var/log/messages\
 -Dorg.slf4j.simpleLogger.defaultLogLevel=error\
 -jar <JAR> -c -o output.graphml.gz input.csv

SimpleLogger Properties:

Use org.slf4j.simpleLogger.logFile for setting a output file (Default: System.err).
Use org.slf4j.simpleLogger.defaultLogLevel for setting a log level (Default: info).

Good To Know

The input adjacency list should only contain unique lines.
If a connected component does not reaches the minimum component size, each parent node in this component is assigned to the component "-1".

Temporary Files

The procedure creates the following temporary files during a run.

Table 1. Table Temporary Files

Name	Description
childNodeHashes.tmp	A list of uniques hashes for each unique child node label.
encodedInput.tmp	A encoded representation of the input adjacency list, where each node label is mapped to a unique numerical identifier.
encodedParentNodes.tmp	A label to number mapping for each parent node.

Name		Name	Last commit message	Last commit date
Latest commit History 45 Commits
algorithm		algorithm
assets		assets
cli		cli
example		example
gradle/wrapper		gradle/wrapper
graph		graph
io		io
label-encoder		label-encoder
README.adoc		README.adoc
build.gradle		build.gradle
gradlew		gradlew
gradlew.bat		gradlew.bat
settings.gradle		settings.gradle

culturegraph/culturegraph-clustering

Folders and files

Latest commit

History

Repository files navigation

culturegraph-clustering

Installation

Requirements

Build

Command Line Client

Example

Usage

Logging

Good To Know

Temporary Files

About

Topics

Resources

Stars

Watchers

Forks

Languages