Crystal:A very fast distributed subgraph matching algorithm

This project contains the implemetation of the method described in Subgraph matching: on compression and computation published in VLDB2018.

Patterns

Environment Requirement

Hadoop 2.7.3.

How to use?

1) preprocess datasets (in edge list form) by

run "Preprocess.sh"

2) perform counting

run "CountingTest.sh", which will count subgraph matching for all pattern listed.

3) generate compression form

run "WritingTest.sh", which will dump the the result of subgraph matching for all pattern listed to HDFS

Detail Specification

Source Code is in Subgraph folder, which can be open by Intellij Idea.

In it you can find three packages:

sAlg
- Square
- Chordal Square
- House
  - HouseS: for small data graph
  - House:for large data graph
- Chordal Roof
- Three Triangle
- Solar Square
- Near5Clique
- QuadTriangle
- SolarSquarePlus
sData: data structure used in sAlg
sPreprocess: Preprocessing Related
- sCliqueGeneration
  - classes used to generate the crystal
- sTool
  - preprocessing the graph

Each of the patterns comes with two java class, one for counting, one for outputting(class with name ended with O). More detail please read the code.

Test File

Files related to testing is at Top level.

Data: small datasets for testing the algorithm, in which each dataset is attached with a *-pattern-size.txt recording the pattern's size in each graph for each pattern.
Preprocess.sh: Preprocessing the Data for CountingTest.sh and WritingTest.sh.
CountingTest.sh: Testing the code related to counting in Crystal
WritingTest.sh: Testing the code related to writing in Crystal
pattern.eps: give a visulization about what pattern looks like

Please run Preprocess.sh first, before running the other two shell script. More detail please read the .sh file, and change some of the variables accordingly in the script file.

Parameter

mapreduce.job.reduces: better using the same number as vCore in yarn
mapreduce.reduce.memory.mb: the default is 4000MB, if graph bigger than UK2002 need to be processed, this might need to increase.
mapreduce.map.memory.mb: the default is 4000MB, if graph bigger than UK2002 need to be processed, this might need to increase.
test.memory: same as mapreduce.map.memory.mb. But only available in sAlg.
bloomHash: controlling hash function used in Bloom filter
bloomBit: controlling numbers of bit for representing a record in Bloom filter
test.p
- test.p (for classes in sIntersection): controlling the number of the round for the algorithm, the lower the better (if memory permit)
- test.p (for classes in sPartition)L controlling the number of partitions for the algorithm, the lower the better (if memory permit)
test.isOutput: specify whether the result should be outputted in *O.class.
test.isEnumerating: whether just counting or enumerating.

Format

InputFormat: Edge List
OutputFormat: file format specified in the paper.

    //Below three configuration is commented for maximum compatibility for the different version of Hadoop, if possible, they should be uncommented to improve the performance and compression ratio.
    
    //These configurations can be found in the source code.
    
    SequenceFileOutputFormat.setOutputCompressionType(job, CompressionType.BLOCK);
    
    FileOutputFormat.setCompressOutput(job, true);
    
    FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

Core-Crystal Decomposition

The pattern decomposition aims at optimzing Objective 3 in the paper.

This package provides decomposition.cpp which decomposes the pattern graph with/without the statistics of the cliques.

input specification Input the pattern graph to pattern.txt.
first line : n m b n: # of nodes, m: # of edges. Note that nodes are labeled with [1,n]. b: a boolean value of 0 or 1, specifying that statistics are used (1) or not (0) followed by m lines, each line x y describes an undirected edge (x,y)

followed by a line, if b=1,
k M k: the largest clique with statistics, k should be at least = the largest clique size in the pattern graph M: memory size followed by k lines, the i-th line specifies the total number of Clique_i in the target graph. Specifically, the first line is the # of nodes in the target graph and the second line is the # of edges in the target

Example input 1

4 4 0
1 2
1 3
2 4
3

Example input 2
4 4 1
1 2
1 3
2 4
3 4

2 20
100
200

Please refer to patternpool.txt for further example

For citatation

@article{qiao2017subgraph,
  title={Subgraph matching: on compression and computation},
  author={Qiao, Miao and Zhang, Hao and Cheng, Hong},
  journal={Proceedings of the VLDB Endowment},
  volume={11},
  number={2},
  pages={176--188},
  year={2017},
  publisher={VLDB Endowment}
}

Name		Name	Last commit message	Last commit date
Latest commit History 15 Commits
Data		Data
Subgraph		Subgraph
patterndecomposition		patterndecomposition
CountingTest.sh		CountingTest.sh
CountingTest.sh.out.txt		CountingTest.sh.out.txt
LICENSE		LICENSE
Preprocess.sh		Preprocess.sh
Preprocess.sh.out.txt		Preprocess.sh.out.txt
README copy.md		README copy.md
README.md		README.md
Subgraph.jar		Subgraph.jar
WritingTest.sh		WritingTest.sh
pattern.eps		pattern.eps
pattern.jpg		pattern.jpg
pid.txt		pid.txt
presentation.pdf		presentation.pdf
readme.pdf		readme.pdf
test.sh		test.sh
test.sh.out.txt		test.sh.out.txt
trove4j-3.0.3.jar		trove4j-3.0.3.jar

License

H20Zhang/Crystal

Folders and files

Latest commit

History

Repository files navigation

Crystal:A very fast distributed subgraph matching algorithm

Patterns

Environment Requirement

How to use?

1) preprocess datasets (in edge list form) by

2) perform counting

3) generate compression form

Detail Specification

Test File

Parameter

Format

Core-Crystal Decomposition

contact me at hzhang@se.cuhk.edu.hk

About

Topics

Resources

License

Stars

Watchers

Forks

Languages