$k$-FreqItems: A New Sparse Data Clustering Method Based on Frequent Items

Welcome to the k-FreqItems GitHub!

In this repo, we share the implementations and experiments of our work A New Sparse Data Clustering Method Based on Frequent Items in SIGMOD 2023. We implement k-FreqItems and SILK with CUDA on a distributed platform (one master with 4 GPUs and one slave with 4 GPUs) for clustering massive sparse set data (or say, sparse binary vectors) on Jaccard distance. We also adapt two state-of-the-art methods k-Means++ and k-Means$\parallel$ for Jaccard distance as baselines (by using FreqItem as the cluster center).

Data Sets

Data Sets Details

We choose two small sparse data sets with ground truth laebls (i.e., News20 and RCV1) and five large-scale real-world sparse data sets (i.e., URL, Avazu, KDD2012, Criteo10M and Criteo1B) for performance evaluation. Users can download the data sets from their links. The statistics of data sets are summarized as follows.

Data Sets	# Data	# Dim	# Non-Zero Dim	Data Size	Global $\alpha$	Local $\alpha$
News20	$2.0 \times 10^4$	$6.2 \times 10^4$	80	6.3 MB	0.2	-
RCV1	$5.3 \times 10^5$	$4.7 \times 10^4$	65	136 MB	0.2	-
URL	$2.3 \times 10^6$	$2.3 \times 10^6$	116	1.1 GB	0.4	0.2
Criteo10M	$1.0 \times 10^7$	$1.0 \times 10^6$	39	1.6 GB	0.2	0.1
Avazu	$4.0 \times 10^7$	$1.0 \times 10^6$	15	2.6 GB	0.3	0.2
KDD2012	$1.5 \times 10^8$	$5.4 \times 10^7$	11	7.3 GB	0.5	0.2
Criteo1B	$1.0 \times 10^9$	$1.0 \times 10^6$	39	153 GB	0.2	0.1

Data Format

The input sparse data sets are stored in binary format, where each file consists of two fields: pos and data, as shown below:

field	field type	description
pos	`uint64_t`*(n+1)	start position of `data` for each point
data	`int`*pos[n]	non-zero dimensions IDs of all points

pos is an uint64_t array of (n+1) length, which stores the start position of the data array for each sparse data point. data is an int array of pos[n] length, which store the non-zero dimensions IDs of all sparse data points. Here we assume that all non-zero dimension IDs can be represented by an int type integer. If the dimensionality exceeds the range of int, one can store the non-zero dimension IDs by uint64_t type with minor modification.With pos and data, one can efficient retrieve a specific data point with its data ID.

For example, suppose there is a sparse data set with four points: x_0={1,3,5,8}, x_1={1,3}, x_2={1,6,8}, and x_3={1,8,10}. Then, the pos array is [0,4,6,9,12], e.g., pos[0]=0, pos[4]=12. And the data array is [1,3,5,8,1,3,1,6,8,1,8,10]. If you want to retrieve x_1, you can first get its start position of data and its length from pos by its data ID 1, i.e., start position is pos[1]=4, and its length is pos[1+1]-pos[1]=6-4=2. Then you can retrieve x_1 from data by the start position 4 and its length 2, i.e., x_1={1,3}.

We have included the transformation code in the folder transformation/ to convert sparse data sets into the binary format. In addition, we also provide the binary format of the datasets we used (except Criteo1B) with 1, 2, 4, 8 slices. Users could download the datasets here.

Compilation

The source codes require nvcc with c++11 support. We have provided Makefile for compilation. However, for different machines, users may need to specify your local path to MPI libary and CUDA SM and Gencode variations. Suppose the Makefile is in accordance with the machines, users can use the following commands to compile the source codes:

cd methods/
make clean
make -j

Representative NVIDIA GPU cards for each architecture name and CUDA version can be found in http://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/.

Usages

We provide bash (and python) scripts to reproduce all experiments reported in our SIGMOD 2023 paper. Suppose you have cloned this repo and you are in the folder k-FreqItems/.

Step 1: Prepare Data Sets

Please first download the datasets from the links we provided. Then, you can use transformation/sparse.cc to convert them into the binary format we support. If you want to try multiple GPU and/or nodes, you could further run tranformation/partitin.cc to split the data into 1,2,4,8 slices. After that, you copy the binary files to a specific folder like data/. For example, when you get URL_0.bin, you can move it to the path data/URL/URL_0.bin.

Step 2: Reproduce Experiments

After preparing the data sets, we provide bash scripts to run k-FreqItems, SILK, and two baselines k-FreqItems++ and k-FreqItems$\parallel$. Users can reproduce the experiments by simply running the following command:

cd methods/
bash run.sh

By default, we provide the scripts for running methods with 4 GPUs once users specify the data path. With some minor modifications, users can run methods with different number of GPUs based on their machines.

Step 3: Visualizations

We provide python scripts (i.e., plot.py and plot_util.py in scripts/) to reproduce all the figures that are appeared in our SIGMOD 2023 paper. These scripts require python 3.7 (or higher versions) with numpy, scipy, and matplotlib installed. If not, you might need to use anaconda to create a new virtual environment and use pip to install those packages.

With the experimental results from step 2, users can reproduce all the figures with the following commands.

cd scripts/
python plot.py

Step 4: Experiments for Center Representation

Finally, we provide the source codes and results for the validation of FreqItem representation (Section 3.2 in our SIGMOD 2023 paper). Please refer to the folder representation/.

Reference

Thank you so much for being so patient to read the user manual. We will appreciate using the following BibTeX to cite this work when you use k-FreqItems (or SILK) in your paper.

@article{huang2023new,
  title={A New Sparse Data Clustering Method Based On Frequent Items},
  author={Huang, Qiang and Luo, Pingyi and Tung, Anthony KH},
  journal={Proceedings of the ACM on Management of Data},
  volume={1},
  number={1},
  pages={1--28},
  year={2023},
  publisher={ACM New York, NY, USA}
}

It is welcome to contact me (huangq@comp.nus.edu.sg) if you meet any issue. Thank you.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
figures		figures
methods		methods
representation		representation
scripts		scripts
transformation		transformation
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

figures

figures

methods

methods

representation

representation

scripts

scripts

transformation

transformation

.gitattributes

.gitattributes

.gitignore

.gitignore

LICENSE

LICENSE

README.md

README.md

Repository files navigation

$k$-FreqItems: A New Sparse Data Clustering Method Based on Frequent Items

Data Sets

Data Sets Details

Data Format

Compilation

Usages

Step 1: Prepare Data Sets

Step 2: Reproduce Experiments

Step 3: Visualizations

Step 4: Experiments for Center Representation

Reference

About

Releases

Packages

Languages

License

HuangQiang/k-FreqItems

Folders and files

Latest commit

History

Repository files navigation

$k$-FreqItems: A New Sparse Data Clustering Method Based on Frequent Items

Data Sets

Data Sets Details

Data Format

Compilation

Usages

Step 1: Prepare Data Sets

Step 2: Reproduce Experiments

Step 3: Visualizations

Step 4: Experiments for Center Representation

Reference

About

Topics

Resources

License

Stars

Watchers

Forks

Languages