Skip to content

xyguo/clusterz

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

35 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Distributed Clustering with Outliers

This is the experiment code for our paper

X. Guo, S. Li. Distributed k-Clustering for Data with Heavy Noise. NIPS'18

Requirement

  • Python >= 3.4
  • numpy >= 1.15 and scipy >= 1.2.0
  • scikit-learn >= 17.0

Testing

To run the code, please see the kzcenter_exp.py file as example.

Progress

  1. Data preprocessing
  2. (k,z)-center algorithm
  3. coreset
  4. (k,z)-median
  5. (k,z)-means

Summary

All the algorithm implementations are under clusterz/algs:

  • kzcenter.py: includes the implementation of our distributed (k,z)-center algorithm, the algorithm by Malkomes et al.[1], the centralized (k,z)-center algorithm by Charikar et al.[2], and the 2-approx k-center algorithm by Hochbaum and Shmoys[3]

  • kz_clustering_from_others.py: contains the implementation for some (k,z)-clustering algorithms proposed recently, including Guha et al.[4] and Chen et al.[5]

  • kz_lp_clustering.py: contains the implementation for our distributed (k,z)-median/means algorithm. Actually this implementation works for any L_p norms.

  • kzmeans.py: instantiate the distributed L_p clustering routine defined in kz_lp_clustering.py with L_2 norm, thus obtain a distributed (k,z)-means routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7]

  • kzmedian.py: instantiate the distributed L_p clustering routine defined in kz_lp_clustering.py with L_1 norm, thus obtain a distributed (k,z)-median routine. Also contains implementations for centralized k/(k,z)-means algorithm based on Lloyd[6] and Chawla and Gionis[7]

  • misc.py: utility functions for the classes in kzcenter.py. Maily facilitate the ball-cover step.

  • robust_facility_location.py: centralized solver for the min-max k-clustering problem based on Anthony et al.[8]. Will be invoked by our distributed (k,z)-means/median algorithms as the final step.

[1] Gustavo Malkomes, Matt J. Kusner, Wenlin Chen,Kilian Q. Weinberger, and Benjamin Moseley. Fast distributed k-center clustering with outliers on massive data. NIPS'15

[2] Moses Charikar, Samir Khuller, David M. Mount, and Giri Narasimhan. Algorithms for facility location problems with outliers. SODA'01

[3] Dorit S. Hochbaum and David B. Shmoys. A best possible heuristic for the k-center problem. Math. Oper. Res., 1985

[4] Sudipto Guha, Yi Li, and Qin Zhang. Distributed partial clustering. SPAA'17

[5] Jiecao Chen, Erfan Sadeqi Azer, and Qin Zhang. A practical algorithm for distributed clustering and outlier detection. NIPS'18

[6] Stuart P. Lloyd: Least squares quantization in PCM. IEEE Trans. Information Theory. 1982

[7] Sanjay Chawla and Aristides Gionis. k-means−−: A unified approach to clustering and outlier detection. ICDM'13

[8] Barbara M. Anthony, Vineet Goyal, Anupam Gupta, and Viswanath Nagarajan. A plant location guide for the unsure: Approximation algorithms for min-max location problems. Math. Oper. Res., 2010.

Releases

No releases published

Packages

No packages published

Languages