KMeans

KMeans is a method that aims to cluster data into K groups of equal variance. The conventional KMeans algorithm has a performance bottleneck; however, when implemented on a parameter server (PS), KMeans achieves the same level of accuracy with better performance.

1. Introduction

The KMeans algorithm assigns each data point to its nearest cluster, where the distance is measured between the data point and the cluster's center. In general, the KMeans algorithm is implemented iteratively, as shown below:

[Figure: kmeans, the iterative update formulas]

where x_i is the i-th sample, c_i is the cluster nearest to it, and mu_j is the center of the j-th cluster.
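
A minimal sketch of this iteration in plain Python, assuming squared Euclidean distance and the standard update in which the center of cluster j becomes the mean of the samples assigned to j. The function names and the random initialization are illustrative choices, not taken from Angel.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest_center(x, centers):
    """Index j of the center closest to sample x."""
    return min(range(len(centers)), key=lambda j: squared_distance(x, centers[j]))

def kmeans_step(samples, centers):
    """One iteration: assign every sample, then move each center to the mean of its members."""
    assignments = [nearest_center(x, centers) for x in samples]
    new_centers = []
    for j, old in enumerate(centers):
        members = [x for x, c in zip(samples, assignments) if c == j]
        if members:
            dim = len(old)
            new_centers.append([sum(m[d] for m in members) / len(members) for d in range(dim)])
        else:
            new_centers.append(list(old))  # empty cluster: keep the previous center
    return assignments, new_centers

def kmeans(samples, k, iterations=20, seed=0):
    """Run a fixed number of iterations, starting from k randomly chosen samples (assumed init)."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(samples, k)]
    assignments = []
    for _ in range(iterations):
        assignments, centers = kmeans_step(samples, centers)
    return assignments, centers
```

Every call to `kmeans_step` scans all samples, which is the cost that the mini-batch variant reduces.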

Mini-batch KMeans

"Web-Scale K-Means Clustering" [1] proposes an improved KMeans algorithm that uses mini-batch optimization for training, in order to address the latency, scalability and sparsity requirements of user-facing web applications. The algorithm is shown below:

[Figure: mini_batch_kmeans, the mini-batch KMeans algorithm]
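
The mini-batch procedure from [1] can be sketched in plain Python as follows. Each center keeps a count of the samples it has absorbed, and the per-center learning rate is the reciprocal of that count. Sampling the batch with replacement and initializing from random samples are assumptions of this sketch, and the function name is illustrative.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def mini_batch_kmeans(samples, k, batch_size, iterations, seed=0):
    """Mini-batch KMeans with per-center learning rates, following the scheme in [1]."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(samples, k)]
    counts = [0] * k  # number of samples absorbed by each center so far
    for _ in range(iterations):
        batch = [rng.choice(samples) for _ in range(batch_size)]
        # Cache the nearest center for the whole batch before moving any center.
        nearest = [
            min(range(k), key=lambda j: squared_distance(x, centers[j]))
            for x in batch
        ]
        for x, j in zip(batch, nearest):
            counts[j] += 1
            eta = 1.0 / counts[j]  # per-center learning rate
            # Gradient step: move center j toward x by a fraction eta.
            centers[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(centers[j], x)]
    return centers, counts
```

Because every update is a convex combination of the old center and a sample, each center stays a running average of the samples assigned to it.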

2. Distributed Implementation on Angel

Model Storage

KMeans on Angel stores the K centers and the K per-center counts on the ParameterServer, using a K×N matrix to represent the K centers and a K×1 vector to represent the per-center counts, where K is the number of clusters and N is the dimension of the data, i.e. the number of features.
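
To get a feel for the size of this model, the sketch below lays out the two structures and estimates their footprint. It assumes dense storage with 8 bytes per value for both the matrix and the vector; the function name and the example values of K and N are hypothetical.

```python
def kmeans_model_bytes(k, n, bytes_per_value=8):
    """Bytes needed for a dense K x N center matrix plus a K x 1 count vector."""
    centers_bytes = k * n * bytes_per_value
    counts_bytes = k * bytes_per_value
    return centers_bytes + counts_bytes

# Hypothetical example: K = 10 clusters over N = 3000 features.
K, N = 10, 3000
centers = [[0.0] * N for _ in range(K)]  # K x N matrix, one row per center
counts = [0.0] * K                       # K x 1 vector, one count per center

size = kmeans_model_bytes(K, N)
print(size)  # 240080
```

Under these assumptions the model is small compared with the training data, so the traffic between workers and the PS is dominated by how often the centers are pulled and pushed rather than by the model size itself.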

Model Updating

KMeans on Angel is trained iteratively; during each iteration, the centers are updated from a mini-batch of samples.

Algorithm

The KMeans on Angel algorithm is as follows:

[Figure: KMeans_on_Angel, the KMeans on Angel algorithm]
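
The pull, compute and push pattern behind a parameter-server implementation can be illustrated with a single-process toy. This is not Angel's API and not necessarily Angel's exact update rule: `ToyParameterServer` and `worker_step` are hypothetical names, the update uses a reciprocal-count learning rate as in [1], and the `ml.kmeans.c` parameter is not modeled.

```python
class ToyParameterServer:
    """In-process stand-in for a parameter server (hypothetical, not Angel's API)."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]  # K x N matrix
        self.counts = [0] * len(initial_centers)           # K x 1 vector

    def pull(self):
        """Return copies of the current centers and counts."""
        return [list(c) for c in self.centers], list(self.counts)

    def push(self, center_deltas, count_deltas):
        """Add worker-computed increments to the stored model."""
        for j, delta in enumerate(center_deltas):
            self.centers[j] = [c + d for c, d in zip(self.centers[j], delta)]
            self.counts[j] += count_deltas[j]

def worker_step(ps, batch):
    """Pull the model, update a local copy on one mini-batch, push back only the differences."""
    centers, counts = ps.pull()
    k = len(centers)
    local = [list(c) for c in centers]
    local_counts = list(counts)
    for x in batch:
        # The nearest center is chosen against the pulled (not yet updated) centers.
        j = min(range(k), key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centers[i])))
        local_counts[j] += 1
        eta = 1.0 / local_counts[j]
        local[j] = [(1.0 - eta) * c + eta * xi for c, xi in zip(local[j], x)]
    center_deltas = [[l - c for l, c in zip(local[j], centers[j])] for j in range(k)]
    count_deltas = [lc - c for lc, c in zip(local_counts, counts)]
    ps.push(center_deltas, count_deltas)
```

Pushing increments instead of whole centers lets updates from several workers be summed on the server, which is one common reason to structure the exchange this way.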

3. Execution & Performance

Input Format

  • The data format is set by "ml.data.type", which supports the "libsvm", "dense" and "dummy" formats. For details, see Angel Data Format.

Parameters

  • IO Parameters

    • ml.feature.index.range: number of features
    • ml.data.type: Angel data format, e.g. "dummy" or "libsvm"
    • angel.train.data.path: input path for training data
    • angel.predict.data.path: input path for prediction data
    • angel.save.model.path: save path for the trained model
    • angel.predict.out.path: output path for prediction results
    • angel.log.path: save path for the log
  • Algorithm Parameters

    • ml.epoch.num: number of iterations
    • ml.minibatch.size: number of samples per mini-batch
    • ml.kmeans.center.num: K, number of clusters
    • ml.kmeans.c: learning rate
  • Resource Parameters

    • angel.workergroup.number: number of workers
    • angel.worker.memory.mb: memory requested for each worker, in MB
    • angel.worker.task.number: number of tasks on each worker, default is 1
    • angel.ps.number: number of PS
    • angel.ps.memory.mb: memory requested for each PS, in MB

Submit Command

  • Training Job

     ./bin/angel-submit \
     	--action.type=train \
     	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
     	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
     	--angel.train.data.path=$traindata \
     	--angel.save.model.path=$modelout \
     	--angel.output.path.deleteonexist=true \
     	--angel.log.path=$logpath \
     	--ml.data.type=libsvm \
     	--ml.model.type=T_DOUBLE_DENSE \
     	--ml.kmeans.center.num=$centerNum  \
     	--ml.kmeans.c=0.15 \
     	--ml.epoch.num=10 \
     	--ml.feature.index.range=$featureNum \
     	--ml.feature.num=$featureNum \
     	--angel.workergroup.number=4 \
     	--angel.worker.memory.mb=5000  \
     	--angel.worker.task.number=1 \
     	--angel.ps.number=4 \
     	--angel.ps.memory.mb=5000 \
     	--angel.job.name=kmeans_train
  • IncTraining Job

     ./bin/angel-submit \
     	--action.type=inctrain \
     	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
     	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
     	--angel.train.data.path=$traindata \
     	--angel.load.model.path=$modelout \
     	--angel.save.model.path=$modelout \
     	--angel.output.path.deleteonexist=true \
     	--angel.log.path=$logpath \
     	--ml.data.type=libsvm \
     	--ml.model.type=T_DOUBLE_DENSE \
     	--ml.kmeans.center.num=$centerNum \
     	--ml.kmeans.c=0.15 \
     	--ml.epoch.num=10 \
     	--ml.feature.index.range=$featureNum \
     	--ml.feature.num=$featureNum \
     	--angel.workergroup.number=4 \
     	--angel.worker.memory.mb=5000  \
     	--angel.worker.task.number=1 \
     	--angel.ps.number=4 \
     	--angel.ps.memory.mb=5000 \
     	--angel.job.name=kmeans_inctrain
  • Prediction Job

     ./bin/angel-submit \
     	--action.type=predict \
     	--angel.app.submit.class=com.tencent.angel.ml.clustering.kmeans.KMeansRunner  \
     	--ml.model.class.name=com.tencent.angel.ml.clustering.kmeans.KMeansModel \
     	--angel.predict.data.path=$predictdata \
     	--angel.load.model.path=$modelout \
     	--angel.predict.out.path=$predictout \
     	--angel.output.path.deleteonexist=true \
     	--angel.log.path=$logpath \
     	--ml.data.type=libsvm \
     	--ml.model.type=T_DOUBLE_DENSE \
     	--ml.kmeans.center.num=$centerNum \
     	--ml.feature.index.range=$featureNum \
     	--ml.feature.num=$featureNum \
     	--angel.workergroup.number=4 \
     	--angel.worker.memory.mb=5000  \
     	--angel.worker.task.number=1 \
     	--angel.ps.number=4 \
     	--angel.ps.memory.mb=5000 \
     	--angel.psagent.cache.sync.timeinterval.ms=500 \
     	--angel.job.name=kmeans_predict
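
The three commands above share most of their flags. As a convenience, a small helper such as the following could assemble them from one shared dictionary. The helper name, the placeholder paths and the numeric values for K and the feature count are hypothetical; only flags that appear in the commands above are used.

```python
import shlex

# Flags shared by the train, inctrain and predict commands above.
COMMON = {
    "angel.app.submit.class": "com.tencent.angel.ml.clustering.kmeans.KMeansRunner",
    "ml.model.class.name": "com.tencent.angel.ml.clustering.kmeans.KMeansModel",
    "angel.output.path.deleteonexist": "true",
    "ml.data.type": "libsvm",
    "ml.model.type": "T_DOUBLE_DENSE",
    "angel.workergroup.number": 4,
    "angel.worker.memory.mb": 5000,
    "angel.worker.task.number": 1,
    "angel.ps.number": 4,
    "angel.ps.memory.mb": 5000,
}

def build_command(action, extra, script="./bin/angel-submit"):
    """Return the argument list for one job: script, action type, then --key=value flags."""
    flags = {**COMMON, **extra}
    return [script, f"--action.type={action}"] + [f"--{k}={v}" for k, v in flags.items()]

# Placeholder paths and sizes; replace them with real values before submitting.
train = build_command("train", {
    "angel.train.data.path": "/path/to/train",
    "angel.save.model.path": "/path/to/model",
    "angel.log.path": "/path/to/log",
    "ml.kmeans.center.num": 10,
    "ml.kmeans.c": 0.15,
    "ml.epoch.num": 10,
    "ml.feature.index.range": 3000,
    "ml.feature.num": 3000,
    "angel.job.name": "kmeans_train",
})
print(shlex.join(train))
```

The returned list can be passed to `subprocess.run` directly, or printed with `shlex.join` and pasted into a shell.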

Performance

  • Data: SVHN, 3×10^3 features, 7×10^4 samples
  • Resources:
    • Angel: 4 executors, 5G memory, 1 task; 4 PS, 5G memory
  • Time for 100 epochs:
    • Angel: 45 min
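
As a rough reading of these numbers, the arithmetic below converts the total time into a per-epoch time and a throughput. It assumes that every epoch processes all 7×10^4 samples once, which the figures above do not state explicitly.

```python
epochs = 100
total_minutes = 45
samples = 7 * 10**4

seconds_per_epoch = total_minutes * 60 / epochs
samples_per_second = samples / seconds_per_epoch

print(seconds_per_epoch)          # 27.0
print(round(samples_per_second))  # 2593
```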

4. References

[1] Sculley D. Web-scale k-means clustering[C]// International Conference on World Wide Web, WWW 2010, Raleigh, North Carolina, USA, April 2010: 1177-1178.