
Kmeans|| #650

Open. Wants to merge 49 commits into base: master.
Conversation

trivialfis
Collaborator

@trivialfis trivialfis commented Jul 20, 2018

This is still WIP, for early review.

TODOs:

  1. Make the algorithm work as expected.
  2. Add proper tests.
  3. Make the algorithm work with k-means.
  4. Factor out operations.
  5. Fix and rebase build-system.
  6. Merge tSVD matrix with KmMatrix.

The above, plus tuning the algorithm described in #642.

close #642

@trivialfis trivialfis requested a review from mdymczyk July 20, 2018 09:12
@trivialfis
Collaborator Author

All tests for KmMatrix should pass.

@RAMitchell RAMitchell left a comment

The fact that this PR also contains your other PR makes it harder to review in isolation. In future it's best to keep them separate, although that is not always practical. In this case we will have to review and merge your build-system PR, then rebase this one, review it again, and then hopefully merge.

Also try not to commit formatting changes as they also make review difficult.

@@ -0,0 +1,95 @@
#include <thrust/random.h>


I am not sure if you need all of this code for generating random numbers. Consider using thrust:
https://gist.github.com/ashwin/7245048

Thrust is included by default with the cuda toolkit and will allow you to do a lot of things much more easily in a C++ 11 style with high performance.
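For reference, the approach in the linked gist can be sketched roughly as below. This is a hedged illustration, not code from this PR; the functor name and the [0, 1) range are made up for the example. Each thread seeds its own engine and skips ahead by its index, which avoids any curand dependency:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>

// Each index gets its own engine; discard(idx) decorrelates the streams,
// so generation is deterministic and fully parallel.
struct uniform_rng {
  float lo, hi;
  uniform_rng(float _lo, float _hi) : lo(_lo), hi(_hi) {}
  __device__ float operator()(unsigned int idx) const {
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist(lo, hi);
    rng.discard(idx);  // skip ahead to this thread's subsequence
    return dist(rng);
  }
};

thrust::device_vector<float> make_uniform(size_t n) {
  thrust::device_vector<float> out(n);
  thrust::transform(thrust::counting_iterator<unsigned int>(0),
                    thrust::counting_iterator<unsigned int>(n),
                    out.begin(), uniform_rng(0.0f, 1.0f));
  return out;
}
```

Note that `discard` is cheap for thrust's default engine, so this scales well even for large n.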

Collaborator Author

It just occurred to me that curand might help gain a little performance and is easy enough to use. But your example is much cleaner. I will try to integrate it and run some benchmarks on a larger dataset. If the gain is trivial, I will just remove the curand dependency.

#define M_DEV __device__
#define M_DEVINLINE __device__ __forceinline__
#define M_HOSTDEVINLINE __host__ __device__ __forceinline__


This cuda check macro is duplicated here:

// TODO(pseudotensor): Avoid throw for python exception handling. Need to avoid all exit's and return exit code all the way back.

And here:

#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)

I suggest you use the one in utils and delete the one in kmeans. My personal preference is to use the variant of error checking that throws an exception. This gives you more control of how you want to program to exit. We should use the same method throughout H2O4GPU.
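As a sketch of the throwing variant being recommended: only the `safe_cuda` and `throw_on_cuda_error` names come from the quoted macro; the body below is an illustrative reconstruction, not necessarily the exact utils implementation.

```cuda
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// Throwing CUDA error check: callers (including Python bindings) can catch
// the exception and decide how to exit, instead of the process aborting.
inline cudaError_t throw_on_cuda_error(cudaError_t code,
                                       const char *file, int line) {
  if (code != cudaSuccess) {
    std::ostringstream ss;
    ss << file << "(" << line << "): " << cudaGetErrorString(code);
    throw std::runtime_error(ss.str());
  }
  return code;
}

#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)
```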

Collaborator Author

I intend to remove all of this utils code in KmMatrix later; sorry for not stating that. I just don't want to pollute the rest of the repo at this stage. The namespace should also be consistent with the rest of the project, though I haven't figured out how to do that yet.

@@ -0,0 +1,311 @@
/*!


I can see you have implemented KmMatrix using the pimpl design pattern. We did this in xgboost for HostDeviceVector only because it had to compile for CPU only as well as with GPU. This seems unnecessarily complicated for this project. I will check with the others if we need to maintain this requirement of compiling for CPU only.

After looking through your code your matrix class seems very similar in function to https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/data/matrix.cuh

Please explain why you can't use this.
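For readers unfamiliar with the pattern under discussion, a minimal pimpl sketch follows. The names are hypothetical, not the actual KmMatrix API; the point is that the public type forwards to an implementation type, keeping backend (e.g. CUDA) details out of the header at the cost of an extra indirection.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Implementation type: in a real pimpl split this would live in a .cu/.cpp
// file, and the header would only forward-declare it.
class MatrixImpl {
 public:
  MatrixImpl(size_t rows, size_t cols) : data_(rows * cols), rows_(rows) {}
  size_t rows() const { return rows_; }
 private:
  std::vector<float> data_;
  size_t rows_;
};

// Public type: its only data member is a pointer to the implementation,
// and every method forwards through it.
class Matrix {
 public:
  Matrix(size_t rows, size_t cols)
      : impl_(std::make_shared<MatrixImpl>(rows, cols)) {}
  size_t rows() const { return impl_->rows(); }
 private:
  std::shared_ptr<MatrixImpl> impl_;
};
```

The extra boilerplate per method is the complexity cost the reviewer is pointing at.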

Collaborator Author

For starters, the matrix in data doesn't deal with CPU, and expanding it to support CPU would amount to a rewrite. But now that dropping CPU support completely is possible, this seems feasible.

Device context is an explicit parameter for every function, but not all functions use it. That's fine if we just want a matrix that works for the current implementation. But if we want a generic data structure that encapsulates the complexity, the matrix in data might also need a big change.

Certainly, given all the time I spent writing KmMatrix and the few hundred lines of code in that matrix, the above could be done. The real difficulty is that there are no C++ unit tests, so I would need to rely on the Python tests, which are more like integration tests. I tried running the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on the existing tests. This also contributes to why I re-implemented those utils.

I hope we can discuss this more lengthily.


I don't actually mind which direction you come at it from but in the end these two matrix representations must be unified. I think we still want to support matrices with CPU memory, but it is no longer a requirement that the code must compile without cuda.

Device context is an explicit parameter for every function, but not all functions use it.

You could possibly remove this and rely on your Singleton class to get things like the cublas handle inside these functions.
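A hypothetical sketch of that suggestion, using the GpuInfo name from this PR (the body is illustrative, not the PR's actual class): a Meyers singleton owning the cuBLAS handle means device helpers no longer need a context parameter threaded through every signature.

```cuda
#include <cublas_v2.h>

class GpuInfo {
 public:
  static GpuInfo &ins() {
    static GpuInfo instance;  // constructed once, on first use
    return instance;
  }
  cublasHandle_t cublas_handle() const { return handle_; }
  GpuInfo(const GpuInfo &) = delete;
  GpuInfo &operator=(const GpuInfo &) = delete;
 private:
  GpuInfo() { cublasCreate(&handle_); }
  ~GpuInfo() { cublasDestroy(handle_); }
  cublasHandle_t handle_;
};

// Inside a function that previously took a context parameter:
//   cublasHandle_t h = GpuInfo::ins().cublas_handle();
```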

The real difficulty is that there are no C++ unit tests, so I would need to rely on the Python tests, which are more like integration tests. I tried running the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on the existing tests.

This is worrying. Perhaps it's a difference in your environment. We will see when we sort out continuous integration.

Collaborator Author

in the end these two matrix representations must be unified

That's the plan, a unified matrix.


namespace H2O4GPU {
namespace KMeans {


These thread index helpers are not specific to kmeans and so should go in a common utilities file shared by all algorithms.



// Work around for shared memory
// https://stackoverflow.com/questions/20497209/getting-cuda-error-declaration-is-incompatible-with-previous-variable-name

This should be named KmSharedMem instead of KmShardMem as the word "shard" can have a different meaning in the context of GPU programming. Although we do not seem to have an explicit code style for this project, Google style recommends not using abbreviated words in variable names. So this would then be: KmSharedMemory.

As above I also don't think this code is specific to kmeans so should go in a shared utilities file.

Collaborator Author

Will be changed.

@@ -0,0 +1,76 @@
/*!


This GPU info class is not specific to kmeans. Check how the other algorithms are getting this information then consider moving into a common location and making the other algorithms also use this class. Ideally we want things done consistently across the code base. This makes it much easier to maintain.

@@ -1,37 +1,37 @@
/*!


This file seems to be all formatting changes. I can't see whether you have actually changed anything. It is better to commit only actual changes and deal with formatting issues in a separate commit or PR.

I will take another look at this if you revert the formatting changes.

Collaborator Author

There are no changes in it; I kept my code isolated from the rest of the project, and my hope is to remove this file. Sorry for the mess.

@trivialfis
Collaborator Author

trivialfis commented Jul 22, 2018

I will at least get the CMake part of the build system merged. I will work on it as soon as I can see the Jenkins report.

The build is performed by cmake. Add missing <numeric> header and use
cblas/cblas.h instead of cblas.h.
* Add test target to CMakeLists.txt.
* Fix include path in cu files in tests/cpp/gpu/kmeans.
* Make SWIG an optional dependency.
@trivialfis trivialfis force-pushed the kmeansll branch 2 times, most recently from 61be357 to 2822872 Compare July 25, 2018 09:11
Using Eigen could make KMeans work, but when it comes to expanding the
clusters the memory copies are expensive. It's also hard to launch
separate kernels without extra memory copies with Eigen.

So, proceed with a home brew matrix.
* Split proxy implementation into cpp file to save some compile time.
* Fix some memory error.
* Add two tests.
* Checking code is factored into KmConfig
* GpuInfo now takes care of cublas_handle;
* MatrixDim for later use.
Builds basic kmeans|| framework on top of KmMatrix. The algorithm is
not working yet.
* Fix copy_if prob_ptr index.
* Add document for the algorithm object interface.
* Move Generator into a member variable and complete its
implementation.
* Removes dead code.
* Create Generatorbase class for interface.
* Move generator related code into KmMatrix.
* Move shared memory related code into KmMatrix.
* Create mock object based on Generatorbase for later test.
@trivialfis
Collaborator Author

@RAMitchell Ready for another review. \O/ This time I rebased on a build-system branch that contains only CMake file changes; hopefully better than last time.

The example of using thrust to generate random numbers is not implemented yet; the PR is getting way too heavy now.
I might need the build log to get the build system right. If that is not going to be available soon, then maybe we or someone else can handle the build later.
The code builds on Fedora 27 with CUDA 9.2; if there are errors on your local system, please send me the complete log.

@RAMitchell RAMitchell left a comment

Looks very good in general; there is a clear structure to your code and very good commenting. See the comments for a few issues. I am not going to build/test myself as I don't have time, and continuous integration should ideally be enforcing that this works on the required platforms.

@@ -0,0 +1,54 @@
/*!


This seems like a strange file name.

Collaborator Author

That is the original utils.h from h2o4gpu with some functions moved from the gpu code.
The *.h.in is for cmake configuration, specifically the USE_CUDA flag. I know compiling without CUDA is not supported, but I tried to retain the possibility anyway. If that's a burden, I can remove it in no time.

__global__ void self_row_argmin_sequential(kParam<int> _res, kParam<T> _val) {

size_t idx = global_thread_idx();
if (idx < _val.rows) {


Consider doing this: if (idx >= _val.rows) return;

I find it a bit cleaner but it's a matter of personal preference.
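Applied to the kernel quoted above, the guard-clause style looks like this (a fragment reusing the PR's `kParam` and `global_thread_idx` helpers; the body is elided):

```cuda
template <typename T>
__global__ void self_row_argmin_sequential(kParam<int> _res, kParam<T> _val) {
  size_t idx = global_thread_idx();
  if (idx >= _val.rows) return;  // out-of-range threads exit immediately
  // ... per-row argmin body, now at top level instead of nested in an if ...
}
```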

// ArgMin operation that excludes 0. Used when dealing with distance_pairs
// in recluster where distance between points with itself is calculated,
// hence the name.
// FIXME: Maybe generalize it to selection algorithm.


These two kernels look very similar. Consider generalising them. Less code is better code. Also have a look at this: https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html

Cub is another really great cuda library.
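A sketch of how the segmented reduce could replace per-row argmin kernels, treating each row of a row-major m x n matrix as one segment. This is an illustration of the cub API, not code from the PR; the function name and the float specialization are assumptions. `d_offsets` holds m + 1 entries {0, n, 2n, ..., m*n}, and `d_out` receives one (index, value) pair per row, with the index relative to the row start:

```cuda
#include <cub/cub.cuh>

void rows_argmin(const float *d_val, const int *d_offsets, int m,
                 cub::KeyValuePair<int, float> *d_out) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call only reports the temporary storage size.
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_val, d_out,
                                     m, d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  // Second call performs the per-row argmin.
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_val, d_out,
                                     m, d_offsets, d_offsets + 1);
  cudaFree(d_temp);
}
```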

Collaborator Author

I will look into it. Thanks for the suggestion. Indeed, those extra kernels are quite annoying.

void *d_temp_storage = NULL;

// determine the temp_storage_bytes
safe_cuda(cub::DeviceHistogram::HistogramEven(


Nice to see you using cub for this.


// Sort the indices by weights in ascending order, then use those at front
// as result.
thrust::sort_by_key(thrust::device,


Be aware that this function makes a very large temporary memory allocation. Sometimes it can be better to call cub directly because of this. Another way of dealing with this is to create a custom allocator for thrust algorithms that reuses the same memory buffer.
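The direct-cub alternative mentioned here can be sketched as below (an illustration with assumed buffer names, not code from the PR). Unlike `thrust::sort_by_key`, cub exposes its scratch allocation, so a caller can size the buffer once and reuse it across iterations:

```cuda
#include <cub/cub.cuh>

void sort_indices_by_weight(const float *d_keys_in, float *d_keys_out,
                            const int *d_vals_in, int *d_vals_out, int n) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call reports how much scratch space the sort needs.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaMalloc(&d_temp, temp_bytes);  // could come from a reusable pool
  // Second call sorts values by key in ascending order.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaFree(d_temp);
}
```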

Collaborator Author

I have been thinking about a custom memory management facility, but that might not be necessary for this function alone, since it sorts the sampled centroids as one of the final steps. The sampled centroids should be far fewer than the data points; if in practice this assumption doesn't hold, then the kmeans|| algorithm might be a bust.
I will look into the cub solution.

return code;
}


extern cudaStream_t cuda_stream[MAX_NGPUS];


I don't like this global variable. Consider making this a member of the GPUInfo singleton or removing if it's no longer used.

Collaborator Author

That's not part of kmeans||. If I change it, it might be in another PR that deals with kmeans in general. Is that acceptable to you?

* @param _val The input matrix with shape m x n
*/
template <typename T>
__global__ void row_min_sequential(kParam<T> _res, kParam<T> _val) {


These two kernels seem to be duplicated in src/gpu/kmeans/kmeans_init.cu. Did you move some code around and forget to remove it?

Collaborator Author

Not exactly. These two deal with min and argmin and were my original attempt. But then the zero-distance bug surfaced, so I created two similar kernels in kmeans_init.cu that deal specifically with 0 values. The former pair is not used in kmeans, and the latter pair is not generic enough to be put in the matrix. I will look into the cub function you suggested; hopefully we can remove the latter two.

}

template <typename T>
struct UniformGenerator : public GeneratorBase<T> {


A few things:

  • I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.
  • We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.
  • I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?

Collaborator Author

I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.

No problem, it will be changed.

We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.

Makes sense, it will be changed.

I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?

If we ever need to run on CPU, the header is still useful: we could create a CPU generator and put this file in src/common. That's something for the future; is that acceptable to you?

template <typename T>
std::ostream& operator<<(std::ostream& os, KmMatrix<T>& m);

template <typename T>


What is this proxy class used for?

Collaborator Author

Originally I used it to represent an extracted row. Indeed, it now seems redundant since copying is inevitable. You are right, I should remove it later, but preferably not in this PR.

@@ -0,0 +1,358 @@
/*!


Again I don't think we need to have hpp/cuh and cpp/cu files. We just need the cuda files.

Collaborator Author

This cpp file is for KmMatrix, not any specific KmMatrixImpl. We bridge KmMatrix to KmMatrixImpl (CudaKmMatrixImpl specifically) in this file; it's a generic frontend.

@RAMitchell

We can switch to doing some work on xgboost after you have dealt with the review issues until the continuous integration problem gets resolved.

@mdymczyk
Contributor

@trivialfis @RAMitchell just had a quick glance, changes look really good! Will do a proper review over the weekend.

@RAMitchell by CI issues do you mean @trivialfis not being able to see the logs? Would a log emailed when the build fails be sufficient? I remember adding that a long time ago, but maybe someone changed it and it needs to be added back in.

@trivialfis
Collaborator Author

@mdymczyk Thanks for spending extra time on this. I haven't received any mail from the CI yet.

Successfully merging this pull request may close these issues.

KMeans|| on GPU