
Kmeans|| #650

Open. Wants to merge 49 commits into base: master.
Conversation

trivialfis
Collaborator

@trivialfis trivialfis commented Jul 20, 2018

This is still WIP, for early review.

TODOs:

  1. Make the algorithm work as expected.
  2. Add proper tests.
  3. Make the algorithm work with k-means.
  4. Factor out operations.
  5. Fix and rebase build-system.
  6. Merge tSVD matrix with KmMatrix.

The above, plus tuning the algorithm described in #642.

close #642

@trivialfis trivialfis requested a review from mdymczyk July 20, 2018 09:12
@trivialfis
Collaborator Author

All tests for KmMatrix should pass.

@RAMitchell RAMitchell left a comment

The fact that this PR also contains your other PR makes it harder to review in isolation. In future it's best to keep them separate, although that is not always practical. In this case we will have to review and merge your build-system PR, then rebase this one, review it again, and then hopefully merge.

Also try not to commit formatting changes as they also make review difficult.

@@ -0,0 +1,95 @@
#include <thrust/random.h>


I am not sure if you need all of this code for generating random numbers. Consider using thrust:
https://gist.github.com/ashwin/7245048

Thrust is included by default with the cuda toolkit and will allow you to do a lot of things much more easily in a C++ 11 style with high performance.
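For reference, the approach in the linked gist can be sketched roughly as below. This is a hedged illustration, not code from this PR; the functor name and the [0, 1) range are made up for the example. Each thread seeds its own engine and skips ahead by its index, which avoids any curand dependency:

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/random.h>

// Each index gets its own engine; discard(idx) decorrelates the streams,
// so generation is deterministic and fully parallel.
struct uniform_rng {
  float lo, hi;
  uniform_rng(float _lo, float _hi) : lo(_lo), hi(_hi) {}
  __device__ float operator()(unsigned int idx) const {
    thrust::default_random_engine rng;
    thrust::uniform_real_distribution<float> dist(lo, hi);
    rng.discard(idx);  // skip ahead to this thread's subsequence
    return dist(rng);
  }
};

thrust::device_vector<float> make_uniform(size_t n) {
  thrust::device_vector<float> out(n);
  thrust::transform(thrust::counting_iterator<unsigned int>(0),
                    thrust::counting_iterator<unsigned int>(n),
                    out.begin(), uniform_rng(0.0f, 1.0f));
  return out;
}
```

Note that `discard` is cheap for thrust's default engine, so this scales well even for large n.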

Collaborator Author

It just occurred to me that curand might help gain a little performance and is easy enough to use. But your example is much cleaner. I will try to integrate it and run some benchmarks on a larger dataset. If the gain is trivial, I will just remove the curand dependency.

#define M_DEV __device__
#define M_DEVINLINE __device__ __forceinline__
#define M_HOSTDEVINLINE __host__ __device__ __forceinline__


This cuda check macro is duplicated here:

// TODO(pseudotensor): Avoid throw for python exception handling. Need to avoid all exit's and return exit code all the way back.

And here:

#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)

I suggest you use the one in utils and delete the one in kmeans. My personal preference is to use the variant of error checking that throws an exception. This gives you more control of how you want to program to exit. We should use the same method throughout H2O4GPU.
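As a sketch of the throwing variant being recommended: only the `safe_cuda` and `throw_on_cuda_error` names come from the quoted macro; the body below is an illustrative reconstruction, not necessarily the exact utils implementation.

```cuda
#include <cuda_runtime.h>
#include <sstream>
#include <stdexcept>

// Throwing CUDA error check: callers (including Python bindings) can catch
// the exception and decide how to exit, instead of the process aborting.
inline cudaError_t throw_on_cuda_error(cudaError_t code,
                                       const char *file, int line) {
  if (code != cudaSuccess) {
    std::ostringstream ss;
    ss << file << "(" << line << "): " << cudaGetErrorString(code);
    throw std::runtime_error(ss.str());
  }
  return code;
}

#define safe_cuda(ans) throw_on_cuda_error((ans), __FILE__, __LINE__)
```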

Collaborator Author

I intend to remove all of this utils code in KmMatrix later; sorry for not stating that. I just don't want to pollute the rest of the repo at this stage. The namespace should also be consistent with the rest of the project, though I haven't figured out how to do that yet.

@@ -0,0 +1,311 @@
/*!


I can see you have implemented KmMatrix using the pimpl design pattern. We did this in xgboost for HostDeviceVector only because it had to compile for CPU only as well as with GPU. This seems unnecessarily complicated for this project. I will check with the others if we need to maintain this requirement of compiling for CPU only.

After looking through your code your matrix class seems very similar in function to https://github.com/h2oai/h2o4gpu/blob/master/src/gpu/data/matrix.cuh

Please explain why you can't use this.
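For readers unfamiliar with the pattern under discussion, a minimal pimpl sketch follows. The names are hypothetical, not the actual KmMatrix API; the point is that the public type forwards to an implementation type, keeping backend (e.g. CUDA) details out of the header at the cost of an extra indirection.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Implementation type: in a real pimpl split this would live in a .cu/.cpp
// file, and the header would only forward-declare it.
class MatrixImpl {
 public:
  MatrixImpl(size_t rows, size_t cols) : data_(rows * cols), rows_(rows) {}
  size_t rows() const { return rows_; }
 private:
  std::vector<float> data_;
  size_t rows_;
};

// Public type: its only data member is a pointer to the implementation,
// and every method forwards through it.
class Matrix {
 public:
  Matrix(size_t rows, size_t cols)
      : impl_(std::make_shared<MatrixImpl>(rows, cols)) {}
  size_t rows() const { return impl_->rows(); }
 private:
  std::shared_ptr<MatrixImpl> impl_;
};
```

The extra boilerplate per method is the complexity cost the reviewer is pointing at.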

Collaborator Author

For starters, the matrix in data doesn't deal with CPU, and expanding it to support CPU would amount to a rewrite. But now that dropping CPU support completely is possible, this seems feasible.

Device context is an explicit parameter for every function, but not all functions use it. That's fine if we just want a matrix that works for the current implementation. But if we want a generic data structure that encapsulates the complexity, the matrix in data might also need a big change.

Certainly, given all the time I spent writing KmMatrix and the few hundred lines of code in that matrix, the above could be done. The real difficulty is that there are no C++ unit tests, so I would need to rely on the Python tests, which are more like integration tests. I tried running the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on the existing tests. This also contributes to why I re-implemented those utils.

I hope we can discuss this more lengthily.


I don't actually mind which direction you come at it from but in the end these two matrix representations must be unified. I think we still want to support matrices with CPU memory, but it is no longer a requirement that the code must compile without cuda.

Device context is an explicit parameter for every function, but not all functions use it.

You could possibly remove this and rely on your Singleton class to get things like the cublas handle inside these functions.
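A hypothetical sketch of that suggestion, using the GpuInfo name from this PR (the body is illustrative, not the PR's actual class): a Meyers singleton owning the cuBLAS handle means device helpers no longer need a context parameter threaded through every signature.

```cuda
#include <cublas_v2.h>

class GpuInfo {
 public:
  static GpuInfo &ins() {
    static GpuInfo instance;  // constructed once, on first use
    return instance;
  }
  cublasHandle_t cublas_handle() const { return handle_; }
  GpuInfo(const GpuInfo &) = delete;
  GpuInfo &operator=(const GpuInfo &) = delete;
 private:
  GpuInfo() { cublasCreate(&handle_); }
  ~GpuInfo() { cublasDestroy(handle_); }
  cublasHandle_t handle_;
};

// Inside a function that previously took a context parameter:
//   cublasHandle_t h = GpuInfo::ins().cublas_handle();
```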

The real difficulty is that there are no C++ unit tests, so I would need to rely on the Python tests, which are more like integration tests. I tried running the complete Python test suite from the master branch on my machine and only a fraction of the tests passed. After that I dropped the idea of relying on the existing tests.

This is worrying. Perhaps it's a difference in your environment. We will see when we sort out continuous integration.

Collaborator Author

in the end these two matrix representations must be unified

That's the plan, a unified matrix.


namespace H2O4GPU {
namespace KMeans {


These thread index helpers are not specific to kmeans and so should go in a common utilities file shared by all algorithms.



// Work around for shared memory
// https://stackoverflow.com/questions/20497209/getting-cuda-error-declaration-is-incompatible-with-previous-variable-name

This should be named KmSharedMem instead of KmShardMem as the word "shard" can have a different meaning in the context of GPU programming. Although we do not seem to have an explicit code style for this project, Google style recommends not using abbreviated words in variable names. So this would then be: KmSharedMemory.

As above I also don't think this code is specific to kmeans so should go in a shared utilities file.

Collaborator Author

Will be changed.

@@ -0,0 +1,76 @@
/*!


This GPU info class is not specific to kmeans. Check how the other algorithms are getting this information then consider moving into a common location and making the other algorithms also use this class. Ideally we want things done consistently across the code base. This makes it much easier to maintain.

@@ -1,37 +1,37 @@
/*!


This file seems to be all formatting changes. I can't see whether you have actually changed anything. It is better to commit only actual changes and deal with formatting issues in a separate commit or PR.

I will take another look at this if you revert the formatting changes.

Collaborator Author

There are no changes in it; I kept my code isolated from the rest of the project, and my hope is to remove this file. Sorry for the mess.

@trivialfis
Collaborator Author

trivialfis commented Jul 22, 2018

I will at least get the CMake part of the build system merged. I will work on it as soon as I can see the Jenkins report.

The build is performed by cmake. Add missing <numeric> header and use
cblas/cblas.h instead of cblas.h.
* Add test target to CMakeLists.txt.
* Fix include path in cu files in tests/cpp/gpu/kmeans.
* Make SWIG an optional dependency.
@trivialfis trivialfis force-pushed the kmeansll branch 2 times, most recently from 61be357 to 2822872 Compare July 25, 2018 09:11
Using Eigen could make KMeans work, but when it comes to expanding the
clusters the memory copies are expensive. It's also hard to launch
separate kernels without extra memory copies with Eigen.

So, proceed with a home brew matrix.
* Split proxy implementation into cpp file to save some compile time.
* Fix some memory error.
* Add two tests.
* Checking code is factored into KmConfig
* GpuInfo now takes care of cublas_handle;
* MatrixDim for later use.
Builds basic kmeans|| framework on top of KmMatrix. The algorithm is
not working yet.
* Fix copy_if prob_ptr index.
* Add document for the algorithm object interface.
* Move Generator into a member variable and complete its
implementation.
* Removes dead code.
* Create Generatorbase class for interface.
* Move generator related code into KmMatrix.
* Move shared memory related code into KmMatrix.
* Create mock object based on Generatorbase for later test.
@trivialfis
Collaborator Author

@RAMitchell Ready for another review. \O/ This time I rebased on a build-system branch that contains only CMake file changes; hopefully better than last time.

The example of using thrust to generate random numbers is not implemented yet; the PR is getting way too heavy now.
I might need the build log to get the build system right. If that is not going to be available soon, then maybe we or someone else can handle the build later.
The code builds on Fedora 27 with CUDA 9.2; if there are errors on your local system, please send me the complete log.

@RAMitchell RAMitchell left a comment

Looks very good in general; there is a clear structure to your code and very good commenting. See the comments for a few issues. I am not going to build/test myself as I don't have time, and continuous integration should ideally be enforcing that this works on the required platforms.

@@ -0,0 +1,54 @@
/*!


This seems like a strange file name.

Collaborator Author

That is the original utils.h from h2o4gpu with some functions moved from the gpu code.
The *.h.in is for cmake configuration, specifically the USE_CUDA flag. I know compiling without CUDA is not supported, but I tried to retain the possibility anyway. If that's a burden, I can remove it in no time.

__global__ void self_row_argmin_sequential(kParam<int> _res, kParam<T> _val) {

size_t idx = global_thread_idx();
if (idx < _val.rows) {


Consider doing this: if (idx >= _val.rows) return;

I find it a bit cleaner but it's a matter of personal preference.
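Applied to the kernel quoted above, the guard-clause style looks like this (a fragment reusing the PR's `kParam` and `global_thread_idx` helpers; the body is elided):

```cuda
template <typename T>
__global__ void self_row_argmin_sequential(kParam<int> _res, kParam<T> _val) {
  size_t idx = global_thread_idx();
  if (idx >= _val.rows) return;  // out-of-range threads exit immediately
  // ... per-row argmin body, now at top level instead of nested in an if ...
}
```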

// ArgMin operation that excludes 0. Used when dealing with distance_pairs
// in recluster where distance between points with itself is calculated,
// hence the name.
// FIXME: Maybe generalize it to selection algorithm.


These two kernels look very similar. Consider generalising them. Less code is better code. Also have a look at this: https://nvlabs.github.io/cub/structcub_1_1_device_segmented_reduce.html

Cub is another really great cuda library.
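A sketch of how the segmented reduce could replace per-row argmin kernels, treating each row of a row-major m x n matrix as one segment. This is an illustration of the cub API, not code from the PR; the function name and the float specialization are assumptions. `d_offsets` holds m + 1 entries {0, n, 2n, ..., m*n}, and `d_out` receives one (index, value) pair per row, with the index relative to the row start:

```cuda
#include <cub/cub.cuh>

void rows_argmin(const float *d_val, const int *d_offsets, int m,
                 cub::KeyValuePair<int, float> *d_out) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call only reports the temporary storage size.
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_val, d_out,
                                     m, d_offsets, d_offsets + 1);
  cudaMalloc(&d_temp, temp_bytes);
  // Second call performs the per-row argmin.
  cub::DeviceSegmentedReduce::ArgMin(d_temp, temp_bytes, d_val, d_out,
                                     m, d_offsets, d_offsets + 1);
  cudaFree(d_temp);
}
```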

Collaborator Author

I will look into it. Thanks for the suggestion. Indeed, those extra kernels are quite annoying.

void *d_temp_storage = NULL;

// determine the temp_storage_bytes
safe_cuda(cub::DeviceHistogram::HistogramEven(


Nice to see you using cub for this.


// Sort the indices by weights in ascending order, then use those at front
// as result.
thrust::sort_by_key(thrust::device,


Be aware that this function makes a very large temporary memory allocation. Sometimes it can be better to call cub directly because of this. Another way of dealing with this is to create a custom allocator for thrust algorithms that reuses the same memory buffer.
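The direct-cub alternative mentioned here can be sketched as below (an illustration with assumed buffer names, not code from the PR). Unlike `thrust::sort_by_key`, cub exposes its scratch allocation, so a caller can size the buffer once and reuse it across iterations:

```cuda
#include <cub/cub.cuh>

void sort_indices_by_weight(const float *d_keys_in, float *d_keys_out,
                            const int *d_vals_in, int *d_vals_out, int n) {
  void *d_temp = nullptr;
  size_t temp_bytes = 0;
  // First call reports how much scratch space the sort needs.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaMalloc(&d_temp, temp_bytes);  // could come from a reusable pool
  // Second call sorts values by key in ascending order.
  cub::DeviceRadixSort::SortPairs(d_temp, temp_bytes, d_keys_in, d_keys_out,
                                  d_vals_in, d_vals_out, n);
  cudaFree(d_temp);
}
```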

Collaborator Author

I have been thinking about a custom memory management facility, but that might not be necessary for this function alone, since it sorts the sampled centroids as one of the final steps. The sampled centroids should be far fewer than the data points; if in practice this assumption doesn't hold, then the kmeans|| algorithm might be a bust.
I will look into the cub solution.

return code;
}


extern cudaStream_t cuda_stream[MAX_NGPUS];


I don't like this global variable. Consider making this a member of the GPUInfo singleton or removing if it's no longer used.

Collaborator Author

That's not part of kmeans||. If I change it, it might be in another PR that deals with kmeans in general. Is that acceptable to you?

* @param _val The input matrix with shape m x n
*/
template <typename T>
__global__ void row_min_sequential(kParam<T> _res, kParam<T> _val) {


These two kernels seem to be duplicated in src/gpu/kmeans/kmeans_init.cu. Did you move some code around and forget to remove it?

Collaborator Author

Not exactly. These two deal with min and argmin and were my original attempt. But then the zero-distance bug surfaced, so I created two similar kernels in kmeans_init.cu that deal specifically with 0 values. The former pair is not used in kmeans, and the latter pair is not generic enough to be put in the matrix. I will look into the cub function you suggested; hopefully we can remove the latter two.

}

template <typename T>
struct UniformGenerator : public GeneratorBase<T> {


A few things:

  • I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.
  • We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.
  • I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?

Collaborator Author

I think this should be named UniformRandomGenerator. Generator by itself does not imply that it generates random numbers.

No problem, it will be changed.

We should definitely remove the curand dependency if possible. Extra dependencies heavily impact maintainability.

Makes sense, it will be changed.

I think src/gpu/matrix/KmMatrix/Generator.hpp is unnecessary now that we have established we don't need to compile without cuda. Perhaps just put its contents in this file?

If we ever need to run on CPU, the header is still useful: we could create a CPU generator and put this file in src/common. That's something for the future; is that acceptable to you?

template <typename T>
std::ostream& operator<<(std::ostream& os, KmMatrix<T>& m);

template <typename T>


What is this proxy class used for?

Collaborator Author

Originally I used it to represent an extracted row. Indeed, it now seems redundant since copying is inevitable. You are right, I should remove it later, but preferably not in this PR.

@@ -0,0 +1,358 @@
/*!


Again I don't think we need to have hpp/cuh and cpp/cu files. We just need the cuda files.

Collaborator Author

This cpp file is for KmMatrix, not any specific KmMatrixImpl. We bridge KmMatrix to KmMatrixImpl (CudaKmMatrixImpl specifically) in this file; it's a generic frontend.

@RAMitchell

We can switch to doing some work on xgboost after you have dealt with the review issues until the continuous integration problem gets resolved.

@mdymczyk
Contributor

@trivialfis @RAMitchell just had a quick glance, changes look really good! Will do a proper review over the weekend.

@RAMitchell by CI issues do you mean @trivialfis not being able to see the logs? Would a log emailed when the build fails be sufficient? I remember adding that a long time ago, but maybe someone changed it and it needs to be added back in.

@trivialfis
Collaborator Author

@mdymczyk Thanks for spending extra time on this. I haven't received any mail from the CI yet.

Successfully merging this pull request may close these issues.

KMeans|| on GPU