We are interested in GPU-accelerating the SoftImpute algorithm from the paper by Mazumder, Hastie, and Tibshirani. It is an SVD-based collaborative filtering algorithm.

We use randomized SVD as it seems to work better with GPUs. We also use the accelerated proximal gradient method described in the paper by Quanming Yao and James T. Kwok. See below for references.
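As a rough illustration of the two ingredients named above, here is a minimal sketch of the soft-thresholding step applied to singular values and of the extrapolation step used by accelerated proximal gradient. The function names `SoftThreshold` and `Extrapolate` are hypothetical and not taken from this repository, and the momentum schedule (t - 1) / (t + 2) is a common choice that we assume here for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Soft-threshold a list of singular values: sigma_i -> max(sigma_i - lambda, 0).
// This is the proximal operator of lambda * (nuclear norm), applied to the spectrum.
std::vector<double> SoftThreshold(const std::vector<double>& sigma, double lambda) {
  assert(lambda >= 0.0);
  std::vector<double> out(sigma.size());
  for (std::size_t i = 0; i < sigma.size(); ++i) {
    out[i] = std::max(sigma[i] - lambda, 0.0);
  }
  return out;
}

// Extrapolation step of accelerated proximal gradient:
//   y = x_cur + theta * (x_cur - x_prev)
// ASSUMPTION: theta = (t - 1) / (t + 2) for iteration t >= 1.
std::vector<double> Extrapolate(const std::vector<double>& x_cur,
                                const std::vector<double>& x_prev, int t) {
  assert(x_cur.size() == x_prev.size());
  assert(t >= 1);
  const double theta = static_cast<double>(t - 1) / (t + 2);
  std::vector<double> y(x_cur.size());
  for (std::size_t i = 0; i < x_cur.size(); ++i) {
    y[i] = x_cur[i] + theta * (x_cur[i] - x_prev[i]);
  }
  return y;
}
```

For example, thresholding the singular values (3, 1.5, 0.5) at lambda = 1 gives (2, 0.5, 0), so the smallest component is dropped and the rank shrinks. In the un-accelerated variant the extrapolation is skipped, which corresponds to theta = 0.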

We compare these GPU-accelerated versions against the CPU version. We see that there is at least a 10X speedup. We also compare against a simple SGD implementation with a bit of tuning. To be fair, all CPU implementations, SoftImpute or SGD, use Eigen or BLAS and are pretty efficient.

For the data, currently we only have the MovieLens 20M dataset. We split the data into 5 parts and use one part for measuring the error. (There is no validation set but we hardly do any tuning anyway.)
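The text does not say how the five parts are formed. One simple way, assumed here purely for illustration, is to assign each rating to a fold at random while keeping the fold sizes balanced. The function name `AssignFolds` and the `seed` parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Assign each of num_ratings ratings to one of num_folds folds.
// Fold labels are laid out round-robin (so sizes differ by at most one)
// and then shuffled with a seeded generator.
std::vector<int> AssignFolds(std::size_t num_ratings, int num_folds, unsigned seed) {
  assert(num_folds > 0);
  std::vector<int> fold(num_ratings);
  for (std::size_t i = 0; i < num_ratings; ++i) {
    fold[i] = static_cast<int>(i % static_cast<std::size_t>(num_folds));
  }
  std::mt19937 rng(seed);
  std::shuffle(fold.begin(), fold.end(), rng);
  return fold;
}
```

With five folds, the ratings labelled with one chosen fold would form the held-out part used for measuring the error, and the remaining four folds would form the training set.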

The results suggest that SGD is still faster, even on a single core.

Our CPU is an i7-6700K. Our GPU is a Titan X with 12 GB of memory.


  • The slowest is CPU-NoAcc which is the CPU version with un-accelerated proximal gradient.
  • The next slowest is CPU-Acc which is the CPU version with accelerated proximal gradient.
  • The GPU versions are all significantly faster, seemingly >10X. (Admittedly, we do have a fast GPU.)
  • However, SGD still seems to be the fastest.

The trickiest part of the GPU code is probably the evaluation of many short inner products. It seems to be the bottleneck, and we had to write a custom kernel to do it efficiently.
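To make the workload concrete, here is a CPU reference sketch of what such a computation might look like. We assume, without confirmation from the code, that the inner products in question are the predicted values at the observed entries: for every nonzero (i, j) of the sparse rating matrix stored in CSR form, the dot product of row i of a factor U with row j of a factor V, each of length `rank`. The names `CsrPattern` and `PredictObserved` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Sparsity pattern of the rating matrix in CSR form (values are not needed here).
struct CsrPattern {
  std::vector<int> row_ptr;  // size num_rows + 1
  std::vector<int> col_idx;  // size nnz
};

// u is num_rows x rank and v is num_cols x rank, both row-major.
// Returns one inner product <u_i, v_j> per stored entry, in CSR order.
std::vector<float> PredictObserved(const CsrPattern& a, const std::vector<float>& u,
                                   const std::vector<float>& v, int rank) {
  assert(!a.row_ptr.empty() && rank > 0);
  const std::size_t num_rows = a.row_ptr.size() - 1;
  const std::size_t k = static_cast<std::size_t>(rank);
  assert(u.size() == num_rows * k);
  std::vector<float> out(a.col_idx.size());
  for (std::size_t i = 0; i < num_rows; ++i) {
    const float* ui = u.data() + i * k;
    for (int p = a.row_ptr[i]; p < a.row_ptr[i + 1]; ++p) {
      const std::size_t j = static_cast<std::size_t>(a.col_idx[p]);
      assert((j + 1) * k <= v.size());
      const float* vj = v.data() + j * k;
      // Each inner product is short (length = rank) and independent of the others.
      out[static_cast<std::size_t>(p)] = std::inner_product(ui, ui + k, vj, 0.0f);
    }
  }
  return out;
}
```

Each output entry is independent, so the work parallelizes naturally, but every individual dot product is too short to keep a GPU busy on its own. That is one plausible reason a generic BLAS call is a poor fit and a hand-written kernel helps.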

Sample run command


./impute_main.o \
--output_filename=/tmp/a.txt \
--train_filename=$DIR/train_1.csr \
--train_t_filename=$DIR/train_1.t.csr \
--test_filename=$DIR/validate_1.csr \
--train_perm_filename=$DIR/train_1.perm \
--use_gpu=true \
--max_time=60 \
--log_every_sec=5 \
--soft_threshold=true


Install CUDA 8.0

We do not use the one from Synaptic. If you have it installed, remove it. Otherwise it might cause conflicts.

Run the two installers. The second one is the patch. We install CUDA to /usr/local/cuda.

  • The libs are in /usr/local/cuda/lib64. Add that to LD_LIBRARY_PATH.
  • The binaries are in /usr/local/cuda/bin. Add that to PATH.
  • The headers are in /usr/local/cuda/include. Add that to the include path in the makefile.

Check nvcc:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Tue_Jan_10_13:22:03_CST_2017
Cuda compilation tools, release 8.0, V8.0.61

Try out the sample programs. Run ldd on them to make sure they are using the right CUDA library.

Install OpenBLAS

We like to use locally compiled libs.

Download OpenBLAS and build it. By default it installs to /opt/OpenBLAS. Make sure all the interfaces are included:


Install MAGMA

At the bottom of, add in

CUDADIR = /usr/local/cuda-8.0

Then make and make install. Check out some of the test programs. Here is an example.

Install gtest

Download from Use cmake and make install.

Install glog, gflags

Just use Synaptic.


References

On applying accelerated proximal gradient method to SoftImpute by Quanming Yao, James T. Kwok.

Spectral regularization algorithms for learning large incomplete matrices by Mazumder, Hastie, Tibshirani.