CUDA DeviceFree: out of memory error when building AresDB first time #342

Open · alxmrr opened this issue Nov 13, 2019 · 8 comments

alxmrr commented Nov 13, 2019

Describe the issue
When running 'make run_server' to build version 0.0.2, the build fails after a few minutes with a 'DeviceFree: out of memory' error. I am using a new server with no other processes running.

Reproduce the issue
NVIDIA driver version: 390.48
CUDA version: release 9.1, V9.1.85
Go version: 1.13
gcc version: 5.4.0
cmake version: 3.15.4

Follow the instructions to compile AresDB version 0.0.2 via 'make run_server'.

Error message
[ 15%] Built target mem
[100%] Built target algorithm
[100%] Built target lib
[100%] Built target aresd
Using config file:  config/ares.yaml
{"level":"info","msg":"Bootstrapping service","config":{"Port":9374,"DebugPort":43202,"RootPath":"ares-root","TotalMemorySize":161061273600,"SchedulerOff":false,"Version":"","Env":"","Query":{"DeviceMemoryUtilization":0.95,"DeviceChoosingTimeout":10,"TimezoneTable":{"TableName":"api_cities"},"EnableHashReduction":false},"DiskStore":{"WriteSync":true},"HTTP":{"MaxConnections":300,"ReadTimeOutInSeconds":20,"WriteTimeOutInSeconds":300},"RedoLogConfig":{"DiskConfig":{"Disabled":false},"KafkaConfig":{"Enabled":false,"Brokers":null,"TopicSuffix":""},"DiskOnlyForUnsharded":false},"Cluster":{"Enable":false,"Distributed":false,"Namespace":"","InstanceID":"","Controller":{"Address":"localhost:6708","Headers":null,"TimeoutSec":0},"Etcd":{"Zone":"local","Env":"dev","Service":"ares-datanode","CacheDir":"","ETCDClusters":[{"Zone":"local","Endpoints":["127.0.0.1:2379"],"KeepAlive":null,"TLS":null}],"SDConfig":{"InitTimeout":null},"WatchWithRevision":0},"HeartbeatConfig":{"Timeout":10,"Interval":1}}}}
panic: ERROR when calling CUDA functions: DeviceFree: out of memory
 
goroutine 1 [running]:
github.com/uber/aresdb/utils.StackError(0x0, 0x0, 0xc00004e040, 0x3d, 0x0, 0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/utils/error.go:61 +0x3f9
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xa7
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
       /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
 
goroutine 1 [running]:
github.com/uber/aresdb/cgoutils.DoCGoCall(0xc0005b2e18, 0xc0004a44d0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/utils.go:31 +0xc1
github.com/uber/aresdb/cgoutils.doCGoCall(0xc0005b2e48, 0x1)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:188 +0x49
github.com/uber/aresdb/cgoutils.DeviceFree(0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cgoutils/memory.go:111 +0x5c
github.com/uber/aresdb/cmd/aresd/cmd.start(0x249e, 0xa8c2, 0xc0005660c0, 0x9, 0x2580000000, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:103 +0x1c2
github.com/uber/aresdb/cmd/aresd/cmd.Execute.func1(0xc00038e000, 0x1e39648, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:85 +0x13d
github.com/spf13/cobra.(*Command).execute(0xc00038e000, 0xc00003c1d0, 0x0, 0x0, 0xc00038e000, 0xc00003c1d0)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:830 +0x2aa
github.com/spf13/cobra.(*Command).ExecuteC(0xc00038e000, 0xc0004a2050, 0x5, 0x134fe40)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:914 +0x2fb
github.com/spf13/cobra.(*Command).Execute(...)
        /nvme1n1/go1/pkg/mod/github.com/spf13/cobra@v0.0.5/command.go:864
github.com/uber/aresdb/cmd/aresd/cmd.Execute(0x0, 0x0, 0x0)
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/cmd/cmd.go:95 +0x229
main.main()
        /nvme1n1/go1/src/github.com/uber/aresdb/cmd/aresd/main.go:20 +0x32
CMakeFiles/run_server.dir/build.make:57: recipe for target 'CMakeFiles/run_server' failed
make[3]: *** [CMakeFiles/run_server] Error 2
CMakeFiles/Makefile2:467: recipe for target 'CMakeFiles/run_server.dir/all' failed
make[2]: *** [CMakeFiles/run_server.dir/all] Error 2
CMakeFiles/Makefile2:474: recipe for target 'CMakeFiles/run_server.dir/rule' failed
make[1]: *** [CMakeFiles/run_server.dir/rule] Error 2
Makefile:298: recipe for target 'run_server' failed
make: *** [run_server] Error 2

shz117 (Contributor) commented Nov 13, 2019

What's the output of nvidia-smi in your environment?

alxmrr (Author) commented Nov 14, 2019

Output of nvidia-smi:
 
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 390.48                 Driver Version: 390.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:15:00.0 Off |                    0 |
| N/A   35C    P0    46W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:16:00.0 Off |                    0 |
| N/A   35C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla V100-SXM2...  On   | 00000000:3A:00.0 Off |                    0 |
| N/A   33C    P0    44W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla V100-SXM2...  On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla V100-SXM2...  On   | 00000000:89:00.0 Off |                    0 |
| N/A   34C    P0    41W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla V100-SXM2...  On   | 00000000:8A:00.0 Off |                    0 |
| N/A   36C    P0    43W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla V100-SXM2...  On   | 00000000:B2:00.0 Off |                    0 |
| N/A   34C    P0    43W / 300W |      6MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla V100-SXM2...  On   | 00000000:B3:00.0 Off |                    0 |
| N/A   35C    P0    42W / 300W |      6MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
 
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    1      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    2      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    3      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    4      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    5      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    6      3413      G   /usr/lib/xorg/Xorg                             5MiB |
|    7      3413      G   /usr/lib/xorg/Xorg                             5MiB |
+-----------------------------------------------------------------------------+

shz117 (Contributor) commented Nov 15, 2019

That's weird... the error happens during initialization; AresDB has not copied anything to device memory yet.

It looks like you were running on bare metal (without Docker), but I was not able to reproduce: I ran the same make rule with the same CUDA and driver versions and was able to start the server properly.

Several things I would try:

  1. See if you can run a sample CUDA app (a minimal standalone check is sketched below this list).
  2. Try again after killing the Xorg processes (although it shouldn't matter).
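
For step 1, here is a minimal sketch of such a check (an assumption on my part: this is standalone code, not part of AresDB, and AresDB's DeviceFree presumably wraps cudaFree). It allocates and frees device memory on every visible GPU, which is roughly the path the panic points at:

#include <cstdio>
#include <cuda_runtime.h>

// Standalone sanity check (not AresDB code): allocate and free device
// memory on every visible GPU, reporting any CUDA error.
// Build with: nvcc -o devcheck devcheck.cu   (file name is just an example)
int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        cudaSetDevice(i);
        void *ptr = NULL;
        // 1 GiB allocation; adjust downward if your devices are smaller.
        err = cudaMalloc(&ptr, 1ULL << 30);
        if (err != cudaSuccess) {
            printf("device %d: cudaMalloc failed: %s\n", i, cudaGetErrorString(err));
            continue;
        }
        err = cudaFree(ptr);
        printf("device %d: alloc/free %s\n", i,
               err == cudaSuccess ? "OK" : cudaGetErrorString(err));
    }
    return 0;
}

If this fails on any device, the problem is likely in the driver/runtime setup rather than in AresDB.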

alxmrr (Author) commented Nov 20, 2019

I tried a number of CUDA samples, including bandwidthTest, deviceQuery, histogram, and more, all without error.

Linux version: Ubuntu 16.04.6 LTS

Could the trouble be related to the Linux version? Is there a recommended NVIDIA driver and CUDA version for Ubuntu 16.04.6 LTS?

shz117 (Contributor) commented Nov 20, 2019

Not likely. I tested on the same Linux version.

alxmrr (Author) commented Nov 21, 2019

Still having trouble. Has anyone encountered this error from 'make test-cuda'?
 
[----------] 4 tests from UnaryTransformTest
[ RUN      ] UnaryTransformTest.CheckInt
Exception happend when doing UnaryTransform:parallel_for failed: invalid device function
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  parallel_for failed: invalid device function
Aborted (core dumped)
CMakeFiles/test-cuda.dir/build.make:60: recipe for target 'CMakeFiles/test-cuda' failed
make[3]: *** [CMakeFiles/test-cuda] Error 134
make[3]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:388: recipe for target 'CMakeFiles/test-cuda.dir/all' failed
make[2]: *** [CMakeFiles/test-cuda.dir/all] Error 2
make[2]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
CMakeFiles/Makefile2:395: recipe for target 'CMakeFiles/test-cuda.dir/rule' failed
make[1]: *** [CMakeFiles/test-cuda.dir/rule] Error 2
make[1]: Leaving directory '/nvme0n1/go2/src/github.com/uber/aresdb'
Makefile:262: recipe for target 'test-cuda' failed
make: *** [test-cuda] Error 2

alxmrr (Author) commented Nov 22, 2019

I just realized I didn't mention this before, but I am running in GPU mode, so I am running the cmake command 'cmake -DQUERY_MODE=DEVICE'. I am considering the Docker implementation as well, but I saw that its cmake command does not specify QUERY_MODE. Does the Docker version run in GPU or CPU mode?

shz117 (Contributor) commented Dec 3, 2019

> I just realized I didn't mention this before, but I am running in GPU mode, so I am running the cmake command 'cmake -DQUERY_MODE=DEVICE'. I am considering the Docker implementation as well, but I saw that its cmake command does not specify QUERY_MODE. Does the Docker version run in GPU or CPU mode?

QUERY_MODE=DEVICE is the one you want to set on a GPU machine.
If QUERY_MODE is missing, the makefile will set it based on whether the build machine has a GPU card (here).
The Docker version should be in GPU mode because it's built on top of nvidia-docker.
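
For illustration only (this is not the actual makefile logic, which is linked above, just a sketch of the equivalent check), GPU presence can be detected programmatically via the CUDA runtime:

#include <cstdio>
#include <cuda_runtime.h>

// Illustrative sketch (assumption: mirrors the intent of the makefile's
// GPU check, not its implementation): detect whether any CUDA device is
// usable, which is what the QUERY_MODE default effectively depends on.
int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        printf("no usable GPU; the build would fall back to CPU mode\n");
        return 1;
    }
    printf("%d GPU(s) detected; QUERY_MODE=DEVICE is appropriate\n", count);
    return 0;
}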
