Problem with conv4 on gpu #680

Open
andreasdominik opened this issue Jun 1, 2022 · 4 comments

@andreasdominik

Dear Deniz,

We are facing a strange problem with conv4 on the GPU.
The code

using Knet
x = rand(Float32, 224,224,3,4) |> CuArray
w = param(5,5,3,8)
conv4(w,x)

generates the error:

MethodError: no method matching similar(::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, ::Missing)

Stacktrace:
  [1] conv4_algo(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, y::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; handle::Ptr{Nothing}, o::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Knet.Ops20_gpu ~/.julia/packages/Knet/YIFWC/src/ops20_gpu/conv.jl:166
  [2] conv4(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}; handle::Ptr{Nothing}, alpha::Int64, o::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ Knet.Ops20_gpu ~/.julia/packages/Knet/YIFWC/src/ops20_gpu/conv.jl:9
  [3] conv4(w::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Knet.Ops20_gpu ~/.julia/packages/Knet/YIFWC/src/ops20_gpu/conv.jl:7
  [4] forw(::Function, ::Param{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}, ::Vararg{Any}; kwargs::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ AutoGrad ~/.julia/packages/AutoGrad/1QZxP/src/core.jl:66
  [5] forw
    @ ~/.julia/packages/AutoGrad/1QZxP/src/core.jl:65 [inlined]
  [6] #conv4#28
    @ ./none:0 [inlined]
  [7] conv4(w::Param{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}}, x::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ Knet.Ops20 ./none:0
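
The top frame shows `similar` being called with a `Missing` value in the position where a size would normally go. As a minimal, CPU-only sketch of that failure mode (using a plain `Array` instead of a `CuArray`, and not Knet's actual code path), the same kind of `MethodError` can be reproduced like this:

```julia
# CPU-only illustration; needs neither a GPU nor Knet.
x = zeros(Float32, 4, 4)

# `similar` accepts an element type and/or dimensions, but not `missing`.
# Capture the exception instead of letting it abort the script.
err = try
    similar(x, missing)
    nothing
catch e
    e
end

println(typeof(err))
```

This suggests that some size or memory query upstream of `conv4_algo` returned `missing` rather than a number.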

This is strange, because it happens on our Nvidia server but NOT on my local computer, which has identical installations of Julia and Knet.
The only difference is the CUDA driver, which is

Driver Version: 460.91.03 CUDA Version: 11.2

on the Nvidia machine (with the error) and Driver Version: 510.73.05 CUDA Version: 11.6 on my computer. Unfortunately, it is not easy to change the driver on the server, because people use a multitude of frameworks there.
Maybe (hopefully) you have an idea...

cordially (a)do

@andreasdominik
Author

andreasdominik commented Jun 1, 2022

... maybe I should add:
normal multiplication of CuArrays on the GPU works fine; i.e. CUDA is functional.

@andreasdominik
Author

O.K. - CUDA.cached_memory() is only available with drivers supporting CUDA >= 11.3.
That's it.
We will try to update - and get rid of TensorFlow, which cannot handle the new driver versions^^
Sorry for opening and closing the issue - but this way it is at least documented for the rest of the world...
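
If the `Missing` in the error does come from a memory query such as `CUDA.cached_memory()`, code that consumes the result can guard against it. The following is only a sketch of that pattern: `workspace_budget` and its 64 MiB default are hypothetical names and values chosen for illustration, not part of Knet or CUDA.jl, and this is not a patch for Knet itself.

```julia
# Hypothetical helper (not part of Knet or CUDA.jl): pick a byte budget,
# falling back to a fixed default when the memory query yields `missing`.
function workspace_budget(cached; fallback::Integer = 64 * 2^20)
    # `coalesce` returns its first non-missing argument.
    return Int(coalesce(cached, fallback))
end

# CPU-only usage with stand-in values:
budget_unknown = workspace_budget(missing)   # falls back to the default
budget_known   = workspace_budget(1024)      # passes the reported value through

# On a GPU machine one could pass the real query result instead (assumption):
#   using CUDA
#   workspace_budget(CUDA.cached_memory())
```

Whether a fixed fallback is appropriate depends on how the value is used downstream; the point is only that `missing` needs explicit handling before it reaches `similar`.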

@andreasdominik
Author

andreasdominik commented Jun 1, 2022

Closed, but it gets weirder:
I found a Docker container on the server with CUDA 11.0 - and it works:

using Knet
smi = `nvidia-smi`; run(smi)
x = rand(Float32, 24,24,3,4) |> CuArray
w = param(5,5,3,8)
size(conv4(w,x))

Wed Jun  1 13:28:34 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 450.191.01   Driver Version: 450.191.01   CUDA Version: 11.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:03:00.0 Off |                    0 |
| N/A   29C    P0    34W / 250W |  19301MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

(20, 20, 8, 4)

@andreasdominik
Author

Indeed - the longer I try to find out what's going on here, the more confused I get; maybe it is better to re-open it...

@andreasdominik reopened this Jun 1, 2022