
copyin() broken in ops_cuda #3943

Open
reddyn12 opened this issue Mar 26, 2024 · 8 comments

Comments

@reddyn12
Contributor

#3456 - repro
The method used in copyin() was switched from cuda.cuMemcpyHtoD_v2 to cuda.cuMemcpyHtoDAsync_v2.
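For context, the two driver calls differ in blocking behavior: cuMemcpyHtoD_v2 blocks until the copy completes and accepts ordinary pageable host memory, while cuMemcpyHtoDAsync_v2 returns immediately and needs a page-locked (cuMemHostAlloc) staging buffer that stays alive until the stream is synchronized. A hedged sketch of the two call patterns (the driver handle is passed in as `cuda` so it can be stubbed; the real ctypes signatures take pointers and are simplified here):

```python
def check(status):
    # Mirror of tinygrad's check() helper: any nonzero driver status raises.
    if status != 0:
        raise RuntimeError(f"CUDA Error {status}")

def copyin_sync(cuda, dst, src):
    # cuMemcpyHtoD_v2 blocks until the copy finishes, so pageable host
    # memory (e.g. a plain bytes object) is safe to pass directly.
    check(cuda.cuMemcpyHtoD_v2(dst, src, len(src)))

def copyin_async(cuda, dst, src, stream=None):
    # cuMemcpyHtoDAsync_v2 returns before the copy completes, so the source
    # must be page-locked and kept alive until the stream is synchronized.
    # Staging through a cuMemHostAlloc buffer is the extra allocation that
    # the async path needs and the sync path does not.
    pinned = cuda.cuMemHostAlloc(len(src))
    pinned[:len(src)] = src
    check(cuda.cuMemcpyHtoDAsync_v2(dst, pinned, len(src), stream))
```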

@nimlgen
Collaborator

nimlgen commented Mar 26, 2024

I am seeing the same output for LLVM, CLANG, and CUDA (both async and synced memcpy). What command reproduces this?

nimlgen@tiny15:~/tinygrad$ CLANG=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:00<00:00, 546.90it/s]
loaded weights in 887.47 ms, 1.69 GB loaded at 1.91 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.33s/it]
Hello.

I am a very happy person. I
TIME:  23.280070781707764
Outputs Match: False
nimlgen@tiny15:~/tinygrad$ CUDA=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:01<00:00, 290.79it/s]
loaded weights in 1665.22 ms, 1.69 GB loaded at 1.02 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.19it/s]
Hello.

I am a very happy person. I
TIME:  4.596953630447388
Outputs Match: False
nimlgen@tiny15:~/tinygrad$ LLVM=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:00<00:00, 525.34it/s]
loaded weights in 923.80 ms, 1.69 GB loaded at 1.83 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00,  1.54s/it]
Hello.

I am a very happy person. I
TIME:  15.42816686630249
Outputs Match: False

@reddyn12
Contributor Author

CUDA=1 python3 examples/mamba.py

With async there's an out of memory error within the load_state_dict call. Also, it's weird you're getting False for output match on CLANG; the contiguous change fixed that issue. Are you on the updated branch? When I switched async to sync, CUDA returned True.

@reddyn12
Contributor Author

I'm dumb. Didn't see the prompt arg, ignore the output match part. If async works on tiny15, could it be a method that only works on multi-GPU setups? Because I tested it on my single GPU.

@nimlgen
Collaborator

nimlgen commented Mar 26, 2024

What GPU do you have? Can you rebase to master and retry?

@reddyn12
Contributor Author

I have a 3080. I'll be home tomorrow and can check again then. I'm pretty sure nvidia-smi showed CUDA 12.4.

@reddyn12
Contributor Author

https://github.com/reddyn12/tinygrad/tree/mamba_new is the fresh branch @nimlgen. Still getting the same error:

reddyn@Nikhil-3080:/mnt/d/Code/tinygrad$ CUDA=1 python3 examples/mamba.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.19 GB, backbone.layers.37.mixer.in_proj.weight           :  77%|████████████████████████████████████████████████▍              | 371/483 [00:01<00:00, 358.28it/s]
loaded weights in 1038.09 ms, 1.21 GB loaded at 1.17 GB/s
Traceback (most recent call last):
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 163, in alloc
    try: return super().alloc(size, options)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 151, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in _alloc
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/helpers.py", line 214, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in <lambda>
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 30, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/Code/tinygrad/examples/mamba.py", line 463, in <module>
    model = Mamba.from_pretrained(args.size)
  File "/mnt/d/Code/tinygrad/examples/mamba.py", line 395, in from_pretrained
    load_state_dict(model, weights)
  File "/mnt/d/Code/tinygrad/tinygrad/nn/state.py", line 71, in load_state_dict
    v.replace(state_dict[k].shard(mlb.device, mlb.axis) if isinstance((mlb:=v.lazydata), MultiLazyBuffer) else state_dict[k].to(v.device)).realize()
  File "/mnt/d/Code/tinygrad/tinygrad/tensor.py", line 139, in realize
    Tensor.corealize([self])
  File "/mnt/d/Code/tinygrad/tinygrad/tensor.py", line 136, in corealize
    run_schedule(create_schedule(flatten([x.lazydata.lbs if isinstance(x.lazydata, MultiLazyBuffer) else [x.lazydata] for x in lst])))
  File "/mnt/d/Code/tinygrad/tinygrad/engine/realize.py", line 57, in run_schedule
    if prg: prg.exec(cast(List[Buffer], real_buffers), si.var_vals)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 50, in exec
    et = self(rawbufs, var_vals)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 121, in __call__
    self.copy(dest, src)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 138, in copy
    else: super().copy(dest, src)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 116, in copy
    def copy(self, dest, src): dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 107, in copyin
    self.allocator.copyin(self._buf, mv)
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 135, in copyin
    host_mem = self.alloc(len(src), BufferOptions(host=True))
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 166, in alloc
    return super().alloc(size, options)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 151, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in _alloc
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/helpers.py", line 214, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in <lambda>
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 30, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

The synced function works, btw.
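Note that the traceback fails inside cuMemHostAlloc (pinning the host staging buffer), not in the device copy itself. One defensive pattern, purely a sketch and not tinygrad's actual code, is to fall back to the synchronous pageable-memory path when pinning fails (the three callables are hypothetical hooks standing in for the allocator and the two copy methods):

```python
def copyin_with_fallback(alloc_pinned, copy_async, copy_sync, src):
    # Fast path: stage src through page-locked memory and copy asynchronously.
    # If pinning fails (e.g. "CUDA Error 2, out of memory", as in this issue),
    # fall back to the slower synchronous copy, which accepts pageable memory.
    try:
        pinned = alloc_pinned(len(src))
    except RuntimeError:
        return copy_sync(src)
    pinned[:len(src)] = src
    return copy_async(pinned)
```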

@reddyn12
Contributor Author

Same error with:

CUDA=1 python3 examples/gpt2.py 

@reddyn12
Contributor Author

mamba and gpt2 work when I use my school's compute cluster. I have a strong feeling it's related to single-GPU systems.
