
copyin() broken in ops_cuda #3943

Open
reddyn12 opened this issue Mar 26, 2024 · 8 comments

Comments

@reddyn12
Contributor

#3456 - repro
The method used in copyin() was switched from cuda.cuMemcpyHtoD_v2 to cuda.cuMemcpyHtoDAsync_v2.
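For context, the two driver calls differ in blocking behavior: cuMemcpyHtoD_v2 blocks until the copy completes and accepts ordinary pageable host memory, while cuMemcpyHtoDAsync_v2 returns immediately and needs a page-locked (cuMemHostAlloc) staging buffer that stays alive until the stream is synchronized. A hedged sketch of the two call patterns (the driver handle is passed in as `cuda` so it can be stubbed; the real ctypes signatures take pointers and are simplified here):

```python
def check(status):
    # Mirror of tinygrad's check() helper: any nonzero driver status raises.
    if status != 0:
        raise RuntimeError(f"CUDA Error {status}")

def copyin_sync(cuda, dst, src):
    # cuMemcpyHtoD_v2 blocks until the copy finishes, so pageable host
    # memory (e.g. a plain bytes object) is safe to pass directly.
    check(cuda.cuMemcpyHtoD_v2(dst, src, len(src)))

def copyin_async(cuda, dst, src, stream=None):
    # cuMemcpyHtoDAsync_v2 returns before the copy completes, so the source
    # must be page-locked and kept alive until the stream is synchronized.
    # Staging through a cuMemHostAlloc buffer is the extra allocation that
    # the async path needs and the sync path does not.
    pinned = cuda.cuMemHostAlloc(len(src))
    pinned[:len(src)] = src
    check(cuda.cuMemcpyHtoDAsync_v2(dst, pinned, len(src), stream))
```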

@nimlgen
Collaborator

nimlgen commented Mar 26, 2024

I am seeing the same output for LLVM, CLANG, and CUDA (both async and synced memcpy). What command reproduces this?

nimlgen@tiny15:~/tinygrad$ CLANG=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:00<00:00, 546.90it/s]
loaded weights in 887.47 ms, 1.69 GB loaded at 1.91 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:23<00:00,  2.33s/it]
Hello.

I am a very happy person. I
TIME:  23.280070781707764
Outputs Match: False
nimlgen@tiny15:~/tinygrad$ CUDA=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:01<00:00, 290.79it/s]
loaded weights in 1665.22 ms, 1.69 GB loaded at 1.02 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:04<00:00,  2.19it/s]
Hello.

I am a very happy person. I
TIME:  4.596953630447388
Outputs Match: False
nimlgen@tiny15:~/tinygrad$ LLVM=1 python3 examples/mamba.py --prompt "Hello."
None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.49 GB, lm_head.weight                                    : 100%|█████████████████████████████████████████████████████████████████| 483/483 [00:00<00:00, 525.34it/s]
loaded weights in 923.80 ms, 1.69 GB loaded at 1.83 GB/s
Speed Gen: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00,  1.54s/it]
Hello.

I am a very happy person. I
TIME:  15.42816686630249
Outputs Match: False

@reddyn12
Contributor Author

CUDA=1 python3 examples/mamba.py

With async there's an out of memory error within the load_state_dict call. Also, it's weird you're getting False for output match on CLANG; the contiguous change fixed that issue. Are you on the updated branch? When I switched async to sync, CUDA returned True.

@reddyn12
Contributor Author

I'm dumb. Didn't see the prompt arg, ignore the output match part. If async works on tiny15, could it be a method that only works on multi-GPU setups? Because I tested it on my single GPU.

@nimlgen
Collaborator

nimlgen commented Mar 26, 2024

What GPU do you have? Can you rebase to master and retry?

@reddyn12
Contributor Author

I have a 3080. I'll be home tomorrow and can check again then. I'm pretty sure nvidia-smi showed CUDA 12.4.

@reddyn12
Contributor Author

https://github.com/reddyn12/tinygrad/tree/mamba_new is the fresh branch @nimlgen. Still getting the same error:

reddyn@Nikhil-3080:/mnt/d/Code/tinygrad$ CUDA=1 python3 examples/mamba.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ram used:  1.19 GB, backbone.layers.37.mixer.in_proj.weight           :  77%|████████████████████████████████████████████████▍              | 371/483 [00:01<00:00, 358.28it/s]
loaded weights in 1038.09 ms, 1.21 GB loaded at 1.17 GB/s
Traceback (most recent call last):
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 163, in alloc
    try: return super().alloc(size, options)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 151, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in _alloc
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/helpers.py", line 214, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in <lambda>
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 30, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/d/Code/tinygrad/examples/mamba.py", line 463, in <module>
    model = Mamba.from_pretrained(args.size)
  File "/mnt/d/Code/tinygrad/examples/mamba.py", line 395, in from_pretrained
    load_state_dict(model, weights)
  File "/mnt/d/Code/tinygrad/tinygrad/nn/state.py", line 71, in load_state_dict
    v.replace(state_dict[k].shard(mlb.device, mlb.axis) if isinstance((mlb:=v.lazydata), MultiLazyBuffer) else state_dict[k].to(v.device)).realize()
  File "/mnt/d/Code/tinygrad/tinygrad/tensor.py", line 139, in realize
    Tensor.corealize([self])
  File "/mnt/d/Code/tinygrad/tinygrad/tensor.py", line 136, in corealize
    run_schedule(create_schedule(flatten([x.lazydata.lbs if isinstance(x.lazydata, MultiLazyBuffer) else [x.lazydata] for x in lst])))
  File "/mnt/d/Code/tinygrad/tinygrad/engine/realize.py", line 57, in run_schedule
    if prg: prg.exec(cast(List[Buffer], real_buffers), si.var_vals)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 50, in exec
    et = self(rawbufs, var_vals)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 121, in __call__
    self.copy(dest, src)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 138, in copy
    else: super().copy(dest, src)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 116, in copy
    def copy(self, dest, src): dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 107, in copyin
    self.allocator.copyin(self._buf, mv)
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 135, in copyin
    host_mem = self.alloc(len(src), BufferOptions(host=True))
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 166, in alloc
    return super().alloc(size, options)
  File "/mnt/d/Code/tinygrad/tinygrad/device.py", line 151, in alloc
    return self._alloc(size, options if options is not None else BufferOptions())
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in _alloc
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/helpers.py", line 214, in init_c_var
    def init_c_var(ctypes_var, creat_cb): return (creat_cb(ctypes_var), ctypes_var)[1]
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 128, in <lambda>
    if options.host: return init_c_var(ctypes.c_void_p(), lambda x: check(cuda.cuMemHostAlloc(ctypes.byref(x), size, 0)))
  File "/mnt/d/Code/tinygrad/tinygrad/runtime/ops_cuda.py", line 30, in check
    if status != 0: raise RuntimeError(f"CUDA Error {status}, {ctypes.string_at(init_c_var(ctypes.POINTER(ctypes.c_char)(), lambda x: cuda.cuGetErrorString(status, ctypes.byref(x)))).decode()}")  # noqa: E501
RuntimeError: CUDA Error 2, out of memory

The synced function works, btw.
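Note that the traceback fails inside cuMemHostAlloc (pinning the host staging buffer), not in the device copy itself. One defensive pattern, purely a sketch and not tinygrad's actual code, is to fall back to the synchronous pageable-memory path when pinning fails (the three callables are hypothetical hooks standing in for the allocator and the two copy methods):

```python
def copyin_with_fallback(alloc_pinned, copy_async, copy_sync, src):
    # Fast path: stage src through page-locked memory and copy asynchronously.
    # If pinning fails (e.g. "CUDA Error 2, out of memory", as in this issue),
    # fall back to the slower synchronous copy, which accepts pageable memory.
    try:
        pinned = alloc_pinned(len(src))
    except RuntimeError:
        return copy_sync(src)
    pinned[:len(src)] = src
    return copy_async(pinned)
```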

@reddyn12
Contributor Author

Same error with:

CUDA=1 python3 examples/gpt2.py 

@reddyn12
Contributor Author

mamba and gpt2 work when I use my school's compute cluster. I have a strong feeling it's related to single-GPU systems.
