copyin() broken in ops_cuda #3943
Comments
I am seeing the same output for LLVM, CLANG, and CUDA (both async and synced memcpy). What command reproduces this?

CUDA=1 python3 examples/mamba.py. With async, I get an out-of-memory error inside the load-dict call. Also, it's weird that you're getting False for output matched on CLANG; the contiguous change fixed that issue. Are you on the updated branch? When I switched async to sync, CUDA returned True.
My mistake, I didn't see the prompt arg; ignore the output-match part. If async works on tiny15, could it be something that only works on multi-GPU setups? I tested it on my single GPU.
What GPU do you have? Can you rebase onto master and retry?
I have a 3080. I'll be home tomorrow and can check again then. I'm pretty sure nvidia-smi showed CUDA 12.4.
https://github.com/reddyn12/tinygrad/tree/mamba_new is the fresh branch, @nimlgen. I'm still getting the same error. The synced function works, by the way.
Same error with
mamba and gpt2 work when I use my school's compute cluster. I have a strong feeling it's related to single-GPU systems.
#3456 - repro
The method used in copyin was switched from cuda.cuMemcpyHtoD_v2 to cuda.cuMemcpyHtoDAsync_v2.
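To illustrate the class of bug this switch can introduce (a sketch, not tinygrad or CUDA code): an async host-to-device copy may read the host buffer at any point before the stream is synchronized, so if the caller reuses or frees the buffer right after the call returns, the device can receive the wrong data. A blocking copy like cuMemcpyHtoD_v2 never has this problem. The `FakeStream` class and its method names below are invented for the simulation; the real fix would be a stream synchronize (or pinned host memory) before the host buffer is touched again.

```python
# Simulation of async-copy hazards: the "device" copy runs on a worker
# thread, standing in for a DMA engine that starts at some later time.
import threading
import time

class FakeStream:
    """Hypothetical stand-in for a CUDA stream (illustration only)."""
    def __init__(self):
        self.tasks = []

    def memcpy_async(self, dst, src):
        # Schedule the copy; it may read `src` at any later time.
        t = threading.Thread(target=self._copy, args=(dst, src))
        self.tasks.append(t)
        t.start()

    def _copy(self, dst, src):
        time.sleep(0.1)   # the "DMA engine" kicks in later
        dst[:] = src[:]

    def synchronize(self):
        for t in self.tasks:
            t.join()
        self.tasks.clear()

host = [1, 2, 3]
device = [0, 0, 0]
s = FakeStream()

# Hazard: async copy, then the caller immediately reuses the host buffer.
s.memcpy_async(device, host)
host[:] = [9, 9, 9]           # reused before the copy actually ran
s.synchronize()
print(device)                  # [9, 9, 9] -- not the data we "copied"

# Safe pattern: synchronize before touching the host buffer again.
host[:] = [1, 2, 3]
s.memcpy_async(device, host)
s.synchronize()
host[:] = [9, 9, 9]
print(device)                  # [1, 2, 3]
```

This mirrors why swapping a blocking copyin for an async one is only safe if every caller's host buffer is guaranteed to stay valid and unmodified until a synchronize happens.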