memory problems with CUDA-based rings #138

Open · jaycedowell opened this issue Apr 1, 2020 · 2 comments
@jaycedowell (Collaborator)

A couple of times now I have run into problems passing data between blocks using CUDA-based rings. If I don't force a bifrost.device.synchronize_stream() within the reserve context for the ring, I end up with inconsistent results when reading from the ring in another block. I think what is happening is that the ring doesn't know about the asynchronous copies and happily marks the reserved segment as good to go when the reserve is released. Is there a better way to deal with this than sprinkling synchronize_stream() calls around?
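
For concreteness, a minimal sketch of the workaround (the ring and span method names follow the Bifrost Python ring API as I understand it; the gulp size and the GPU work are placeholders):

```python
import bifrost.device
from bifrost.ring import Ring

gulp_nbyte = 16384                  # placeholder gulp size

oring = Ring(space='cuda')          # ring whose buffers live in GPU memory
oring.resize(gulp_nbyte)

with oring.begin_writing() as owriter:
    with owriter.begin_sequence('example') as oseq:
        with oseq.reserve(gulp_nbyte) as wspan:
            # ... enqueue asynchronous GPU work / copies into wspan.data ...
            # Without this, the span would be released while that work may
            # still be in flight on this thread's CUDA stream:
            bifrost.device.synchronize_stream()
        # On exit, the span is marked ready for downstream readers.
```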

@benbarsdell (Collaborator)

Bifrost asynchronicity is based around CPU threads each having their own CUDA stream. All GPU work in a CPU thread must be synchronous with respect to that thread, so it must be followed by a stream synchronize before its outputs are released to other threads. (Using async CUDA APIs and then synchronizing on a per-CPU-thread stream ensures that GPU work is synchronous within a CPU thread but asynchronous between threads.)

E.g., the pipeline infrastructure does this for all blocks here:
https://github.com/ledatelescope/bifrost/blob/8a059b3/python/bifrost/pipeline.py#L462
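
Paraphrased as a sketch, each block thread follows this discipline (`gulp_spans` and `launch_gpu_work` are hypothetical placeholders, not Bifrost APIs):

```python
import bifrost.device

def block_main(iring, oring):
    """One Bifrost block, running in its own CPU thread / CUDA stream."""
    # gulp_spans() and launch_gpu_work() are hypothetical placeholders.
    for ispan, ospan in gulp_spans(iring, oring):
        # Enqueue asynchronous work on this thread's CUDA stream.
        launch_gpu_work(ispan, ospan)
        # Drain the stream so the results are complete...
        bifrost.device.synchronize_stream()
        # ...before ospan is released to blocks in other threads.
```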

@jaycedowell (Collaborator, Author)

Ok, thanks.

realtimeradio pushed a commit to realtimeradio/caltech-bifrost-dsp that referenced this issue on Jun 22, 2020:

Dummysource replaces the Ethernet input for throughput
testing, and is enabled with the command-line switch --fakesource.

Add xGPU averaging and subselection. The former has been "tested" in
that it outputs appropriate data when the pipeline is fed with
all ones.

With all threads active, the pipeline runs at ~40 Gb/s on my
old Xeon machine, seemingly processing-limited by my RTX 2060 GPU.

NB: Some synchronization barriers are probably needed, certainly
on the block which copies data to the GPU. See
ledatelescope/bifrost#138
realtimeradio pushed a commit to realtimeradio/caltech-bifrost-dsp that referenced this issue on Jun 23, 2020:

When blocks write to a ring across the CPU/GPU boundary,
this copy is [I think] asynchronous, and needs to be
synchronized before marking the destination buffer as ready
for consumption by downstream consumers.

See ledatelescope/bifrost#138
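
That is the same discipline again, applied to the copy block: synchronize the thread's stream after the host-to-device copy, before the output span is released. A rough sketch (assuming Bifrost's `copy_array` helper; the span handling is schematic):

```python
import bifrost.device
from bifrost.ndarray import copy_array

def fill_gpu_span(host_gulp, gpu_span_array):
    """Stage one gulp of host data into a reserved CUDA-space span."""
    # The CPU->GPU copy is enqueued asynchronously on this thread's
    # CUDA stream...
    copy_array(gpu_span_array, host_gulp)
    # ...so drain the stream before the reserve context exits and the
    # destination buffer is marked ready for downstream consumers.
    bifrost.device.synchronize_stream()
```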