Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[POTENTIAL BUG]: CopyToCpu is using refs in unsafe way but there is no indication of that. #1209

Open
En3Tho opened this issue Apr 19, 2024 · 3 comments

Comments

@En3Tho
Copy link

En3Tho commented Apr 19, 2024

Describe the bug

This routine calls CopyToCPUUnsafeAsync which does not pin/fix refs and simply calls Unsafe.AsPointer on them. This could potentially lead to situations where pointer could end up pointing at invalid memory region if GC kicks in while copy operation is ongoing.

I guess there is an implicit assumtion that all refs should be pinned before use. But I personally would have expected ref taking api to pin those manually, e.g. fixed p = ref ...

I think there are few ways to deal with that:

  1. Simply tell people that there are NO ref safety guarantees so it becomes a global implicit rule
  2. Explicitly state this in the comment
  3. Find such apis and actually use fixed then route to pointer-taking apis
  4. (I guess not possible) rename api to CopyToCPUUnsafe
public static void CopyToCPU<T, TView>(
      this TView source,
      AcceleratorStream stream,
      ref T cpuData,
      long length)
      where T : unmanaged
      where TView : IContiguousArrayView<T>
    {
      source.CopyToCPUUnsafeAsync<T, TView>(stream, ref cpuData, length);
      stream.Synchronize();
    }

Environment

  • ILGPU version: main

Steps to reproduce

In the description

Expected behavior

In the description

Additional context

No response

@En3Tho En3Tho added the bug label Apr 19, 2024
@En3Tho En3Tho changed the title [BUG]: CopyToCpu is using refs in unsafe way but there is no indication of that. [POTENTIAL BUG]: CopyToCpu is using refs in unsafe way but there is no indication of that. Apr 19, 2024
@m4rs-mt
Copy link
Owner

m4rs-mt commented Apr 22, 2024

Hi @En3Tho, thank you for raising this potential bug report and welcome to the ILGPU community! Indeed, these functions are considering access to either pinned managed/or even unmanaged pointers to be targeted directly from GPU drivers at the moment. I agree that we should improve documentation of these functions to help people understand potential side effects (as in undefined behavior...) and potentially enhance this function with fixed p = ... while providing a native overload using IntPtr or void* to avoid additional pinning attempts. What do you think?

@En3Tho
Copy link
Author

En3Tho commented Apr 23, 2024

I think it would be good to have different ref T and T* overloads, former being "safer oner". I understand that basically once you go into gpu land there is no such thing as "safety" but still it would be good to eliminate a possible cause of clr crash.

My main concern is that someone might try using this on web server for example where some tasks are a perfect fit for gpu. Active memory hogs like webservers are triggering gc quite often and it could really lead to the whole app crashing. Usually using ref T means that operation is guaranteed to be safe at least memory wise.

So an ideal solution in my opinion would be 2 overloads: ref T with a pinning (which is considered super cheap as cpu operation but it might introduce gc fragmentation) and T* for people who "know what they are doing". There could be as well a convinient Span<T> overload just to utilize the same overload for basically any kind of contigious memory (stack, gcheap, native heap etc` with at least some kind of bound checking etc just like it works in bcl.

@m4rs-mt m4rs-mt added enhancement and removed bug labels May 20, 2024
@m4rs-mt
Copy link
Owner

m4rs-mt commented May 20, 2024

Sounds reasonable to me. Please note, that we can add an additional overload without any limitations. However, changing the semantics of an existing function is not possible due to backwards compatibility (since it has been used for several years now). We could deprecate the current function, while keeping its semantics. Additionally, we can introduce an overload featuring T* and add another function CopyToCPUWithPinning (or something like this) accepting ref T while making it clear that this reference will be automatically pinned. Alternatively, we can use your Span<T> proposal while marking the "old" function as obsolete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants