[FEA] better oom injection #1609

Open
abellina opened this issue Dec 4, 2023 · 0 comments
abellina commented Dec 4, 2023

With the addition of #1543, and even before it, we have been thinking about improving the OOM injection mechanism, which currently looks like this:

/**
 * Force the thread with the given ID to throw a GpuRetryOOM on their next allocation attempt.
 * @param threadId the ID of the thread to throw the exception (not java thread id).
 * @param numOOMs the number of times the GpuRetryOOM should be thrown
 */
public void forceRetryOOM(long threadId, int numOOMs) {
  forceRetryOOM(getHandle(), threadId, numOOMs);
}

/**
 * Force the thread with the given ID to throw a GpuSplitAndRetryOOM on their next allocation attempt.
 * @param threadId the ID of the thread to throw the exception (not java thread id).
 * @param numOOMs the number of times the GpuSplitAndRetryOOM should be thrown
 */
public void forceSplitAndRetryOOM(long threadId, int numOOMs) {
  forceSplitAndRetryOOM(getHandle(), threadId, numOOMs);
}

/**
 * Force the thread with the given ID to throw a CudfException on their next allocation attempt.
 * @param threadId the ID of the thread to throw the exception (not java thread id).
 * @param numTimes the number of times the CudfException should be thrown
 */
public void forceCudfException(long threadId, int numTimes) {
  forceCudfException(getHandle(), threadId, numTimes);
}

We would like to do two things:

  1. Ensure we can inject GPU and CPU OOMs separately. Currently we inject an OOM, and the first allocation from either the host or the GPU triggers it.
  2. Add options so that the OOM is not always injected on the first allocation. An option to inject on the Nth allocation would be good. I believe there was talk about randomizing which allocation fails, but I am not sure how that would work if a unit test depends on it. I am adding it here for consideration.
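
To make the two requests concrete, here is a rough sketch of the bookkeeping they imply. Everything in it is hypothetical and for discussion only: `OomInjectionSketch`, `MemoryKind`, `InjectedOom`, `skipCount`, and the `randomized` factory are names invented for this illustration, not part of the existing API, and `InjectedOom` merely stands in for the real retry exceptions. The idea is that an injection is tied to one memory kind, lets a configurable number of matching allocations succeed first, and then fails the next `numOoms` matching allocations.

```java
import java.util.Random;

/** Hypothetical sketch only; none of these names exist in the project. */
final class OomInjectionSketch {
  /** Which allocator an injected OOM applies to. */
  enum MemoryKind { GPU, CPU }

  /** Stand-in for the real retry exceptions. */
  static final class InjectedOom extends RuntimeException {
    InjectedOom(String message) { super(message); }
  }

  private final MemoryKind target;
  private int skipCount; // matching allocations to let through before failing
  private int numOoms;   // injected failures still outstanding

  OomInjectionSketch(MemoryKind target, int skipCount, int numOoms) {
    if (skipCount < 0 || numOoms < 0) {
      throw new IllegalArgumentException("skipCount and numOoms must be >= 0");
    }
    this.target = target;
    this.skipCount = skipCount;
    this.numOoms = numOoms;
  }

  /**
   * Assumed randomized variant: the number of skipped allocations is drawn
   * from [0, maxSkip] using a caller-supplied seed, so that a failing run
   * could be replayed with the same seed.
   */
  static OomInjectionSketch randomized(MemoryKind target, int maxSkip,
                                       int numOoms, long seed) {
    int skip = new Random(seed).nextInt(maxSkip + 1);
    return new OomInjectionSketch(target, skip, numOoms);
  }

  /** Would be called at the start of every allocation attempt of the thread. */
  synchronized void onAllocation(MemoryKind kind) {
    if (kind != target || numOoms == 0) {
      return; // other memory kind, or nothing left to inject
    }
    if (skipCount > 0) {
      skipCount--; // let this matching allocation succeed
      return;
    }
    numOoms--;
    throw new InjectedOom("injected " + kind + " OOM");
  }
}
```

With `new OomInjectionSketch(MemoryKind.GPU, 2, 1)`, host allocations are never affected, the first two GPU allocations succeed, the third throws, and later ones succeed again. For the randomized case, logging the seed is one possible way to keep a unit test reproducible, though whether that is good enough for tests that depend on the exact failing allocation is still an open question.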