[FEA] better oom injection #1609

abellina · 2023-12-04T19:28:41Z

With the addition of #1543, and even before it, we have been thinking about improving the OOM injection mechanism:

spark-rapids-jni/src/main/java/com/nvidia/spark/rapids/jni/SparkResourceAdaptor.java

Lines 185 to 211 in 9c3c7a6

    
      /**
       * Force the thread with the given ID to throw a GpuRetryOOM on their next allocation attempt.
       * @param threadId the ID of the thread to throw the exception (not java thread id).
       * @param numOOMs the number of times the GpuRetryOOM should be thrown
       */
      public void forceRetryOOM(long threadId, int numOOMs) {
        forceRetryOOM(getHandle(), threadId, numOOMs);
      }

      /**
       * Force the thread with the given ID to throw a GpuSplitAndRetryOOM on their next allocation attempt.
       * @param threadId the ID of the thread to throw the exception (not java thread id).
       * @param numOOMs the number of times the GpuSplitAndRetryOOM should be thrown
       */
      public void forceSplitAndRetryOOM(long threadId, int numOOMs) {
        forceSplitAndRetryOOM(getHandle(), threadId, numOOMs);
      }

      /**
       * Force the thread with the given ID to throw a CudfException on their next allocation attempt.
       * @param threadId the ID of the thread to throw the exception (not java thread id).
       * @param numTimes the number of times the CudfException should be thrown
       */
      public void forceCudfException(long threadId, int numTimes) {
        forceCudfException(getHandle(), threadId, numTimes);
      }

We would like to do two things:

Ensure we can inject GPU and CPU OOMs separately. Currently we inject an OOM, and the first allocation that happens, whether from the host or the GPU, triggers it.
Add options so that we don't always inject the OOM on the first allocation. An option to inject on the Nth allocation would be good. I believe there was also talk of randomizing the allocation on which we fail. I am not sure how that would work if a unit test depends on the failure point, but I am adding it here for consideration.
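To make the two requests concrete, here is a minimal sketch of the bookkeeping they imply. Everything in it is hypothetical: `OomInjector`, `AllocType`, `forceOom`, `shouldThrow`, and `seededSkipCount` are illustrative names, not part of `SparkResourceAdaptor`, and the sketch leaves out the per-thread tracking that the real adaptor does. It only shows filtering by allocation type, skipping the first N matching allocations, and one possible way to keep a randomized failure point reproducible by deriving it from a seed.

```java
// Hypothetical sketch only; not the SparkResourceAdaptor API.
final class OomInjector {
  enum AllocType { CPU, GPU }

  private AllocType targetType;  // which kind of allocation the injection applies to
  private int skipCount;         // matching allocations to let through before failing
  private int remainingOoms;     // failures still to inject; 0 means nothing is armed

  /** Arm injection: skip `skipCount` allocations of `type`, then fail the next `numOoms`. */
  synchronized void forceOom(AllocType type, int numOoms, int skipCount) {
    if (type == null || numOoms < 0 || skipCount < 0) {
      throw new IllegalArgumentException("type must be set and counts must be non-negative");
    }
    this.targetType = type;
    this.remainingOoms = numOoms;
    this.skipCount = skipCount;
  }

  /** Called on every allocation attempt; returns true if this attempt should fail. */
  synchronized boolean shouldThrow(AllocType type) {
    if (remainingOoms <= 0 || type != targetType) {
      return false;  // nothing armed, or the other allocator: leave counters untouched
    }
    if (skipCount > 0) {
      skipCount--;   // a matching allocation that is allowed to succeed
      return false;
    }
    remainingOoms--;
    return true;
  }

  /** Derive a skip count in [0, bound) from a seed, so a "random" failure point can be replayed. */
  static int seededSkipCount(long seed, int bound) {
    return new java.util.Random(seed).nextInt(bound);
  }
}
```

With this shape, `forceOom(AllocType.GPU, 2, 1)` lets host allocations and the first GPU allocation succeed, then fails the next two GPU allocations. If randomization is wanted, a test could log the seed it passes to `seededSkipCount` so that a failing run can be repeated with the same failure point.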

abellina added ? - Needs Triage feature request labels Dec 4, 2023

abellina mentioned this issue Dec 4, 2023

Add host memory retries for GeneratedInternalRowToCudfRowIterator NVIDIA/spark-rapids#9929

Merged

mattahrens added reliability test labels Dec 5, 2023

mattahrens assigned revans2 Dec 5, 2023

mattahrens removed ? - Needs Triage feature request labels Dec 5, 2023
