Fix memory leak in zeroGradients() #2792

Open
wants to merge 5 commits into master

Conversation

@pdradx pdradx commented Sep 29, 2023

The code in zeroGradients() of NDArray doesn't close the gradient tensor after use.
The getGradient() method shares the underlying memory across calls, but on the DJL side each call creates a new native handle and a new NDArray instance, so we must close it after use.

There is another problem with this code that I haven't fixed yet:
zeroing the tensor by subtracting it from itself does not work for NaNs and infinities.
So if a NaN or infinity ends up in the gradient, we cannot recover from it by calling zeroGradients(). We would have to set the array instead of subtracting from it, but the semantics of the set() methods don't work uniformly across NDArrays of different shapes and single scalars.
For example, array.set("*", 0) works for arrays of all shapes, but does not work for a single scalar shape with no dimensions.

So this commit fixes the memory leak only.
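
For context, here is a minimal sketch of what the fixed method could look like, assuming the zeroing stays as an in-place subtract and only the missing close is added (an illustration, not the exact DJL source):

public void zeroGradients() {
  // Sketch only: getGradient() returns a fresh handle on every call (see above),
  // so it must be released after use; try-with-resources closes it.
  try (NDArray grad = getGradient()) {
    grad.subi(grad); // zero in place; NaN/Infinity values survive this, as discussed
  }
}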

SidneyLann and others added 4 commits September 19, 2023 17:36

* Implement PtNDArraryEx.multiboxDetection
* MultiboxDetection - code cleanup
* MultiboxDetection - code cleanup
* MultiboxDetection - code cleanup
* MultiboxDetection - code cleanup
* format code
* Fix, add tests, and pass CI

Co-authored-by: Administrator <Administrator@tech8>
Co-authored-by: KexinFeng <fenkexin@amazon.com>
Co-authored-by: Zach Kimberg <kimbergz@amazon.com>
@pdradx pdradx requested review from zachgk, frankfliu and a team as code owners September 29, 2023 03:28
codecov-commenter commented Sep 29, 2023

Codecov Report

Attention: 1375 lines in your changes are missing coverage. Please review.

Comparison is base (bb5073f) 72.08% compared to head (c668fda) 72.23%.
Report is 886 commits behind head on master.


Additional details and impacted files
@@             Coverage Diff              @@
##             master    #2792      +/-   ##
============================================
+ Coverage     72.08%   72.23%   +0.14%     
- Complexity     5126     7113    +1987     
============================================
  Files           473      702     +229     
  Lines         21970    31682    +9712     
  Branches       2351     3284     +933     
============================================
+ Hits          15838    22884    +7046     
- Misses         4925     7236    +2311     
- Partials       1207     1562     +355     
Files Coverage Δ
...ava/ai/djl/inference/streaming/StreamingBlock.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/metric/Dimension.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/metric/Unit.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/modality/audio/Audio.java 100.00% <100.00%> (ø)
api/src/main/java/ai/djl/modality/cv/Image.java 69.23% <ø> (-4.11%) ⬇️
...rc/main/java/ai/djl/modality/cv/MultiBoxPrior.java 76.00% <ø> (ø)
...ava/ai/djl/modality/cv/output/DetectedObjects.java 96.29% <100.00%> (+1.29%) ⬆️
...rc/main/java/ai/djl/modality/cv/output/Joints.java 71.42% <100.00%> (ø)
.../main/java/ai/djl/modality/cv/output/Landmark.java 100.00% <ø> (ø)
...i/djl/modality/cv/transform/RandomResizedCrop.java 94.11% <100.00%> (+5.22%) ⬆️
... and 226 more

... and 372 files with indirect coverage changes


@pdradx pdradx changed the title from "Fixes memory leak in zeroGradients()" to "Fix memory leak in zeroGradients()" on Sep 29, 2023
zachgk commented Sep 29, 2023

If this is a memory leak, wouldn't it be happening in all usages of getGradient? I don't think that is the semantics we were going for with that function.

Maybe a better approach might be to cache the gradient as a property of the MxNDArray and PtNDArray. Something like:

NDArray getGradient() {
  // Return the cached gradient handle if one was already created
  if (this.gradient != null) {
    return this.gradient;
  }
  // Otherwise fetch it from the engine once and cache the resulting NDArray
  NDArray gradient = ...
  this.gradient = gradient;
  return gradient;
}

That way, it would fix this memory leak along with other possible gradient leaks. As part of this, we may also want to close the gradient NDArray when closing the main NDArray, although it is likely to be closed anyway by the NDManager.
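
For illustration, that companion change could look roughly like this (the field and the surrounding class are assumptions, not the actual MxNDArray/PtNDArray members):

@Override
public void close() {
  // Release the cached gradient handle together with the owning array
  if (gradient != null) {
    gradient.close();
    gradient = null;
  }
  // ... existing close logic of the array itself ...
}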

As for the setting part, try with array.set("", 0). I feel like there was some error that was happening with that, but I could be misremembering.

pdradx commented Oct 2, 2023

Yes, of course! All usages of getGradient() must be closed after use...
And as for array.set("", 0): it creates an empty NDIndex under the hood, which doesn't work with scalar NDArrays.

pdradx commented Oct 2, 2023

But changing the semantics of getGradient() now would require fixing all the other places that already rely on the current behaviour and close the gradients themselves.

KexinFeng commented Oct 5, 2023

I don't know how relevant this is, but here is a previous solution regarding memory management: #2567, #2273

try (NDScope scope = new NDScope()) {
  scope.suppressNotUsedWarning();
  // ... create intermediate NDArrays; they are closed when the scope exits ...
  NDScope.unregister(NDArrays_to_keep); // keep only arrays that must outlive the scope
}

Example implementation: #2637
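
For reference, the same pattern applied to zeroGradients() might look like this (a sketch that assumes arrays created inside the scope are registered with it, and that the zeroing stays an in-place subtract as in the current code):

try (NDScope scope = new NDScope()) {
  scope.suppressNotUsedWarning();
  NDArray grad = getGradient(); // new handle, registered with the scope
  grad.subi(grad);              // zero in place
  // nothing needs to outlive the scope, so no unregister() call;
  // the gradient handle is closed when the scope exits
}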
