
Tensors are leaked when model.save() includes the optimizer #8238

Open
Vectorrent opened this issue Apr 10, 2024 · 3 comments
@Vectorrent
Contributor

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow.js): False
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Arch Linux
  • TensorFlow.js installed from (npm or script link): NPM
  • TensorFlow.js version: 4.17.0

Describe the current behavior
When using tfjs-node-gpu for training, I periodically save models to disk. However, my training has been crashing, and I've just learned why:

When model.save() includes the optimizer, a single tensor is leaked per save. This leads to a slow accumulation of unneeded tensors, and eventually crashes my machine:

await model.save(`file://saved_model`, { includeOptimizer: true })

To be clear, this is before saving a model:

{ unreliable: true, numTensors: 18, numDataBuffers: 18, numBytes: 420 }

And this is after:

{ unreliable: true, numTensors: 19, numDataBuffers: 19, numBytes: 424 }
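The one-tensor delta can be checked programmatically by snapshotting the tensor count around the save. Here is a minimal sketch of that check; the `fakeMemory`/`fakeSave` stand-ins are hypothetical mocks so it runs without tfjs installed — in a real session you would pass `tf.memory` and a closure over `model.save(...)`:

```javascript
// Measure how many tracked tensors an async operation leaks.
// `memory` is any function returning { numTensors } — tf.memory() in TFJS.
async function countLeaked(memory, op) {
  const before = memory().numTensors;
  await op();
  return memory().numTensors - before;
}

// Stand-ins for tf.memory() and the leaky model.save(), for illustration only.
let live = 18;
const fakeMemory = () => ({ numTensors: live });
const fakeSave = async () => { live += 1; }; // simulates one leaked tensor per save

countLeaked(fakeMemory, fakeSave).then((leaked) => {
  console.log(leaked); // 1, matching the before/after numbers above
});
```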

Describe the expected behavior
I would expect model-saving to dispose of all of its intermediate tensors after the operation is complete.

Standalone code to reproduce the issue
This bug is 100% reproducible in both tfjs-node and tfjs-node-gpu:

import fs from 'fs'
import * as tf from '@tensorflow/tfjs-node'

const model = tf.sequential()
model.add(tf.layers.dense({ units: 10, inputShape: [1] }))
model.add(tf.layers.dense({ units: 1 }))

model.compile({
    optimizer: 'adam',
    loss: 'meanSquaredError'
})

const xs = tf.tensor2d([1, 2, 3, 4], [4, 1])
const ys = tf.tensor2d([2, 4, 6, 8], [4, 1])

fs.mkdirSync('./saved_model', { recursive: true })

model.fit(xs, ys, {
    epochs: Infinity,
    verbose: 0,
    callbacks: {
        onEpochEnd: async (epoch, logs) => {
            console.clear()
            console.log(epoch)
            console.log(tf.memory())
            if (epoch % 1000 === 0 && epoch !== 0) {
                await model.save(`file://saved_model`, {
                    includeOptimizer: true
                })
            }
        }
    }
})

Other info / logs

  • There are no logs to provide, because TFJS OOM issues cause my computer to hard-freeze; they require a forcible shutdown to recover from.
  • If the includeOptimizer flag is disabled, then this does not occur.
@Vectorrent Vectorrent added the type:bug Something isn't working label Apr 10, 2024
@gaikwadrahul8 gaikwadrahul8 self-assigned this Apr 10, 2024
@gaikwadrahul8
Contributor

Hi, @Vectorrent

Thank you for bringing this issue to our attention. I tried to replicate the same behaviour on my macOS machine and I see the output below with the includeOptimizer: true flag. As you mentioned, the issue does not happen with includeOptimizer: false; I observed the same thing. So one workaround is to disable the includeOptimizer flag when saving the model. This avoids saving the optimizer state, preventing the leak; however, you'll then need to recreate the optimizer when loading the model. Alternatively, TensorFlow.js provides functions for manual memory management. You can try the following approach after each save; please refer to the official documentation for tf.tidy and tf.dispose:

await model.save(`file://saved_model`, { includeOptimizer: true });

// Manually dispose of the optimizer
model.optimizer.dispose();

// Dispose of other unused tensors
tf.dispose(xs);
tf.dispose(ys);

[screenshot: tf.memory() output after saving with includeOptimizer: true]

Please let me know if I have missed anything here. Thank you for your cooperation and patience.

@Vectorrent
Contributor Author

Thanks for the quick response. Sadly, tf.tidy() has no effect, and tf.dispose() crashes my training session (for obvious reasons). So neither of these is a "solution"; we should probably fix the underlying bug in the library. I might have some time to dig into the TFJS code and troubleshoot it, at some point.

Until then, my workaround is to: 1) create a manual training loop, 2) save the model, 3) unload the model, 4) re-load the model, and 5) resume training. Not a great solution, if you ask me 🤣
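For reference, that save → unload → reload cycle can be sketched generically. This is an illustration of the pattern, not tfjs API: the `trainFn`/`saveFn`/`disposeFn`/`loadFn` hooks are hypothetical stand-ins for `model.fit(...)`, `model.save(...)`, `model.dispose()`, and `tf.loadLayersModel(...)` followed by `model.compile(...)`, so the sketch runs without @tensorflow/tfjs-node installed:

```javascript
// Sketch of the save → dispose → reload workaround.
//   trainFn   — run some epochs on the model
//   saveFn    — persist the model to disk
//   disposeFn — free all tensors the old instance held, including leaked optimizer state
//   loadFn    — reload and recompile a fresh model instance
async function trainWithReload(model, { trainFn, saveFn, disposeFn, loadFn }, cycles) {
  for (let i = 0; i < cycles; i++) {
    await trainFn(model);
    await saveFn(model);
    disposeFn(model);       // drop everything the old instance held
    model = await loadFn(); // resume training on the fresh copy
  }
  return model;
}
```

The key point is that disposing the whole model instance reclaims whatever the save operation left behind, at the cost of a serialize/deserialize round trip every cycle.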

@Vectorrent
Contributor Author

Vectorrent commented Apr 11, 2024

I cannot for the life of me figure out how to build TFJS locally on my computer, so I'm not really able to debug or test this properly. Regardless, I've been digging, and this is probably where we need to apply a fix:
https://github.com/tensorflow/tfjs/blob/master/tfjs-layers/src/engine/training.ts#L2146

If I had to guess, maybe it's related to the use of io.concatenateArrayBuffers here? Apparently, it's deprecated, and we should be using tf.io.CompositeArrayBuffer.join() instead.
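For context, "concatenating array buffers" just means copying every input buffer into one contiguous backing store. A plain-JS illustration of that operation — not the tfjs implementation, only what io.concatenateArrayBuffers / tf.io.CompositeArrayBuffer.join() conceptually produce:

```javascript
// Copy a list of ArrayBuffers into one contiguous ArrayBuffer.
function joinArrayBuffers(buffers) {
  const total = buffers.reduce((n, b) => n + b.byteLength, 0);
  const out = new Uint8Array(total);
  let offset = 0;
  for (const b of buffers) {
    out.set(new Uint8Array(b), offset); // copy bytes at the running offset
    offset += b.byteLength;
  }
  return out.buffer;
}
```

Note the join itself allocates plain JS memory, not tracked tensors, so if this code path leaks a tensor it is more likely in how the weight data is read out before being joined.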
