
Getting OpenNMT-tf to train reproducibly #9

Open
atebbifakhr opened this issue Jan 13, 2020 · 13 comments

@atebbifakhr

Hi,

I'm trying to use this patch with TensorFlow 2.0, but training is still non-deterministic. I guess it is due to XLA optimization. How can I disable XLA?

Best,

@duncanriach
Collaborator

Hi @atebbifakhr,

My understanding is that XLA JIT compilation is not currently enabled by default in TensorFlow. I assume that you're not enabling XLA and therefore that, if there is in fact a source of non-determinism, it's not an XLA-originated op.

Can you tell me more about your model and settings? There remain various sources of non-determinism in TensorFlow which are not addressed by the patch.
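
For reference, here is a minimal sketch (assuming TF 2.x's config API) of how you could explicitly disable XLA JIT compilation if you want to rule it out; it is already off by default unless you opt in:

```python
import os
import tensorflow as tf

# Explicitly disable XLA JIT compilation (off by default in stock TF 2.x).
tf.config.optimizer.set_jit(False)

# Also make sure auto-clustering is not enabled via the environment.
os.environ["TF_XLA_FLAGS"] = "--tf_xla_auto_jit=0"
```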

@atebbifakhr
Author

atebbifakhr commented Jan 17, 2020

Hi @duncanriach,

I'm using tensorflow-gpu==2.0.0 and my model is a Transformer for seq2seq. I noticed the source of non-determinism is in tf.nn.softmax_cross_entropy_with_logits.

I decided to call tf.nn.softmax_cross_entropy_with_logits on the CPU to make my code deterministic. It works for the first computed gradients, but the following gradients are still non-deterministic. My guess is that optimizer.apply_gradients() is also non-deterministic.
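
A minimal sketch of that kind of CPU-pinning workaround (the function name and tensors are placeholders, not the actual notebook code):

```python
import tensorflow as tf

# Pin only the cross-entropy computation to the CPU while the rest of the
# model continues to run on the GPU.
def cross_entropy_on_cpu(labels, logits):
    with tf.device("/CPU:0"):
        return tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)
```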

@duncanriach
Collaborator

Until now, I was unaware of non-determinism issues with tf.nn.softmax_cross_entropy_with_logits, but I have started digging into it, and will add it to a list of things to look at and potentially fix.

I have personally never seen optimizer.apply_gradients() operate non-deterministically on a GPU, and many folks are now doing deterministic deep learning with TensorFlow, which makes it even less likely to be an issue.

You've also said that the computed gradients are non-deterministic. If non-determinism is appearing in the computed gradients, then it is, by definition, being injected before the gradients are applied. Another op in your model may be injecting non-determinism in back-prop.

I recommend making sure that the examples being fed into the model are deterministic and that your trainable variables are initialized deterministically. Once you have confirmed that, it's possible to debug and locate the source of non-determinism in the model. Unfortunately, I have not had time to release the debugging tool yet, which makes it harder for others to debug.
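
A minimal sketch of that kind of setup, assuming TF 2.x (the seed value is arbitrary, and TF_DETERMINISTIC_OPS applies to TF 2.1+; for TF 2.0 the patch from this repo serves the same purpose):

```python
import os
import random
import numpy as np
import tensorflow as tf

# Enable deterministic GPU op implementations (TF >= 2.1).
os.environ["TF_DETERMINISTIC_OPS"] = "1"

# Seed all relevant RNGs so data shuffling and variable initialization
# are reproducible between runs.
SEED = 123
random.seed(SEED)
np.random.seed(SEED)
tf.random.set_seed(SEED)

# Also seed any shuffling in the input pipeline, e.g.:
# dataset = dataset.shuffle(buffer_size=10000, seed=SEED)
```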

If you can provide me with a simple-as-possible, self-contained example that clearly demonstrates non-determinism, then I may be able to debug it relatively quickly and identify the source, or sources, of non-determinism in it. Self-contained means that all the files needed are provided, including training data or code that generates synthetic data. Simple-as-possible means that it's as simple as possible while still demonstrating the issue.

Also, I'm assuming that the seq2seq model you're using is Google's Seq2seq. Please confirm.

@duncanriach duncanriach changed the title How to disable XLA? Getting seq2seq to operate reproducibly Jan 17, 2020
@duncanriach duncanriach changed the title Getting seq2seq to operate reproducibly [debug] Getting seq2seq to operate reproducibly Jan 17, 2020
@duncanriach duncanriach changed the title [debug] Getting seq2seq to operate reproducibly Getting seq2seq to operate reproducibly Jan 17, 2020
@atebbifakhr
Author

I prepared this notebook so that you can replicate the problem.
Actually, I'm using the OpenNMT-tf toolkit. However, the problem is not related to the toolkit. If you change tf.nn.sparse_softmax_cross_entropy_with_logits to something else, the code becomes deterministic.

@duncanriach
Collaborator

duncanriach commented Jan 21, 2020

Thanks for providing that code, @atebbifakhr! Nice and simple and self-contained. I love it. I have been able to reproduce the non-determinism, but not the determinism when the cross-entropy op is removed. It seems that the two pkl files generated in that case still differ. Perhaps I'm doing something wrong though.

Please will you run again and confirm that you're definitely seeing the pkl files matching when you remove the cross-entropy op?

In any case, this example is great because it gives me something specific to run and debug.
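
For what it's worth, this is the kind of check I'm doing; the file names and the structure of the pickled objects are assumptions, not the notebook's exact code:

```python
import pickle
import numpy as np

# Load two .pkl dumps from two runs and check that every array matches exactly.
def pkl_files_match(path_a, path_b):
    with open(path_a, "rb") as f:
        a = pickle.load(f)
    with open(path_b, "rb") as f:
        b = pickle.load(f)
    return all(np.array_equal(x, y) for x, y in zip(a, b))

print(pkl_files_match("run1.pkl", "run2.pkl"))  # hypothetical file names
```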

@duncanriach
Collaborator

duncanriach commented Jan 22, 2020

Hey, I'm running this locally so that I can instrument and debug it. My machine contains a 12GB TITAN V. I'm getting this error:

tensorflow.python.framework.errors_impl.ResourceExhaustedError:  OOM when allocating tensor with shape[12544,32001] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc

Are you familiar with this error and how to resolve it?

@duncanriach
Collaborator

duncanriach commented Jan 22, 2020

In the model, I reduced num_units from 512 down to 32 and ffn_inner_dim from 2048 down to 128 for both the encoder and the decoder. This resolved the problem. The machine under my Colab is an NVIDIA Tesla T4 with 16GB of GPU memory. I wonder if the model, as configured, fit into 16GB but not into 12GB.
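
Roughly, the reduced configuration looks like the sketch below (assuming the OpenNMT-tf 2.x API; this is not the exact notebook code, and the embedding sizes and layer/head counts are assumptions):

```python
import opennmt as onmt

# Shrunken Transformer: num_units 512 -> 32, ffn_inner_dim 2048 -> 128,
# so the model fits in 12GB of GPU memory.
model = onmt.models.Transformer(
    source_inputter=onmt.inputters.WordEmbedder(embedding_size=32),
    target_inputter=onmt.inputters.WordEmbedder(embedding_size=32),
    num_layers=6,
    num_units=32,
    num_heads=4,
    ffn_inner_dim=128,
)
```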

Anyway, I am able to locally reproduce the non-determinism and also the determinism (without the cross-entropy op). I'm not sure why I could not reproduce the determinism on Colab; possibly operator error, since the process is very manual.

Well done for isolating this source of non-determinism! Thank you.

I also want to acknowledge that all of the work that has gone into TensorFlow determinism so far made it so that it was possible to isolate a single op as a source of non-determinism without using the non-determinism debugging tool. This is because removing that one op reveals the underlying determinism that we now have.

I intend to instrument this model and confirm the non-determinism and also that the cross-entropy op is the only source. Then we can look at potential fixes or work-arounds.

@atebbifakhr
Author

Hi @duncanriach, thanks for your reply. It's strange that you still saw non-determinism after removing the cross-entropy op; that never happened to me! It's fine to reduce the model size if it doesn't fit into memory, but sometimes you might need to run a couple of times to see the non-determinism.

Anyway, thanks for your effort, looking forward to hearing from you.

@atebbifakhr
Author

Hi @duncanriach,
Any update on this issue? Could you confirm the non-determinism?

@duncanriach
Collaborator

Hey @atebbifakhr, sorry, I have not gotten to this yet. I will as soon as I can and get back to you.

@duncanriach
Collaborator

Hi @atebbifakhr, I looked into this more deeply. Removing tf.nn.sparse_softmax_cross_entropy_with_logits from the loss function only makes the gradients reproducible for the first step. They still go non-deterministic on the second step. The trainable variables actually go non-deterministic on the first step (somehow) regardless of whether tf.nn.sparse_softmax_cross_entropy_with_logits is in the loss function.

The fact that the gradients are deterministic for the first step but the trainable variables are not suggests that non-determinism is being introduced in the gradient update step. I hope to continue investigating soon.
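
This is the kind of instrumentation involved; a hypothetical sketch, not the actual debugging tool mentioned above:

```python
import numpy as np

# Summarize a list of tensors (gradients or variables) so that two runs
# can be diffed step by step.
def summarize(tensors, label, step):
    total = sum(float(np.sum(np.abs(t.numpy()))) for t in tensors)
    print(f"step {step} {label}: {total!r}")

# Inside the training loop (sketch):
#   grads = tape.gradient(loss, model.trainable_variables)
#   summarize(grads, "grads", step)
#   optimizer.apply_gradients(zip(grads, model.trainable_variables))
#   summarize(model.trainable_variables, "vars", step)
```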

@duncanriach duncanriach changed the title Getting seq2seq to operate reproducibly Getting OpenNMT-tf to train reproducibly Mar 24, 2020
@duncanriach
Collaborator

duncanriach commented Apr 2, 2020

Hi @atebbifakhr,

After further investigation, there seem to be two or three sources of non-determinism in this system.

  1. Confirmed that back-prop of tf.nn.sparse_softmax_cross_entropy_with_logits does inject non-determinism. Opened TensorFlow issue 38185.
  2. Discovered that tf.keras.optimizers.Optimizer::apply_gradients seems to inject non-determinism into the trainable state of the source and target inputters (instances of WordEmbedder) at the end of the first training step. This is mitigated by making the batch size smaller, but I don't know why. In the configuration that I am running, setting the batch size to 1 appears to make the state of the inputters deterministic at the end of the first step.
  3. Discovered that the source and target inputters also inject non-determinism in the forward path by making the samples applied to the model non-reproducible on the second step and onwards (when the state of the inputters is deterministic from the previous step).

There is more work to do on this issue, but I wanted to give you an interim update.
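
Regarding item 1, here is a hedged sketch of one possible workaround: compute the same loss from a log-softmax plus a one-hot projection instead of the fused tf.nn.sparse_softmax_cross_entropy_with_logits kernel, so that back-prop goes through different kernels. I have not verified that it is numerically identical in all cases.

```python
import tensorflow as tf

# Alternative sparse cross-entropy that avoids the fused kernel.
def sparse_xent_alternative(labels, logits):
    log_probs = tf.nn.log_softmax(logits, axis=-1)
    one_hot = tf.one_hot(labels, depth=tf.shape(logits)[-1], dtype=log_probs.dtype)
    return -tf.reduce_sum(one_hot * log_probs, axis=-1)
```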

I've also added your name to the credits section of this repo in recognition of your effort in enabling me to reproduce and isolate the problems you've been seeing.

@duncanriach
Collaborator

duncanriach commented Apr 8, 2020

I updated my previous comment to include additional information that came from further investigation.
