This repository has been archived by the owner on Nov 3, 2023. It is now read-only.

Question: Running generation with batches #3484

Open
thies1006 opened this issue Mar 2, 2021 · 13 comments

Comments

@thies1006

Hello!
I'm generating texts with blender_3B like this (all options are default, except "model_parallel=False"):

from parlai.core.agents import create_agent
from parlai.core.message import Message

# opt is the usual ParlAI options dict, built elsewhere
agent = create_agent(opt, requireModelExists=True)
agent_copies = []
agent_copies.append(agent.clone())
agent_copies.append(agent.clone())  # comment this out for 2nd try

act_0 = Message({'id': 'context_0', 'text': 'hello', 'episode_done': False})
act_1 = Message({'id': 'context_1', 'text': 'hello', 'episode_done': False})  # comment this out for 2nd try

observations = []
observations.append(agent_copies[0].observe(act_0))
observations.append(agent_copies[1].observe(act_1))  # comment this out for 2nd try

response = agent.batch_act(observations)

I get the following results for batch_size=2 (both predictions are exactly the same; I just cut the rest off for readability):

[{'id': 'TransformerGenerator', 'episode_done': False, 'text': "Hi! How are you? I just got back from a long day at work. I'm 
exhausted!", 'beam_texts': [("Hi! How are you? I just got back from a long day at work. I'm exhausted!", -9.483329772949219), 
("Hi! How are you? I just got back from a long day at work. I'm exhausted.", -9.512072563171387), ('Hi! How are you? I just got 
back from walking my dog. I love to walk.', -9.5917387008667), ....

However when I remove the second item in the batch I get:

[{'id': 'TransformerGenerator', 'episode_done': False, 'text': 'Hi! How are you? I just got back from walking my dog. I love to walk.', 
'beam_texts': [('Hi! How are you? I just got back from walking my dog. I love to walk.', -9.591983795166016), ('Hi! How are you? 
I just got back from walking my dog. I love to walk!', -9.753303527832031), ("Hi! How are you? I just got off the phone with my 
mom, she's having some health problems.", -9.938494682312012)

Now the question is, of course: shouldn't the predictions be the same in all cases, given that the inputs are identical? Or is this a numerical issue? I couldn't find an example of how to run generation with batches, so I wasn't sure whether I'm actually doing this the correct way.

The ParlAI code is from today.
Python 3.7.5
Ubuntu 18.04 LTS

@stephenroller
Contributor

The following script produces the exact same output for me, regardless of batch size:

#!/usr/bin/env python3

BS = 2

from parlai.core.agents import create_agent_from_model_file

agent = create_agent_from_model_file(
    "zoo:blender/blender_3B/model", {'model_parallel': False}
)
clones = [agent.clone() for _ in range(BS)]
acts = []
for index in range(BS):
    acts.append(clones[index].observe({'text': 'hello', 'episode_done': True}))
responses = agent.batch_act(acts)
for i in range(BS):
    print(responses[i]['text'])

In all instances, my output is always "Hi! How are you? I just got home from work. I work at a grocery store."

A few possible confounders you may be experiencing:

  • Is this script truly isolated, or are you running something like a self-chat script or interactive mode? Is it possible personas are being set differently across the batch indices?
  • Is the script you've shown above truly being run separately? The "episode_done": False bit means that if you just run it twice in the same script, without resetting or making new clones, the model will be responding to "hello\nhello" instead of just "hello" (a sketch of this is shown right below).
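
As a minimal sketch of that second point (assuming the same zoo:blender/blender_3B model as in the script above; reset() is the standard ParlAI agent reset), the difference between a reused clone and a fresh one looks like this:

from parlai.core.agents import create_agent_from_model_file

agent = create_agent_from_model_file(
    "zoo:blender/blender_3B/model", {'model_parallel': False}
)
clone = agent.clone()

# First turn: the clone's history is just "hello".
clone.observe({'text': 'hello', 'episode_done': False})

# A second observe() without a reset: because episode_done was False, the
# accumulated history is now "hello\nhello", so the model answers a different input.
clone.observe({'text': 'hello', 'episode_done': False})

# Either reset the clone or pass 'episode_done': True to start from a clean history.
clone.reset()
clone.observe({'text': 'hello', 'episode_done': True})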

@thies1006
Author

Great, thank you for the quick reply!
Unfortunately, with your script I get the same results as before:

BS=2:
Hi! How are you? I just got back from a long day at work. I'm exhausted!
Hi! How are you? I just got back from a long day at work. I'm exhausted!

BS=1:
Hi! How are you? I just got back from walking my dog. I love to walk.

The log probs are also identical to before (-9.483329772949219 vs. -9.591983795166016 for the 'best' decoded texts, as above), so I think my script was doing the same thing. I have no idea what could possibly be the problem here. Any ideas?

My setup:
Python 3.7.5
Ubuntu 18.04 LTS
PyTorch 1.7.1
CUDA 10.1.243
CUDA driver 455.45.01

@thies1006
Author

Let me add:

  • Switching off the GPU seems to solve the problem. By setting CUDA_VISIBLE_DEVICES=-1 I get (nearly) the same scores across different batch sizes. The generated text is always "Hi! How are you? I just got back from walking my dog. I love to walk."
  • With the GPU I found larger variations in general as the batch size varies; for BS=50 this is a bit more dramatic, see the numbers below (a sketch of the sweep follows the table). The generated texts vary as well.

        BS=1                 BS=2                 BS=50                BS=100
CPU     -9.590900421142578   -9.590903282165527   -9.590904235839844   -9.590904235839844
GPU     -9.591983795166016   -9.483329772949219   -8.578133583068848   -9.588774681091309
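
For reference, a minimal sketch of the kind of sweep that produces these numbers (assuming, as above, that the first entry of 'beam_texts' is a (text, log prob) pair for the top beam):

from parlai.core.agents import create_agent_from_model_file

agent = create_agent_from_model_file(
    "zoo:blender/blender_3B/model", {'model_parallel': False}
)

# For each batch size, batch the same "hello" input and record the top beam score.
for bs in (1, 2, 50, 100):
    clones = [agent.clone() for _ in range(bs)]
    acts = [c.observe({'text': 'hello', 'episode_done': True}) for c in clones]
    responses = agent.batch_act(acts)
    top_text, top_score = responses[0]['beam_texts'][0]
    print(bs, top_score, top_text)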

@stephenroller
Contributor

We must be using different models. Mine never said it has a dog.

I expect small floating point errors, but BS=50 is wildly off, jeez. BS=2 is too. What GPU are you using?

Are you doing anything other than out-of-the-box ParlAI? Can you replicate this on master?

BS=2:
Hi! How are you? I just got back from a long day at work. I'm exhausted!
Hi! How are you? I just got back from a long day at work. I'm exhausted!

BS=1:
Hi! How are you? I just got back from walking my dog. I love to walk.

Those are the outputs you got from the script I pasted above?

@thies1006
Author

First, to your questions, sorry if it wasn't 100% clear.

  • Yes, those are the results I got from your script. Copied, pasted, and run.
  • ParlAI was freshly installed from scratch, and the model was downloaded just before running (automatic download by ParlAI).
  • The GPU is always a Titan RTX.

What I did today:

  • Installed CUDA 11.1.105 on a different machine
  • Installed NVIDIA driver 460.32.03
  • Checked out PyTorch v1.8.0, compiled it on this machine, and installed it into a fresh environment (installing PyTorch via pip results in an error when running ParlAI: RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasCreate(handle)` during the forward pass).
  • Installed ParlAI via git clone && python setup.py develop in the environment. There was a dependency problem with torchtext, so I had to skip it (removed from requirements.txt): error: torch 1.8.0a0+37c1f4a is installed but torch==1.8.0 is required by {'torchtext'}. Could this potentially be important?
  • Ran your script

Results:

        BS=1                 BS=2                 BS=50                BS=100
CPU     -9.590900421142578   -9.590903282165527   -9.590904235839844   -9.590904235839844
GPU     -9.585407257080078   -8.580317497253418   -9.478259086608887   -8.576322555541992

  • The CPU values are exactly the same as before.
  • The GPU values are yet another different set. I ran each BS several times; the values didn't change for a constant BS, and they also don't change with the position in the batch.

Now the new thing:

  • When running with model_parallel=True and 6 GPUs I again get a different set of values, and the values also change within the batch. This always seems to happen at the end of the batch. I'm attaching an example log for readability (a sketch of the model-parallel run is further below).

log_modelparallel_true.txt

P.S. The text corresponding to the best score of -8.57.. has always been the above-mentioned "Hi! How are you? I just got home from work. I work at a grocery store."
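
For completeness, a sketch of the model-parallel variant of the earlier script (only the model_parallel override changes; which GPUs are visible is controlled by CUDA_VISIBLE_DEVICES in the environment):

from parlai.core.agents import create_agent_from_model_file

BS = 50
# Same as before, but the 3B model is sharded across the visible GPUs.
agent = create_agent_from_model_file(
    "zoo:blender/blender_3B/model", {'model_parallel': True}
)
clones = [agent.clone() for _ in range(BS)]
acts = [c.observe({'text': 'hello', 'episode_done': True}) for c in clones]
responses = agent.batch_act(acts)
for resp in responses:
    print(resp['beam_texts'][0])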

@stephenroller
Contributor

Hm, we haven't tried PyTorch 1.8 in open-source land yet, so I can't vouch for that. We've used it plenty in internal use cases and not had problems, but I can't rule out that PyTorch 1.8 has issues separate from yours.

What if you turn off model parallelism? What if you use CUDA_VISIBLE_DEVICES to limit yourself to 4 GPUs? To 2?

Can you try another power of 2? BS 4, 8, etc.? It's interesting that BS 2 and 100 get the same off value. Makes me suspicious.

  • I get -9.47 and change with BS 2 (MP on -9.474090576171875, MP off -9.476009368896484).
  • BS 8 (MP on) gets -9.476009368896484
  • With BS 50 and MP I witness the problem in all but 2 positions :-O
  • With BS 48 and MP I witness it everywhere
  • Also with BS 32 and MP, and with BS 16 and MP
  • With BS 16 and no MP everything looks right as rain.

So there's definitely something wrong with model parallelism... I reverted #3326 and things look consistent, so I must have gotten something wrong in that PR.

@stephenroller
Contributor

With the reversion and BS=50 I still observe a few weird outputs.

@thies1006
Author

I was printing tensors to find where the differences occur and the first one I found is here:
https://github.com/facebookresearch/ParlAI/blob/master/parlai/agents/transformer/modules.py#L1331
(line: x = self.lin2(x))

I was looking only at the very first occurrence (so the first layer of the encoder).

Input text was always 'hello'.
model_parallel=False

Tensor before lin2:

BS=1
tensor([[[-2.9385e-05, -5.8115e-05, -4.0833e-02, 9.7595e-02, -5.5027e-04, -1.0669e-01, -1.8280e-02, -1.5366e-04, 2.7344e-01, -5.0598e-02, ...

BS=2
tensor([[[-2.9385e-05, -5.8115e-05, -4.0833e-02, 9.7595e-02, -5.5027e-04, -1.0669e-01, -1.8280e-02, -1.5366e-04, 2.7344e-01, -5.0598e-02, ...

Tensor after lin2:

BS=1
tensor([[[-0.5508, 0.0132, -2.0469, 1.8398, 1.9492, 0.3269, 3.0977, 0.7681, -1.9385, -1.0479, ..., -1.3574, 1.6406, -0.0542, ...

BS=2
tensor([[[-0.5508, 0.0132, -2.0488, 1.8408, 1.9492, 0.3284, 3.0996, 0.7676, -1.9404, -1.0488, ..., -1.3564, 1.6416, -0.0548, ...

Note: the scores were -9.58.. for BS=1 and -8.58.. for BS=2 (according to the second table).

On CPU all tensors seem exactly identical; I didn't find a single differing digit. However, the CPU tensors are also all different from the GPU ones.

So apparently this is just normal floating-point precision. Under the hood, PyTorch (or cuBLAS, whatever) seems to handle this linear layer differently for different batch sizes(?). A standalone check of this is sketched below.
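
That hypothesis can be checked outside ParlAI with a tiny standalone snippet (a sketch; the layer size and input values are arbitrary): the same row pushed through a linear layer alone versus inside a larger batch can come back slightly different on GPU, because cuBLAS may pick different kernels for different shapes.

import torch

torch.manual_seed(0)
lin = torch.nn.Linear(1024, 1024).cuda().eval()
x = torch.randn(1, 16, 1024, device='cuda')

with torch.no_grad():
    out_bs1 = lin(x)                      # the row as a batch of 1
    out_bs2 = lin(x.repeat(2, 1, 1))[:1]  # the same row inside a batch of 2

# On GPU these may not be bit-identical; any difference is a few ULPs.
print(torch.equal(out_bs1, out_bs2))
print((out_bs1 - out_bs2).abs().max().item())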

@stephenroller
Contributor

If you have time (I don't immediately), can you trace through with model parallel and non-model parallel and see where things diverge?
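
One possible way to do that trace (a sketch, not ParlAI tooling; agent.model is assumed to be the underlying torch module): register forward hooks on every submodule, record the outputs for a model-parallel and a non-model-parallel run, and diff them module by module.

import torch

def capture_activations(model):
    """Attach forward hooks that record each submodule's output tensor."""
    activations = {}

    def make_hook(name):
        def hook(module, inputs, output):
            if torch.is_tensor(output):
                activations[name] = output.detach().float().cpu()
        return hook

    handles = [
        m.register_forward_hook(make_hook(name)) for name, m in model.named_modules()
    ]
    return activations, handles

def first_divergence(acts_a, acts_b, atol=1e-5):
    """Return the first module name whose recorded outputs differ."""
    for name, tensor in acts_a.items():
        if name in acts_b and not torch.allclose(tensor, acts_b[name], atol=atol):
            return name
    return None

# Usage sketch: call capture_activations(agent.model) for a model_parallel=True
# agent and for a model_parallel=False agent, run one batch_act() on each, and
# pass the two activation dicts to first_divergence().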

@github-actions

github-actions bot commented Apr 9, 2021

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.

@stephenroller
Contributor

Bump to keep this open

@github-actions

This issue has not had activity in 30 days. Please feel free to reopen if you have more issues. You may apply the "never-stale" tag to prevent this from happening.
