Error when converting NMT model with ALiBi or RoPe #1657

Open
randomicity opened this issue Apr 7, 2024 · 15 comments
Labels: enhancement (New feature or request), question (Further information is requested)

Comments

@randomicity

Hello, thank you for a great project!

I am getting this error when using ALiBi or RoPE positional encoding in a transformer NMT model from OpenNMT-py:

KeyError: 'encoder.embeddings.make_embedding.pe.pe'

Absolute and relative positional encodings are working with the converter script.

@minhthuc2502
Collaborator

Can you provide more detail on how you run the converter?

@randomicity
Author

ct2-opennmt-py-converter --model_path test_step_100.pt --output_dir test
Traceback (most recent call last):
File "/home/username/anaconda3/envs/translator/bin/ct2-opennmt-py-converter", line 8, in <module>
sys.exit(main())
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 355, in main
OpenNMTPyConverter(args.model_path).convert_from_args(args)
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
return self.convert(
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/converter.py", line 89, in convert
model_spec = self._load()
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 200, in _load
return _get_model_spec_seq2seq(
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 90, in _get_model_spec_seq2seq
set_transformer_spec(model_spec, variables)
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 210, in set_transformer_spec
set_transformer_encoder(spec.encoder, variables)
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 215, in set_transformer_encoder
set_input_layers(spec, variables, "encoder")
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 241, in set_input_layers
set_position_encodings(
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 341, in set_position_encodings
spec.encodings = _get_variable(variables, "%s.pe" % scope).squeeze()
File "/home/username/anaconda3/envs/translator/lib/python3.10/site-packages/ctranslate2/converters/opennmt_py.py", line 345, in _get_variable
return variables[name]
KeyError: 'encoder.embeddings.make_embedding.pe.pe'

@LynxPDA

LynxPDA commented Apr 27, 2024

I faced the same problem. Unfortunately, python/ctranslate2/converters/opennmt_py.py currently only supports ALiBi or RoPE for decoder_type == "transformer_lm" (LLMs) and does not support them for seq2seq models.

Also, unfortunately, there is no support for the gated-gelu activation (it can be added in literally two lines, whereas ALiBi or RoPE is much harder to add, in my opinion).

In my experiments, seq2seq with a combination of RoPE and gated-gelu converges much faster when trained via OpenNMT-py. The training speed (tok/s) is also 10% higher than with relative positional embeddings.

If someone could add seq2seq support for ALiBi, RoPE and gated-gelu to the python/ctranslate2/converters/opennmt_py.py converter, it would make it possible to speed up inference of much higher-quality translation models.
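For reference, the gated-gelu feed-forward being discussed splits the up-projection into an activation path and a linear gate. A minimal NumPy sketch, purely illustrative: the `gated_gelu_ffn` helper is not converter code, the w1/w3/w2 names follow OpenNMT-py's PositionwiseFeedForward layout, and biases and dropout are omitted:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU (the "fast_gelu" flavor)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def gated_gelu_ffn(x, w1, w3, w2):
    # GEGLU: gelu(x @ w1) is the activation path, x @ w3 is the linear
    # gate, and w2 projects the gated product back to the model dimension.
    return (gelu(x @ w1) * (x @ w3)) @ w2
```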

@vince62s
Member

vince62s commented Apr 29, 2024

It should not be that difficult to add rope / alibi to the encoder. But do you have numbers to support the benefit?

If you use "silu" it uses a gated FFN, so it should be supported out of the box, no?
EDIT: it's missing in the encoder too.

I'll add them
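For anyone picking this up: ALiBi needs no learned weights at all, each head just adds a distance-proportional penalty to the attention logits, which is partly why it is easy to add to an encoder. A rough NumPy sketch of the bias tensor (the `alibi_bias` helper is an assumption for illustration; the symmetric |i - j| form is one common choice for bidirectional encoders, while causal decoders penalize only past positions):

```python
import numpy as np

def alibi_bias(n_heads, seq_len):
    # Head-specific slopes form a geometric sequence, as in the ALiBi paper:
    # 2^(-8/n), 2^(-16/n), ... (exact when n_heads is a power of two).
    slopes = 2.0 ** (-8.0 * np.arange(1, n_heads + 1) / n_heads)
    # Penalize each (query i, key j) pair by slope * |i - j|.
    dist = np.abs(np.arange(seq_len)[None, :] - np.arange(seq_len)[:, None])
    return -slopes[:, None, None] * dist  # shape (n_heads, seq_len, seq_len)
```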

@LynxPDA

LynxPDA commented Apr 29, 2024

Here are more specific numbers and observations:

  1. Ultimately, the models with RPE (relative positional embeddings) and RoPE converged.
  2. However, as I said earlier, the tok/s speed of both training and validation is higher for the model with RoPE.
  3. But the model with GELU activation turned out noticeably worse than the one with gated-gelu, and I stopped its training early.
  4. The experiment with the RPE (relative positional embeddings) model is not entirely clean, since its training did not start from scratch but from step 1000.

Here is a description of the models on screenshots:

  1. All models were trained on the same dataset.
  2. The effective batch size was the same.
  3. The learning rate and all other parameters were the same, only the activation functions and the type of positional embeddings changed.
  4. The models themselves:
    • Apr-21 - (red color) - RoPE + gated-gelu
    • Apr-23 (yellow color) - RoPE + gelu
    • Apr-26 (purple color) - RPE (relative positional embeddings) + gated-gelu

[screenshot: progress acc]

[screenshot: progress speed (tok/s)]

[screenshot: valid speed (tok/s)]

Based on the above experiments, I think the optimal combination is RoPE + gated-gelu, in terms of both convergence and training/inference speed (tok/s).

Also, as far as I understand, RoPE embeddings and GLU activation variants are now effectively standard and are used in the most modern models.

Edited: fixed mistakes. I'll also add that I tried silu activation, but it turned out worse than gelu or relu; I assume there is a beta parameter that needs tuning, and silu in this case is better suited for converting existing LLMs that use silu-type activation (SwiGLU).
And there seems to be some confusion in the names: gated silu/swish should be called SwiGLU, if I'm not mistaken.
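For completeness, here is a minimal NumPy sketch of what RoPE does to a query/key matrix. The `rope` helper and the split-halves pairing are illustrative assumptions (other implementations interleave adjacent channel pairs), and `base` corresponds to the RoPE theta mentioned below:

```python
import numpy as np

def rope(x, base=10000.0):
    # Rotate each (first-half, second-half) channel pair of x, shaped
    # (seq_len, dim), by an angle proportional to the position index,
    # so relative offsets between positions become relative rotations.
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-np.arange(half) / half)               # (half,)
    angles = np.arange(seq_len)[:, None] * inv_freq[None, :]   # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Since each channel pair is only rotated, position 0 is left unchanged and vector norms are preserved.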

@vince62s
Member

vince62s commented Apr 29, 2024

There is no beta for silu. There is a RoPE theta, but it should not impact performance in NMT since it is length-related.
gated-gelu adds an extra layer, hence more parameters, so I am not sure it is 100% comparable vs non-gated gelu or relu.

@LynxPDA

LynxPDA commented Apr 29, 2024

there is no beta for silu

I apologize, I got a little mixed up. By beta I meant the coefficient of the Swish function; when beta = 1 it is equal to the SiLU function.

gated-gelu adds an extra-layer, hence more parameters so I am not sure it is 100% comparable vs non gated gelu or relu

Everything is correct here: GLU increases the number of parameters of the FF layer by 50%. I took this into account and reduced the FF layer size to offset the increase.
The total number of parameters is the same across all my models in the comparison: 457M.
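The parameter accounting can be sketched with a toy helper (not converter code; the GLU Variants paper keeps totals comparable by scaling d_ff to 2/3 of the ungated size):

```python
def ffn_params(d_model, d_ff, gated):
    # Weight counts only (biases omitted): an ungated FFN has two
    # projections (d_model->d_ff and d_ff->d_model); a GLU variant adds
    # a third d_model->d_ff projection for the gate, i.e. +50% weights.
    return (3 if gated else 2) * d_model * d_ff

base = ffn_params(512, 2048, gated=False)          # 2,097,152
gated_same = ffn_params(512, 2048, gated=True)     # 3,145,728 (+50%)
gated_matched = ffn_params(512, 1365, gated=True)  # ~base, with d_ff = 2/3 * 2048
```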

I relied on this paper: GLU Variants Improve Transformer


@vince62s
Member

Can you try converting your onmt-py model with #1687 (the RoPE one with gated-gelu) and tell me if it works?

@vince62s vince62s added enhancement New feature or request question Further information is requested labels Apr 29, 2024
@vince62s
Member

And there seems to be some confusion in the names. It seems gated silu/swish should be called SwiGLU, if I'm not mistaken.

So did you try "silu" from the onmt-py repo (hence gated silu), or basic silu?

@LynxPDA

LynxPDA commented Apr 29, 2024

Can you try converting your onmt-py model with #1687 (the RoPE one with gated-gelu) and tell me if it works?

Converting to ctranslate2
Traceback (most recent call last):
  File "/mnt/DeepLearning/Locomotive/venv/bin/ct2-opennmt-py-converter", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 375, in main
    OpenNMTPyConverter(args.model_path).convert_from_args(args)
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
    return self.convert(
           ^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/converter.py", line 89, in convert
    model_spec = self._load()
                 ^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 201, in _load
    check_opt(checkpoint["opt"], num_source_embeddings=len(src_vocabs))
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 57, in check_opt
    check.validate()
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/utils.py", line 106, in validate
    raise_unsupported(self._unsupported_reasons)
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/utils.py", line 93, in raise_unsupported
    raise ValueError(message)
ValueError: The model you are trying to convert is not supported by CTranslate2. We identified the following reasons:

- Option --pos_ffn_activation_fn gated-gelu is not supported (supported activations are: gelu, fast_gelu, relu, silu)

I tried to convert a micro model but got the above error; maybe I need to add gated-gelu to _SUPPORTED_ACTIVATIONS:

_SUPPORTED_ACTIVATIONS = {
    "gelu": common_spec.Activation.GELU,
    "fast_gelu": common_spec.Activation.GELUTanh,
    "relu": common_spec.Activation.RELU,
    "silu": common_spec.Activation.SWISH,
    "gated-gelu": common_spec.Activation.GELU,
}

Now a new error has appeared:

Traceback (most recent call last):
  File "/mnt/DeepLearning/Locomotive/venv/bin/ct2-opennmt-py-converter", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 376, in main
    OpenNMTPyConverter(args.model_path).convert_from_args(args)
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/converter.py", line 50, in convert_from_args
    return self.convert(
           ^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/converter.py", line 89, in convert
    model_spec = self._load()
                 ^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 221, in _load
    return _get_model_spec_seq2seq(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/converters/opennmt_py.py", line 89, in _get_model_spec_seq2seq
    model_spec = transformer_spec.TransformerSpec.from_config(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/transformer_spec.py", line 481, in from_config
    encoder = TransformerEncoderSpec(
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/model_spec.py", line 84, in __call__
    instance = super().__call__(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/transformer_spec.py", line 107, in __init__
    self.layer = [
                 ^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/transformer_spec.py", line 108, in <listcomp>
    TransformerEncoderLayerSpec(
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/model_spec.py", line 84, in __call__
    instance = super().__call__(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/transformer_spec.py", line 294, in __init__
    self.self_attention = attention_spec.MultiHeadAttentionSpec(
                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/model_spec.py", line 84, in __call__
    instance = super().__call__(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: MultiHeadAttentionSpec.__init__() got an unexpected keyword argument 'head_dim'

@LynxPDA

LynxPDA commented Apr 29, 2024

And there seems to be some confusion in the names. It seems gated silu/swish should be called SwiGLU, if I'm not mistaken.

so did you try "silu" of onmt-py repo (hence gated silu) or basic silu ?

I tried "silu" from the onmt-py repo (hence gated silu).

@vince62s
Member

TypeError: MultiHeadAttentionSpec.init() got an unexpected keyword argument 'head_dim'

This one does not make sense. Check your /mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/attention_spec.py file
and let me know if it's up to date with head_dim in the arguments.

@LynxPDA

LynxPDA commented Apr 29, 2024

TypeError: MultiHeadAttentionSpec.init() got an unexpected keyword argument 'head_dim'

this one does not make sense. check your /mnt/DeepLearning/Locomotive/venv/lib/python3.11/site-packages/ctranslate2/specs/attention_spec.py file and let me know it's up to date with head_dim in the arguments.

I think I got it: I had just added the two files from the PR to my venv, but my version of CTranslate2 was pretty far behind:

"""Version information."""

__version__ = "3.20.0"

I updated CTranslate2 and the conversion completed without errors.
The only thing is that I kept my change adding the "gated-gelu": common_spec.Activation.GELU line to _SUPPORTED_ACTIVATIONS; without it, it still gave the error:
- Option --pos_ffn_activation_fn gated-gelu is not supported (supported activations are: gelu, fast_gelu, relu, silu)

However, I cannot yet confirm that a fully trained converted model works, since I only tested on a "100-step toy model." I had removed the weights of the previous RoPE + gated-gelu model and put the RPE + gated-gelu model (purple in the graphs above) into long-term training, since I did not expect this feature to be implemented so quickly. Thank you!
I will train the next model with RoPE!

@vince62s
Member

I trained a gelu and a gated-gelu "base transformer" and I am not getting the same training curves as yours:
both are very similar.
Not sure why you are seeing such a gap between the two; maybe it is because of a bigger model.

As for validation there is a small improvement.

Here are some logs:

Gated-Gelu (with RoPE) ffn 512-2048

[2024-04-29 22:00:57,108 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24936, 512, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
          (w_3): Linear(in_features=512, out_features=2048, bias=True)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24936, 512, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=2048, bias=True)
          (w_2): Linear(in_features=2048, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
          (w_3): Linear(in_features=512, out_features=2048, bias=True)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=512, out_features=24936, bias=True)
)
[2024-04-29 22:00:57,110 INFO] encoder: 37974016
[2024-04-29 22:00:57,110 INFO] decoder: 31529320
[2024-04-29 22:00:57,110 INFO] * number of parameters: 69503336
[2024-04-29 22:00:57,111 INFO] Trainable parameters = {'torch.float32': 0, 'torch.float16': 69503336, 'torch.uint8': 0, 'torch.int8': 0}
[2024-04-29 22:00:57,111 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2024-04-29 22:00:57,111 INFO]  * src vocab size = 24936
[2024-04-29 22:00:57,111 INFO]  * tgt vocab size = 24936
[2024-04-29 22:00:58,342 INFO] Starting training on GPU: [0]
[2024-04-29 22:00:58,342 INFO] Start training loop and validate every 10000 steps...
[2024-04-29 22:00:58,342 INFO] Scoring with: ['onmt_tokenize', 'prefix']
[2024-04-29 22:02:34,363 INFO] Step 100/100000; acc: 13.5; ppl: 13661.8; xent: 9.5; lr: 0.00002; sents:  201720; bsz: 6070/7530/252; 50571/62737 tok/s;     96 sec;
[2024-04-29 22:03:18,855 INFO] Step 200/100000; acc: 20.8; ppl: 3530.7; xent: 8.2; lr: 0.00004; sents:  211218; bsz: 5995/7561/264; 107791/135951 tok/s;    141 sec;
[2024-04-29 22:04:03,232 INFO] Step 300/100000; acc: 23.5; ppl: 1057.1; xent: 7.0; lr: 0.00006; sents:  200048; bsz: 5987/7467/250; 107929/134621 tok/s;    185 sec;
[2024-04-29 22:04:48,137 INFO] Step 400/100000; acc: 27.6; ppl: 502.3; xent: 6.2; lr: 0.00008; sents:  200686; bsz: 6027/7502/251; 107381/133649 tok/s;    230 sec;
[2024-04-29 22:05:32,535 INFO] Step 500/100000; acc: 29.6; ppl: 347.2; xent: 5.8; lr: 0.00010; sents:  193882; bsz: 6056/7513/242; 109117/135378 tok/s;    274 sec;
[2024-04-29 22:06:16,951 INFO] Step 600/100000; acc: 31.1; ppl: 278.8; xent: 5.6; lr: 0.00011; sents:  193073; bsz: 6007/7539/241; 108203/135799 tok/s;    319 sec;
[2024-04-29 22:07:01,509 INFO] Step 700/100000; acc: 32.6; ppl: 223.7; xent: 5.4; lr: 0.00013; sents:  190356; bsz: 6039/7464/238; 108432/134009 tok/s;    363 sec;
[2024-04-29 22:07:46,058 INFO] Step 800/100000; acc: 35.5; ppl: 173.2; xent: 5.2; lr: 0.00015; sents:  210868; bsz: 6054/7556/264; 108721/135686 tok/s;    408 sec;
[2024-04-29 22:08:30,814 INFO] Step 900/100000; acc: 38.6; ppl: 132.8; xent: 4.9; lr: 0.00017; sents:  198438; bsz: 5989/7526/248; 107045/134531 tok/s;    452 sec;
[2024-04-29 22:09:15,645 INFO] Step 1000/100000; acc: 44.7; ppl:  86.6; xent: 4.5; lr: 0.00019; sents:  200431; bsz: 6094/7534/251; 108752/134451 tok/s;    497 sec;
[2024-04-29 22:09:59,881 INFO] Step 1100/100000; acc: 52.1; ppl:  54.7; xent: 4.0; lr: 0.00021; sents:  217880; bsz: 6158/7564/272; 111363/136797 tok/s;    542 sec;
[2024-04-29 22:10:43,826 INFO] Step 1200/100000; acc: 56.5; ppl:  41.3; xent: 3.7; lr: 0.00023; sents:  186685; bsz: 5866/7445/233; 106794/135533 tok/s;    585 sec;
[2024-04-29 22:11:28,023 INFO] Step 1300/100000; acc: 61.0; ppl:  31.0; xent: 3.4; lr: 0.00025; sents:  197183; bsz: 6036/7547/246; 109251/136613 tok/s;    630 sec;
[2024-04-29 22:12:12,367 INFO] Step 1400/100000; acc: 64.6; ppl:  24.6; xent: 3.2; lr: 0.00027; sents:  208573; bsz: 6166/7561/261; 111235/136414 tok/s;    674 sec;
[2024-04-29 22:12:57,238 INFO] Step 1500/100000; acc: 66.8; ppl:  21.3; xent: 3.1; lr: 0.00029; sents:  195141; bsz: 6047/7501/244; 107810/133739 tok/s;    719 sec;
[2024-04-29 22:13:41,251 INFO] Step 1600/100000; acc: 68.7; ppl:  18.7; xent: 2.9; lr: 0.00030; sents:  201155; bsz: 5999/7544/251; 109051/137131 tok/s;    763 sec;
[2024-04-29 22:14:25,293 INFO] Step 1700/100000; acc: 70.0; ppl:  17.0; xent: 2.8; lr: 0.00032; sents:  196365; bsz: 5953/7421/245; 108137/134795 tok/s;    807 sec;
[2024-04-29 22:15:09,341 INFO] Step 1800/100000; acc: 71.1; ppl:  15.8; xent: 2.8; lr: 0.00034; sents:  195200; bsz: 6032/7548/244; 109554/137082 tok/s;    851 sec;
[2024-04-29 22:15:53,898 INFO] Step 1900/100000; acc: 72.2; ppl:  14.7; xent: 2.7; lr: 0.00036; sents:  198988; bsz: 6072/7494/249; 109227/134820 tok/s;    896 sec;
[2024-04-29 22:16:38,519 INFO] Step 2000/100000; acc: 73.0; ppl:  13.9; xent: 2.6; lr: 0.00038; sents:  196605; bsz: 6023/7509/246; 107988/134626 tok/s;    940 sec;
[2024-04-29 22:17:22,734 INFO] Step 2100/100000; acc: 73.9; ppl:  13.1; xent: 2.6; lr: 0.00040; sents:  208786; bsz: 6010/7492/261; 108747/135557 tok/s;    984 sec;
[2024-04-29 22:18:06,966 INFO] Step 2200/100000; acc: 74.3; ppl:  12.7; xent: 2.5; lr: 0.00042; sents:  189905; bsz: 6033/7559/237; 109123/136706 tok/s;   1029 sec;
[2024-04-29 22:18:50,887 INFO] Step 2300/100000; acc: 74.8; ppl:  12.3; xent: 2.5; lr: 0.00044; sents:  204467; bsz: 6033/7497/256; 109881/136552 tok/s;   1073 sec;
[2024-04-29 22:19:35,418 INFO] Step 2400/100000; acc: 75.6; ppl:  11.8; xent: 2.5; lr: 0.00046; sents:  203899; bsz: 6060/7568/255; 108872/135962 tok/s;   1117 sec;
[2024-04-29 22:20:30,771 INFO] Step 2500/100000; acc: 75.8; ppl:  11.5; xent: 2.4; lr: 0.00048; sents:  198046; bsz: 6005/7474/248; 86782/108025 tok/s;   1172 sec;
[2024-04-29 22:21:15,099 INFO] Step 2600/100000; acc: 76.4; ppl:  11.2; xent: 2.4; lr: 0.00049; sents:  206377; bsz: 6034/7526/258; 108891/135825 tok/s;   1217 sec;
[2024-04-29 22:22:01,513 INFO] Step 2700/100000; acc: 76.6; ppl:  11.0; xent: 2.4; lr: 0.00051; sents:  201792; bsz: 5930/7474/252; 102219/128830 tok/s;   1263 sec;
[2024-04-29 22:22:45,486 INFO] Step 2800/100000; acc: 76.9; ppl:  10.8; xent: 2.4; lr: 0.00053; sents:  196737; bsz: 5976/7478/246; 108721/136049 tok/s;   1307 sec;
[2024-04-29 22:23:29,688 INFO] Step 2900/100000; acc: 77.2; ppl:  10.6; xent: 2.4; lr: 0.00055; sents:  205984; bsz: 6054/7513/257; 109562/135968 tok/s;   1351 sec;
[2024-04-29 22:24:14,016 INFO] Step 3000/100000; acc: 77.4; ppl:  10.5; xent: 2.3; lr: 0.00057; sents:  192816; bsz: 6041/7482/241; 109027/135023 tok/s;   1396 sec;
[2024-04-29 22:24:58,570 INFO] Step 3100/100000; acc: 78.0; ppl:  10.1; xent: 2.3; lr: 0.00059; sents:  203300; bsz: 6054/7495/254; 108703/134574 tok/s;   1440 sec;
[2024-04-29 22:25:42,683 INFO] Step 3200/100000; acc: 78.1; ppl:  10.1; xent: 2.3; lr: 0.00061; sents:  192996; bsz: 6041/7523/241; 109557/136433 tok/s;   1484 sec;
[2024-04-29 22:26:26,736 INFO] Step 3300/100000; acc: 78.1; ppl:  10.0; xent: 2.3; lr: 0.00063; sents:  194055; bsz: 5963/7486/243; 108282/135947 tok/s;   1528 sec;
[2024-04-29 22:27:10,624 INFO] Step 3400/100000; acc: 78.4; ppl:   9.9; xent: 2.3; lr: 0.00065; sents:  202468; bsz: 5956/7499/253; 108566/136692 tok/s;   1572 sec;
[2024-04-29 22:27:56,523 INFO] Step 3500/100000; acc: 78.5; ppl:   9.8; xent: 2.3; lr: 0.00067; sents:  201227; bsz: 6097/7473/252; 106263/130257 tok/s;   1618 sec;
[2024-04-29 22:28:41,147 INFO] Step 3600/100000; acc: 78.7; ppl:   9.7; xent: 2.3; lr: 0.00068; sents:  207933; bsz: 6020/7542/260; 107920/135204 tok/s;   1663 sec;
[2024-04-29 22:29:25,382 INFO] Step 3700/100000; acc: 78.8; ppl:   9.6; xent: 2.3; lr: 0.00070; sents:  194469; bsz: 6027/7513/243; 108992/135882 tok/s;   1707 sec;
[2024-04-29 22:30:09,452 INFO] Step 3800/100000; acc: 78.7; ppl:   9.7; xent: 2.3; lr: 0.00072; sents:  202157; bsz: 5969/7471/253; 108358/135618 tok/s;   1751 sec;
[2024-04-29 22:30:56,406 INFO] Step 3900/100000; acc: 79.2; ppl:   9.4; xent: 2.2; lr: 0.00074; sents:  207481; bsz: 6073/7560/259; 103467/128812 tok/s;   1798 sec;
[2024-04-29 22:31:40,559 INFO] Step 4000/100000; acc: 79.2; ppl:   9.4; xent: 2.2; lr: 0.00076; sents:  190034; bsz: 6063/7495/238; 109860/135792 tok/s;   1842 sec;
[2024-04-29 22:32:25,230 INFO] Step 4100/100000; acc: 79.4; ppl:   9.3; xent: 2.2; lr: 0.00078; sents:  204370; bsz: 6035/7513/255; 108079/134552 tok/s;   1887 sec;
[2024-04-29 22:33:09,185 INFO] Step 4200/100000; acc: 79.4; ppl:   9.3; xent: 2.2; lr: 0.00080; sents:  207459; bsz: 5992/7521/259; 109066/136897 tok/s;   1931 sec;
[2024-04-29 22:33:53,289 INFO] Step 4300/100000; acc: 79.7; ppl:   9.2; xent: 2.2; lr: 0.00082; sents:  195268; bsz: 6163/7537/244; 111792/136715 tok/s;   1975 sec;
[2024-04-29 22:34:37,016 INFO] Step 4400/100000; acc: 79.3; ppl:   9.3; xent: 2.2; lr: 0.00084; sents:  199151; bsz: 5984/7504/249; 109485/137298 tok/s;   2019 sec;
[2024-04-29 22:35:21,086 INFO] Step 4500/100000; acc: 79.7; ppl:   9.1; xent: 2.2; lr: 0.00086; sents:  200332; bsz: 5969/7525/250; 108362/136596 tok/s;   2063 sec;
[2024-04-29 22:36:05,479 INFO] Step 4600/100000; acc: 79.8; ppl:   9.1; xent: 2.2; lr: 0.00088; sents:  214139; bsz: 5933/7453/268; 106916/134302 tok/s;   2107 sec;
[2024-04-29 22:36:51,290 INFO] Step 4700/100000; acc: 79.9; ppl:   9.0; xent: 2.2; lr: 0.00089; sents:  198081; bsz: 6090/7522/248; 106355/131368 tok/s;   2153 sec;
[2024-04-29 22:37:35,540 INFO] Step 4800/100000; acc: 80.1; ppl:   8.9; xent: 2.2; lr: 0.00091; sents:  206503; bsz: 6073/7567/258; 109803/136798 tok/s;   2197 sec;
[2024-04-29 22:38:19,720 INFO] Step 4900/100000; acc: 79.8; ppl:   9.0; xent: 2.2; lr: 0.00093; sents:  194923; bsz: 5990/7534/244; 108461/136416 tok/s;   2241 sec;
[2024-04-29 22:39:03,918 INFO] Step 5000/100000; acc: 80.0; ppl:   9.0; xent: 2.2; lr: 0.00095; sents:  191346; bsz: 5993/7461/239; 108475/135047 tok/s;   2286 sec;
[2024-04-29 22:39:48,575 INFO] Step 5100/100000; acc: 80.2; ppl:   8.9; xent: 2.2; lr: 0.00097; sents:  200868; bsz: 6048/7541/251; 108343/135095 tok/s;   2330 sec;
[2024-04-29 22:40:33,424 INFO] Step 5200/100000; acc: 80.3; ppl:   8.8; xent: 2.2; lr: 0.00099; sents:  204524; bsz: 6067/7481/256; 108217/133453 tok/s;   2375 sec;
[2024-04-29 22:41:17,523 INFO] Step 5300/100000; acc: 80.5; ppl:   8.8; xent: 2.2; lr: 0.00101; sents:  213585; bsz: 5999/7451/267; 108835/135165 tok/s;   2419 sec;
[2024-04-29 22:42:01,645 INFO] Step 5400/100000; acc: 80.3; ppl:   8.8; xent: 2.2; lr: 0.00103; sents:  201466; bsz: 6001/7536/252; 108798/136630 tok/s;   2463 sec;
[2024-04-29 22:42:46,093 INFO] Step 5500/100000; acc: 80.5; ppl:   8.7; xent: 2.2; lr: 0.00105; sents:  197495; bsz: 6089/7493/247; 109590/134864 tok/s;   2508 sec;
[2024-04-29 22:43:30,603 INFO] Step 5600/100000; acc: 80.5; ppl:   8.7; xent: 2.2; lr: 0.00107; sents:  207307; bsz: 6049/7535/259; 108723/135430 tok/s;   2552 sec;
[2024-04-29 22:44:15,368 INFO] Step 5700/100000; acc: 80.4; ppl:   8.7; xent: 2.2; lr: 0.00108; sents:  198250; bsz: 5997/7481/248; 107170/133701 tok/s;   2597 sec;
[2024-04-29 22:44:59,687 INFO] Step 5800/100000; acc: 80.6; ppl:   8.7; xent: 2.2; lr: 0.00110; sents:  192640; bsz: 6045/7551/241; 109112/136295 tok/s;   2641 sec;
[2024-04-29 22:45:43,889 INFO] Step 5900/100000; acc: 80.7; ppl:   8.6; xent: 2.2; lr: 0.00112; sents:  202917; bsz: 5964/7489/254; 107946/135548 tok/s;   2686 sec;
[2024-04-29 22:46:28,086 INFO] Step 6000/100000; acc: 80.6; ppl:   8.6; xent: 2.2; lr: 0.00114; sents:  211222; bsz: 6060/7575/264; 109689/137123 tok/s;   2730 sec;
[2024-04-29 22:47:12,493 INFO] Step 6100/100000; acc: 80.7; ppl:   8.6; xent: 2.1; lr: 0.00113; sents:  204155; bsz: 5972/7402/255; 107589/133347 tok/s;   2774 sec;
[2024-04-29 22:47:57,294 INFO] Step 6200/100000; acc: 80.8; ppl:   8.5; xent: 2.1; lr: 0.00112; sents:  197101; bsz: 6053/7556/246; 108091/134925 tok/s;   2819 sec;
[2024-04-29 22:48:41,458 INFO] Step 6300/100000; acc: 80.9; ppl:   8.5; xent: 2.1; lr: 0.00111; sents:  197635; bsz: 6018/7504/247; 109009/135928 tok/s;   2863 sec;
[2024-04-29 22:49:25,740 INFO] Step 6400/100000; acc: 81.1; ppl:   8.4; xent: 2.1; lr: 0.00110; sents:  201422; bsz: 6136/7556/252; 110859/136510 tok/s;   2907 sec;
[2024-04-29 22:50:09,821 INFO] Step 6500/100000; acc: 81.0; ppl:   8.5; xent: 2.1; lr: 0.00110; sents:  197942; bsz: 5961/7485/247; 108180/135841 tok/s;   2951 sec;
[2024-04-29 22:50:54,103 INFO] Step 6600/100000; acc: 81.1; ppl:   8.4; xent: 2.1; lr: 0.00109; sents:  200575; bsz: 6047/7492/251; 109252/135356 tok/s;   2996 sec;
[2024-04-29 22:51:39,045 INFO] Step 6700/100000; acc: 81.2; ppl:   8.4; xent: 2.1; lr: 0.00108; sents:  201314; bsz: 6002/7489/252; 106844/133304 tok/s;   3041 sec;
[2024-04-29 22:52:23,626 INFO] Step 6800/100000; acc: 81.3; ppl:   8.3; xent: 2.1; lr: 0.00107; sents:  204925; bsz: 6026/7552/256; 108132/135516 tok/s;   3085 sec;
[2024-04-29 22:53:07,907 INFO] Step 6900/100000; acc: 81.4; ppl:   8.3; xent: 2.1; lr: 0.00106; sents:  204869; bsz: 6039/7520/256; 109114/135871 tok/s;   3130 sec;
[2024-04-29 22:53:52,018 INFO] Step 7000/100000; acc: 81.3; ppl:   8.3; xent: 2.1; lr: 0.00106; sents:  191226; bsz: 5913/7411/239; 107243/134407 tok/s;   3174 sec;
[2024-04-29 22:54:36,186 INFO] Step 7100/100000; acc: 81.5; ppl:   8.2; xent: 2.1; lr: 0.00105; sents:  206385; bsz: 5986/7425/258; 108431/134481 tok/s;   3218 sec;
[2024-04-29 22:55:21,216 INFO] Step 7200/100000; acc: 81.7; ppl:   8.1; xent: 2.1; lr: 0.00104; sents:  204492; bsz: 6101/7596/256; 108399/134956 tok/s;   3263 sec;
[2024-04-29 22:56:05,899 INFO] Step 7300/100000; acc: 81.6; ppl:   8.2; xent: 2.1; lr: 0.00103; sents:  201533; bsz: 6044/7545/252; 108221/135080 tok/s;   3308 sec;
[2024-04-29 22:56:50,116 INFO] Step 7400/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00103; sents:  197938; bsz: 6042/7527/247; 109311/136178 tok/s;   3352 sec;
[2024-04-29 22:57:34,318 INFO] Step 7500/100000; acc: 81.4; ppl:   8.2; xent: 2.1; lr: 0.00102; sents:  202339; bsz: 6017/7513/253; 108900/135969 tok/s;   3396 sec;
[2024-04-29 22:58:18,594 INFO] Step 7600/100000; acc: 81.7; ppl:   8.1; xent: 2.1; lr: 0.00101; sents:  191350; bsz: 5974/7519/239; 107948/135868 tok/s;   3440 sec;
[2024-04-29 22:59:03,392 INFO] Step 7700/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00101; sents:  210811; bsz: 6088/7542/264; 108711/134692 tok/s;   3485 sec;
[2024-04-29 22:59:48,201 INFO] Step 7800/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00100; sents:  205768; bsz: 6062/7488/257; 108226/133687 tok/s;   3530 sec;
[2024-04-29 23:00:32,294 INFO] Step 7900/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00099; sents:  210300; bsz: 5997/7505/263; 108806/136162 tok/s;   3574 sec;
[2024-04-29 23:01:16,517 INFO] Step 8000/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00099; sents:  208598; bsz: 6051/7530/261; 109461/136213 tok/s;   3618 sec;
[2024-04-29 23:02:00,821 INFO] Step 8100/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00098; sents:  192413; bsz: 6077/7517/241; 109739/135728 tok/s;   3662 sec;
[2024-04-29 23:02:45,287 INFO] Step 8200/100000; acc: 82.1; ppl:   8.0; xent: 2.1; lr: 0.00098; sents:  194882; bsz: 6003/7447/244; 107997/133979 tok/s;   3707 sec;
[2024-04-29 23:03:30,080 INFO] Step 8300/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00097; sents:  204864; bsz: 5987/7518/256; 106934/134279 tok/s;   3752 sec;
[2024-04-29 23:04:14,143 INFO] Step 8400/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00096; sents:  193079; bsz: 5916/7466/241; 107406/135554 tok/s;   3796 sec;
[2024-04-29 23:04:58,516 INFO] Step 8500/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00096; sents:  209459; bsz: 6036/7490/262; 108818/135037 tok/s;   3840 sec;
[2024-04-29 23:05:42,722 INFO] Step 8600/100000; acc: 82.1; ppl:   7.9; xent: 2.1; lr: 0.00095; sents:  204029; bsz: 6020/7510/255; 108943/135918 tok/s;   3884 sec;
[2024-04-29 23:06:27,100 INFO] Step 8700/100000; acc: 82.3; ppl:   7.9; xent: 2.1; lr: 0.00095; sents:  211731; bsz: 6073/7561/265; 109482/136302 tok/s;   3929 sec;
[2024-04-29 23:07:11,951 INFO] Step 8800/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00094; sents:  200945; bsz: 6007/7506/251; 107151/133878 tok/s;   3974 sec;
[2024-04-29 23:07:56,177 INFO] Step 8900/100000; acc: 82.1; ppl:   7.9; xent: 2.1; lr: 0.00094; sents:  182456; bsz: 6029/7477/228; 109064/135253 tok/s;   4018 sec;
[2024-04-29 23:08:40,303 INFO] Step 9000/100000; acc: 82.4; ppl:   7.8; xent: 2.1; lr: 0.00093; sents:  203504; bsz: 5999/7442/254; 108771/134924 tok/s;   4062 sec;
[2024-04-29 23:09:24,683 INFO] Step 9100/100000; acc: 82.3; ppl:   7.8; xent: 2.1; lr: 0.00093; sents:  199770; bsz: 6029/7522/250; 108675/135598 tok/s;   4106 sec;
[2024-04-29 23:10:09,133 INFO] Step 9200/100000; acc: 82.3; ppl:   7.9; xent: 2.1; lr: 0.00092; sents:  197339; bsz: 6054/7555/247; 108960/135982 tok/s;   4151 sec;
[2024-04-29 23:10:54,061 INFO] Step 9300/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00092; sents:  206232; bsz: 6000/7512/258; 106829/133758 tok/s;   4196 sec;
[2024-04-29 23:11:38,486 INFO] Step 9400/100000; acc: 82.4; ppl:   7.8; xent: 2.1; lr: 0.00091; sents:  206444; bsz: 6054/7507/258; 109024/135190 tok/s;   4240 sec;
[2024-04-29 23:12:22,806 INFO] Step 9500/100000; acc: 82.4; ppl:   7.8; xent: 2.1; lr: 0.00091; sents:  194757; bsz: 6074/7464/243; 109646/134732 tok/s;   4284 sec;
[2024-04-29 23:13:06,933 INFO] Step 9600/100000; acc: 82.3; ppl:   7.8; xent: 2.1; lr: 0.00090; sents:  188490; bsz: 5990/7550/236; 108603/136879 tok/s;   4329 sec;
[2024-04-29 23:13:51,148 INFO] Step 9700/100000; acc: 82.6; ppl:   7.7; xent: 2.0; lr: 0.00090; sents:  210488; bsz: 6027/7477/263; 109045/135284 tok/s;   4373 sec;
[2024-04-29 23:14:35,716 INFO] Step 9800/100000; acc: 82.3; ppl:   7.8; xent: 2.1; lr: 0.00089; sents:  210374; bsz: 6029/7548/263; 108222/135486 tok/s;   4417 sec;
[2024-04-29 23:15:20,225 INFO] Step 9900/100000; acc: 82.5; ppl:   7.8; xent: 2.0; lr: 0.00089; sents:  200842; bsz: 5976/7504/251; 107420/134870 tok/s;   4462 sec;
[2024-04-29 23:16:04,390 INFO] Step 10000/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00088; sents:  206167; bsz: 6074/7558/258; 110016/136909 tok/s;   4506 sec;
[2024-04-29 23:16:59,026 INFO] valid stats calculation
                           took: 54.634299755096436 s.
[2024-04-29 23:17:05,621 INFO] The translation of the valid dataset for dynamic scoring
                               took : 6.594810962677002 s.
[2024-04-29 23:17:05,622 INFO] UPDATING VALIDATION BLEU
[2024-04-29 23:17:05,795 INFO] validation BLEU: 26.226248756880565
[2024-04-29 23:17:05,796 INFO] Train perplexity: 14.5944
[2024-04-29 23:17:05,796 INFO] Train accuracy: 73.719
[2024-04-29 23:17:05,796 INFO] Sentences processed: 2.01002e+07
[2024-04-29 23:17:05,796 INFO] Average bsz: 6026/7510/251
[2024-04-29 23:17:05,796 INFO] Validation perplexity: 9.78329
[2024-04-29 23:17:05,796 INFO] Validation accuracy: 76.8894
[2024-04-29 23:17:05,800 INFO] Saving checkpoint /media/vincent/Crucial X6/NMT_work/en-de/runs/6-6-8-512-2048/6-6-8-512-2048-glu_step_10000.pt
[2024-04-29 23:17:51,851 INFO] Step 10100/100000; acc: 82.6; ppl:   7.7; xent: 2.0; lr: 0.00088; sents:  204932; bsz: 6039/7497/256; 44956/55814 tok/s;   4614 sec;
[2024-04-29 23:18:35,552 INFO] Step 10200/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00088; sents:  197642; bsz: 5992/7521/247; 109692/137685 tok/s;   4657 sec;
[2024-04-29 23:19:19,616 INFO] Step 10300/100000; acc: 82.8; ppl:   7.7; xent: 2.0; lr: 0.00087; sents:  199151; bsz: 6054/7474/249; 109911/135686 tok/s;   4701 sec;
[2024-04-29 23:20:04,017 INFO] Step 10400/100000; acc: 82.6; ppl:   7.7; xent: 2.0; lr: 0.00087; sents:  200405; bsz: 5945/7488/251; 107112/134910 tok/s;   4746 sec;
[2024-04-29 23:20:47,948 INFO] Step 10500/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00086; sents:  190713; bsz: 6122/7568/238; 111481/137823 tok/s;   4790 sec;
[2024-04-29 23:21:31,628 INFO] Step 10600/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00086; sents:  200841; bsz: 6017/7572/251; 110204/138689 tok/s;   4833 sec;
[2024-04-29 23:22:15,500 INFO] Step 10700/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00085; sents:  193629; bsz: 6076/7531/242; 110800/137319 tok/s;   4877 sec;
[2024-04-29 23:22:59,455 INFO] Step 10800/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00085; sents:  196285; bsz: 6014/7481/245; 109456/136165 tok/s;   4921 sec;
[2024-04-29 23:23:43,731 INFO] Step 10900/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00085; sents:  208474; bsz: 5998/7496/261; 108374/135447 tok/s;   4965 sec;
[2024-04-29 23:24:27,412 INFO] Step 11000/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00084; sents:  202466; bsz: 5958/7451/253; 109117/136461 tok/s;   5009 sec;
[2024-04-29 23:25:11,128 INFO] Step 11100/100000; acc: 83.0; ppl:   7.5; xent: 2.0; lr: 0.00084; sents:  200782; bsz: 6047/7549/251; 110665/138153 tok/s;   5053 sec;
[2024-04-29 23:25:54,963 INFO] Step 11200/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00084; sents:  195963; bsz: 6089/7555/245; 111124/137890 tok/s;   5097 sec;
[2024-04-29 23:26:38,942 INFO] Step 11300/100000; acc: 83.0; ppl:   7.6; xent: 2.0; lr: 0.00083; sents:  206312; bsz: 6072/7539/258; 110448/137132 tok/s;   5141 sec;
[2024-04-29 23:27:23,033 INFO] Step 11400/100000; acc: 83.0; ppl:   7.6; xent: 2.0; lr: 0.00083; sents:  203784; bsz: 5947/7405/255; 107908/134366 tok/s;   5185 sec;
[2024-04-29 23:28:06,767 INFO] Step 11500/100000; acc: 83.0; ppl:   7.6; xent: 2.0; lr: 0.00082; sents:  193775; bsz: 6016/7502/242; 110049/137239 tok/s;   5228 sec;
[2024-04-29 23:28:50,579 INFO] Step 11600/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00082; sents:  205746; bsz: 6095/7520/257; 111290/137321 tok/s;   5272 sec;
[2024-04-29 23:29:34,205 INFO] Step 11700/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00082; sents:  212557; bsz: 5975/7498/266; 109571/137492 tok/s;   5316 sec;
[2024-04-29 23:30:18,044 INFO] Step 11800/100000; acc: 83.0; ppl:   7.5; xent: 2.0; lr: 0.00081; sents:  193935; bsz: 6052/7558/242; 110435/137930 tok/s;   5360 sec;
[2024-04-29 23:31:02,277 INFO] Step 11900/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00081; sents:  197399; bsz: 5989/7470/247; 108309/135104 tok/s;   5404 sec;
[2024-04-29 23:31:46,396 INFO] Step 12000/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00081; sents:  196812; bsz: 6060/7548/246; 109894/136863 tok/s;   5448 sec;
[2024-04-29 23:32:30,144 INFO] Step 12100/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00080; sents:  209173; bsz: 6133/7572/261; 112157/138468 tok/s;   5492 sec;
[2024-04-29 23:33:13,796 INFO] Step 12200/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00080; sents:  201492; bsz: 5961/7481/252; 109246/137100 tok/s;   5535 sec;
[2024-04-29 23:33:57,476 INFO] Step 12300/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00080; sents:  195067; bsz: 5973/7527/244; 109406/137855 tok/s;   5579 sec;
[2024-04-29 23:34:41,570 INFO] Step 12400/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00079; sents:  210660; bsz: 6045/7523/263; 109667/136497 tok/s;   5623 sec;
[2024-04-29 23:35:25,923 INFO] Step 12500/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00079; sents:  194107; bsz: 6023/7461/243; 108630/134576 tok/s;   5668 sec;
[2024-04-29 23:36:09,570 INFO] Step 12600/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00079; sents:  192615; bsz: 5964/7474/241; 109309/136986 tok/s;   5711 sec;
[2024-04-29 23:36:53,352 INFO] Step 12700/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00078; sents:  199159; bsz: 6093/7479/249; 111335/136671 tok/s;   5755 sec;
[2024-04-29 23:37:37,122 INFO] Step 12800/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00078; sents:  206690; bsz: 6068/7539/258; 110905/137801 tok/s;   5799 sec;
[2024-04-29 23:38:21,186 INFO] Step 12900/100000; acc: 83.2; ppl:   7.4; xent: 2.0; lr: 0.00078; sents:  190689; bsz: 6032/7566/238; 109512/137364 tok/s;   5843 sec;
[2024-04-29 23:39:05,430 INFO] Step 13000/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00078; sents:  201053; bsz: 5918/7459/251; 107003/134877 tok/s;   5887 sec;
[2024-04-29 23:39:49,282 INFO] Step 13100/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00077; sents:  206031; bsz: 6106/7553/258; 111402/137787 tok/s;   5931 sec;
[2024-04-29 23:40:32,966 INFO] Step 13200/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00077; sents:  218555; bsz: 6019/7465/273; 110222/136702 tok/s;   5975 sec;
[2024-04-29 23:41:16,812 INFO] Step 13300/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00077; sents:  189343; bsz: 6050/7594/237; 110392/138552 tok/s;   6018 sec;
[2024-04-29 23:42:00,824 INFO] Step 13400/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  193520; bsz: 5966/7453/242; 108451/135475 tok/s;   6062 sec;
[2024-04-29 23:42:45,281 INFO] Step 13500/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  198386; bsz: 6012/7488/248; 108190/134747 tok/s;   6107 sec;
[2024-04-29 23:43:29,023 INFO] Step 13600/100000; acc: 83.2; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  199194; bsz: 6049/7593/249; 110624/138863 tok/s;   6151 sec;
[2024-04-29 23:44:12,713 INFO] Step 13700/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  201305; bsz: 6023/7459/252; 110292/136588 tok/s;   6194 sec;
[2024-04-29 23:44:56,412 INFO] Step 13800/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00075; sents:  191024; bsz: 6014/7514/239; 110098/137556 tok/s;   6238 sec;
[2024-04-29 23:45:40,156 INFO] Step 13900/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00075; sents:  195971; bsz: 5977/7502/245; 109307/137195 tok/s;   6282 sec;
[2024-04-29 23:46:24,309 INFO] Step 14000/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00075; sents:  208361; bsz: 6106/7521/260; 110638/136265 tok/s;   6326 sec;
[2024-04-29 23:47:08,506 INFO] Step 14100/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00074; sents:  208694; bsz: 6012/7529/261; 108823/136275 tok/s;   6370 sec;
[2024-04-29 23:47:52,181 INFO] Step 14200/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00074; sents:  196704; bsz: 5943/7500/246; 108866/137381 tok/s;   6414 sec;
[2024-04-29 23:48:36,138 INFO] Step 14300/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00074; sents:  198244; bsz: 6135/7504/248; 111655/136565 tok/s;   6458 sec;
[2024-04-29 23:49:19,805 INFO] Step 14400/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00074; sents:  195187; bsz: 5994/7526/244; 109818/137887 tok/s;   6501 sec;
[2024-04-29 23:50:03,702 INFO] Step 14500/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00073; sents:  214335; bsz: 5983/7487/268; 109033/136440 tok/s;   6545 sec;
[2024-04-29 23:50:47,929 INFO] Step 14600/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00073; sents:  201405; bsz: 6082/7521/252; 110021/136050 tok/s;   6590 sec;
[2024-04-29 23:51:31,686 INFO] Step 14700/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00073; sents:  202122; bsz: 6049/7512/253; 110597/137339 tok/s;   6633 sec;
[2024-04-29 23:52:15,385 INFO] Step 14800/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00073; sents:  199888; bsz: 5971/7562/250; 109318/138434 tok/s;   6677 sec;
[2024-04-29 23:52:59,291 INFO] Step 14900/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00072; sents:  193799; bsz: 6107/7509/242; 111275/136820 tok/s;   6721 sec;
[2024-04-29 23:53:43,360 INFO] Step 15000/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00072; sents:  196345; bsz: 6083/7462/245; 110431/135453 tok/s;   6765 sec;
[2024-04-29 23:53:43,363 INFO] Updated dropout/attn dropout to 0.100000 0.000000 at step 15001
[2024-04-29 23:54:27,514 INFO] Step 15100/100000; acc: 83.5; ppl:   7.4; xent: 2.0; lr: 0.00072; sents:  200738; bsz: 5938/7505/251; 107597/135975 tok/s;   6809 sec;
[2024-04-29 23:55:11,189 INFO] Step 15200/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00072; sents:  205004; bsz: 5975/7480/256; 109451/137009 tok/s;   6853 sec;
[2024-04-29 23:55:54,960 INFO] Step 15300/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  192753; bsz: 5988/7466/241; 109447/136454 tok/s;   6897 sec;
[2024-04-29 23:56:38,532 INFO] Step 15400/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  206119; bsz: 5955/7496/258; 109331/137631 tok/s;   6940 sec;
[2024-04-29 23:57:22,374 INFO] Step 15500/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  208709; bsz: 6022/7510/261; 109893/137047 tok/s;   6984 sec;
[2024-04-29 23:58:06,642 INFO] Step 15600/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  197870; bsz: 6114/7564/247; 110484/136689 tok/s;   7028 sec;
[2024-04-29 23:58:50,305 INFO] Step 15700/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  203647; bsz: 6035/7548/255; 110574/138295 tok/s;   7072 sec;
[2024-04-29 23:59:34,170 INFO] Step 15800/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  201460; bsz: 6086/7494/252; 110988/136678 tok/s;   7116 sec;
[2024-04-30 00:00:17,821 INFO] Step 15900/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  200151; bsz: 5995/7524/250; 109871/137904 tok/s;   7159 sec;
[2024-04-30 00:01:01,643 INFO] Step 16000/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  202760; bsz: 6032/7480/253; 110113/136546 tok/s;   7203 sec;
[2024-04-30 00:01:45,951 INFO] Step 16100/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  202618; bsz: 6010/7520/253; 108509/135775 tok/s;   7248 sec;
[2024-04-30 00:02:29,711 INFO] Step 16200/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00069; sents:  198067; bsz: 6032/7537/248; 110271/137790 tok/s;   7291 sec;
[2024-04-30 00:03:13,381 INFO] Step 16300/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  197980; bsz: 6047/7516/247; 110775/137684 tok/s;   7335 sec;
[2024-04-30 00:03:57,151 INFO] Step 16400/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00069; sents:  207932; bsz: 6065/7514/260; 110846/137344 tok/s;   7379 sec;
[2024-04-30 00:04:40,986 INFO] Step 16500/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00069; sents:  202317; bsz: 6076/7577/253; 110890/138291 tok/s;   7423 sec;
[2024-04-30 00:05:25,104 INFO] Step 16600/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  196965; bsz: 5958/7458/246; 108037/135239 tok/s;   7467 sec;
[2024-04-30 00:06:09,091 INFO] Step 16700/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  207064; bsz: 5981/7506/259; 108776/136518 tok/s;   7511 sec;
[2024-04-30 00:06:52,881 INFO] Step 16800/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00068; sents:  191348; bsz: 6067/7483/239; 110848/136707 tok/s;   7555 sec;
[2024-04-30 00:07:36,572 INFO] Step 16900/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  212024; bsz: 6018/7531/265; 110187/137887 tok/s;   7598 sec;
[2024-04-30 00:08:20,325 INFO] Step 17000/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  200320; bsz: 6046/7516/250; 110545/137426 tok/s;   7642 sec;
[2024-04-30 00:09:04,305 INFO] Step 17100/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00068; sents:  200510; bsz: 5989/7525/251; 108942/136890 tok/s;   7686 sec;
[2024-04-30 00:09:48,620 INFO] Step 17200/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  199841; bsz: 6032/7485/250; 108901/135123 tok/s;   7730 sec;
[2024-04-30 00:10:32,266 INFO] Step 17300/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  208980; bsz: 6020/7599/261; 110336/139290 tok/s;   7774 sec;
[2024-04-30 00:11:15,872 INFO] Step 17400/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  203835; bsz: 5955/7438/255; 109247/136459 tok/s;   7818 sec;
[2024-04-30 00:11:59,557 INFO] Step 17500/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  198740; bsz: 6014/7481/248; 110144/137007 tok/s;   7861 sec;
[2024-04-30 00:12:43,550 INFO] Step 17600/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  202580; bsz: 5961/7520/253; 108406/136747 tok/s;   7905 sec;
[2024-04-30 00:13:28,070 INFO] Step 17700/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  197938; bsz: 6132/7512/247; 110192/134981 tok/s;   7950 sec;
[2024-04-30 00:14:11,900 INFO] Step 17800/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  198167; bsz: 6059/7541/248; 110583/137636 tok/s;   7994 sec;
[2024-04-30 00:14:55,625 INFO] Step 17900/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  209727; bsz: 6011/7432/262; 109982/135975 tok/s;   8037 sec;
[2024-04-30 00:15:39,331 INFO] Step 18000/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  199678; bsz: 6025/7523/250; 110276/137706 tok/s;   8081 sec;
[2024-04-30 00:16:23,265 INFO] Step 18100/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  205470; bsz: 6044/7512/257; 110056/136788 tok/s;   8125 sec;
[2024-04-30 00:17:07,898 INFO] Step 18200/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  198623; bsz: 6031/7559/248; 108093/135484 tok/s;   8170 sec;
[2024-04-30 00:17:51,594 INFO] Step 18300/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  198913; bsz: 5988/7465/249; 109623/136666 tok/s;   8213 sec;
[2024-04-30 00:18:35,419 INFO] Step 18400/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  200702; bsz: 6066/7464/251; 110722/136251 tok/s;   8257 sec;
[2024-04-30 00:19:18,986 INFO] Step 18500/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  198832; bsz: 6007/7521/249; 110308/138098 tok/s;   8301 sec;
[2024-04-30 00:20:02,921 INFO] Step 18600/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  204525; bsz: 5991/7522/256; 109088/136957 tok/s;   8345 sec;
[2024-04-30 00:20:47,209 INFO] Step 18700/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  196480; bsz: 5998/7494/246; 108348/135372 tok/s;   8389 sec;
[2024-04-30 00:21:31,044 INFO] Step 18800/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  200451; bsz: 6049/7551/251; 110399/137814 tok/s;   8433 sec;
[2024-04-30 00:22:14,693 INFO] Step 18900/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  189173; bsz: 6010/7468/236; 110161/136876 tok/s;   8476 sec;
[2024-04-30 00:22:58,239 INFO] Step 19000/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  206220; bsz: 5944/7404/258; 109208/136030 tok/s;   8520 sec;
[2024-04-30 00:23:41,984 INFO] Step 19100/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  210848; bsz: 6053/7517/264; 110703/137462 tok/s;   8564 sec;
[2024-04-30 00:24:26,370 INFO] Step 19200/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  203448; bsz: 6063/7568/254; 109285/136402 tok/s;   8608 sec;
[2024-04-30 00:25:10,489 INFO] Step 19300/100000; acc: 84.0; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  201114; bsz: 6039/7542/251; 109502/136765 tok/s;   8652 sec;
[2024-04-30 00:25:54,328 INFO] Step 19400/100000; acc: 84.0; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  196814; bsz: 6038/7537/246; 110184/137539 tok/s;   8696 sec;
[2024-04-30 00:26:38,048 INFO] Step 19500/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  202763; bsz: 6045/7519/253; 110606/137586 tok/s;   8740 sec;
[2024-04-30 00:27:21,579 INFO] Step 19600/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  201899; bsz: 5994/7463/252; 110158/137152 tok/s;   8783 sec;
[2024-04-30 00:28:05,713 INFO] Step 19700/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  197130; bsz: 5975/7496/246; 108313/135873 tok/s;   8827 sec;
[2024-04-30 00:28:50,098 INFO] Step 19800/100000; acc: 84.0; ppl:   7.1; xent: 2.0; lr: 0.00063; sents:  210978; bsz: 6065/7508/264; 109312/135322 tok/s;   8872 sec;
[2024-04-30 00:29:33,893 INFO] Step 19900/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  200050; bsz: 6048/7556/250; 110479/138032 tok/s;   8916 sec;
[2024-04-30 00:30:17,592 INFO] Step 20000/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00062; sents:  195977; bsz: 6002/7526/245; 109876/137777 tok/s;   8959 sec;
[2024-04-30 00:30:30,747 INFO] valid stats calculation
                           took: 13.154314517974854 s.
[2024-04-30 00:30:34,724 INFO] The translation of the valid dataset for dynamic scoring
                               took : 3.975785732269287 s.
[2024-04-30 00:30:34,724 INFO] UPDATING VALIDATION BLEU
[2024-04-30 00:30:35,037 INFO] validation BLEU: 28.322220216544746
[2024-04-30 00:30:35,038 INFO] Train perplexity: 10.3578
[2024-04-30 00:30:35,038 INFO] Train accuracy: 78.595
[2024-04-30 00:30:35,038 INFO] Sentences processed: 4.01872e+07
[2024-04-30 00:30:35,038 INFO] Average bsz: 6026/7510/251
[2024-04-30 00:30:35,038 INFO] Validation perplexity: 8.72975
[2024-04-30 00:30:35,038 INFO] Validation accuracy: 78.8362

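A note on reading the columns above: the reported perplexity is simply the exponential of the reported per-token cross-entropy (`xent`), so the two columns should always agree. A minimal sketch (the helper name is mine, not an OpenNMT-py API):

```python
import math

def perplexity(xent: float) -> float:
    """Perplexity is the exponential of the per-token cross-entropy."""
    return math.exp(xent)

# xent ~ 2.0 in the later steps above corresponds to ppl ~ 7.4
print(round(perplexity(2.0), 1))
```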
And here is the non-gated GELU variant (512-3072):

[2024-04-30 09:04:18,885 INFO] NMTModel(
  (encoder): TransformerEncoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24936, 512, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): ModuleList(
      (0-5): 6 x TransformerEncoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=3072, bias=True)
          (w_2): Linear(in_features=3072, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
  )
  (decoder): TransformerDecoder(
    (embeddings): Embeddings(
      (make_embedding): Sequential(
        (emb_luts): Elementwise(
          (0): Embedding(24936, 512, padding_idx=1)
        )
      )
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
    (transformer_layers): ModuleList(
      (0-5): 6 x TransformerDecoderLayer(
        (self_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (feed_forward): PositionwiseFeedForward(
          (w_1): Linear(in_features=512, out_features=3072, bias=True)
          (w_2): Linear(in_features=3072, out_features=512, bias=True)
          (layer_norm): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
          (dropout_1): Dropout(p=0.1, inplace=False)
          (dropout_2): Dropout(p=0.1, inplace=False)
        )
        (layer_norm_1): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
        (dropout): Dropout(p=0.1, inplace=False)
        (context_attn): MultiHeadedAttention(
          (linear_keys): Linear(in_features=512, out_features=512, bias=False)
          (linear_values): Linear(in_features=512, out_features=512, bias=False)
          (linear_query): Linear(in_features=512, out_features=512, bias=False)
          (softmax): Softmax(dim=-1)
          (dropout): Dropout(p=0.0, inplace=False)
          (final_linear): Linear(in_features=512, out_features=512, bias=False)
        )
        (layer_norm_2): LayerNorm((512,), eps=1e-06, elementwise_affine=True)
      )
    )
  )
  (generator): Linear(in_features=512, out_features=24936, bias=True)
)
[2024-04-30 09:04:18,886 INFO] encoder: 37967872
[2024-04-30 09:04:18,886 INFO] decoder: 31523176
[2024-04-30 09:04:18,886 INFO] * number of parameters: 69491048
[2024-04-30 09:04:18,887 INFO] Trainable parameters = {'torch.float32': 0, 'torch.float16': 69491048, 'torch.uint8': 0, 'torch.int8': 0}
[2024-04-30 09:04:18,887 INFO] Non trainable parameters = {'torch.float32': 0, 'torch.float16': 0, 'torch.uint8': 0, 'torch.int8': 0}
[2024-04-30 09:04:18,887 INFO]  * src vocab size = 24936
[2024-04-30 09:04:18,887 INFO]  * tgt vocab size = 24936
[2024-04-30 09:04:19,980 INFO] Starting training on GPU: [0]
[2024-04-30 09:04:19,980 INFO] Start training loop and validate every 10000 steps...
[2024-04-30 09:04:19,980 INFO] Scoring with: ['onmt_tokenize', 'prefix']
[2024-04-30 09:05:55,512 INFO] Step 100/100000; acc: 14.4; ppl: 12973.6; xent: 9.5; lr: 0.00002; sents:  201720; bsz: 6070/7530/252; 50830/63058 tok/s;     96 sec;
[2024-04-30 09:06:39,063 INFO] Step 200/100000; acc: 20.4; ppl: 3522.2; xent: 8.2; lr: 0.00004; sents:  211218; bsz: 5995/7561/264; 110120/138889 tok/s;    139 sec;
[2024-04-30 09:07:22,767 INFO] Step 300/100000; acc: 23.3; ppl: 1069.7; xent: 7.0; lr: 0.00006; sents:  200048; bsz: 5987/7467/250; 109589/136692 tok/s;    183 sec;
[2024-04-30 09:08:07,178 INFO] Step 400/100000; acc: 27.2; ppl: 519.4; xent: 6.3; lr: 0.00008; sents:  200686; bsz: 6027/7502/251; 108576/135136 tok/s;    227 sec;
[2024-04-30 09:08:51,285 INFO] Step 500/100000; acc: 29.1; ppl: 358.1; xent: 5.9; lr: 0.00010; sents:  193882; bsz: 6056/7513/242; 109840/136274 tok/s;    271 sec;
[2024-04-30 09:09:35,113 INFO] Step 600/100000; acc: 30.5; ppl: 291.0; xent: 5.7; lr: 0.00011; sents:  193073; bsz: 6007/7539/241; 109652/137617 tok/s;    315 sec;
[2024-04-30 09:10:18,903 INFO] Step 700/100000; acc: 32.0; ppl: 235.9; xent: 5.5; lr: 0.00013; sents:  190356; bsz: 6039/7464/238; 110336/136362 tok/s;    359 sec;
[2024-04-30 09:11:02,720 INFO] Step 800/100000; acc: 34.7; ppl: 184.0; xent: 5.2; lr: 0.00015; sents:  210868; bsz: 6054/7556/264; 110537/137953 tok/s;    403 sec;
[2024-04-30 09:11:47,032 INFO] Step 900/100000; acc: 38.6; ppl: 134.4; xent: 4.9; lr: 0.00017; sents:  198438; bsz: 5989/7526/248; 108116/135878 tok/s;    447 sec;
[2024-04-30 09:12:31,841 INFO] Step 1000/100000; acc: 45.3; ppl:  86.4; xent: 4.5; lr: 0.00019; sents:  200431; bsz: 6094/7534/251; 108807/134519 tok/s;    492 sec;
[2024-04-30 09:13:15,915 INFO] Step 1100/100000; acc: 52.4; ppl:  55.2; xent: 4.0; lr: 0.00021; sents:  217880; bsz: 6158/7564/272; 111771/137299 tok/s;    536 sec;
[2024-04-30 09:13:59,615 INFO] Step 1200/100000; acc: 56.1; ppl:  42.5; xent: 3.7; lr: 0.00023; sents:  186685; bsz: 5866/7445/233; 107392/136292 tok/s;    580 sec;
[2024-04-30 09:14:43,501 INFO] Step 1300/100000; acc: 60.1; ppl:  32.6; xent: 3.5; lr: 0.00025; sents:  197183; bsz: 6036/7547/246; 110025/137580 tok/s;    624 sec;
[2024-04-30 09:15:27,812 INFO] Step 1400/100000; acc: 63.4; ppl:  26.2; xent: 3.3; lr: 0.00027; sents:  208573; bsz: 6166/7561/261; 111317/136515 tok/s;    668 sec;
[2024-04-30 09:16:12,423 INFO] Step 1500/100000; acc: 65.5; ppl:  22.9; xent: 3.1; lr: 0.00029; sents:  195141; bsz: 6047/7501/244; 108440/134521 tok/s;    712 sec;
[2024-04-30 09:16:56,158 INFO] Step 1600/100000; acc: 67.3; ppl:  20.2; xent: 3.0; lr: 0.00030; sents:  201155; bsz: 5999/7544/251; 109743/138000 tok/s;    756 sec;
[2024-04-30 09:17:40,010 INFO] Step 1700/100000; acc: 68.7; ppl:  18.4; xent: 2.9; lr: 0.00032; sents:  196365; bsz: 5953/7421/245; 108604/135377 tok/s;    800 sec;
[2024-04-30 09:18:23,971 INFO] Step 1800/100000; acc: 69.7; ppl:  17.1; xent: 2.8; lr: 0.00034; sents:  195200; bsz: 6032/7548/244; 109770/137352 tok/s;    844 sec;
[2024-04-30 09:19:07,897 INFO] Step 1900/100000; acc: 70.9; ppl:  15.9; xent: 2.8; lr: 0.00036; sents:  198988; bsz: 6072/7494/249; 110578/136488 tok/s;    888 sec;
[2024-04-30 09:19:52,589 INFO] Step 2000/100000; acc: 71.8; ppl:  15.0; xent: 2.7; lr: 0.00038; sents:  196605; bsz: 6023/7509/246; 107818/134414 tok/s;    933 sec;
[2024-04-30 09:20:36,770 INFO] Step 2100/100000; acc: 72.7; ppl:  14.1; xent: 2.6; lr: 0.00040; sents:  208786; bsz: 6010/7492/261; 108829/135660 tok/s;    977 sec;
[2024-04-30 09:21:20,752 INFO] Step 2200/100000; acc: 73.1; ppl:  13.7; xent: 2.6; lr: 0.00042; sents:  189905; bsz: 6033/7559/237; 109744/137484 tok/s;   1021 sec;
[2024-04-30 09:22:04,693 INFO] Step 2300/100000; acc: 73.7; ppl:  13.2; xent: 2.6; lr: 0.00044; sents:  204467; bsz: 6033/7497/256; 109833/136493 tok/s;   1065 sec;
[2024-04-30 09:22:48,548 INFO] Step 2400/100000; acc: 74.7; ppl:  12.5; xent: 2.5; lr: 0.00046; sents:  203899; bsz: 6060/7568/255; 110547/138053 tok/s;   1109 sec;
[2024-04-30 09:23:32,985 INFO] Step 2500/100000; acc: 74.9; ppl:  12.2; xent: 2.5; lr: 0.00048; sents:  198046; bsz: 6005/7474/248; 108103/134565 tok/s;   1153 sec;
[2024-04-30 09:24:17,229 INFO] Step 2600/100000; acc: 75.5; ppl:  11.8; xent: 2.5; lr: 0.00049; sents:  206377; bsz: 6034/7526/258; 109094/136078 tok/s;   1197 sec;
[2024-04-30 09:25:00,778 INFO] Step 2700/100000; acc: 75.7; ppl:  11.6; xent: 2.5; lr: 0.00051; sents:  201792; bsz: 5930/7474/252; 108945/137308 tok/s;   1241 sec;
[2024-04-30 09:25:44,517 INFO] Step 2800/100000; acc: 76.1; ppl:  11.3; xent: 2.4; lr: 0.00053; sents:  196737; bsz: 5976/7478/246; 109304/136777 tok/s;   1285 sec;
[2024-04-30 09:26:28,259 INFO] Step 2900/100000; acc: 76.5; ppl:  11.1; xent: 2.4; lr: 0.00055; sents:  205984; bsz: 6054/7513/257; 110713/137397 tok/s;   1328 sec;
[2024-04-30 09:27:12,374 INFO] Step 3000/100000; acc: 76.6; ppl:  10.9; xent: 2.4; lr: 0.00057; sents:  192816; bsz: 6041/7482/241; 109555/135676 tok/s;   1372 sec;
[2024-04-30 09:27:56,998 INFO] Step 3100/100000; acc: 77.4; ppl:  10.5; xent: 2.4; lr: 0.00059; sents:  203300; bsz: 6054/7495/254; 108533/134364 tok/s;   1417 sec;
[2024-04-30 09:28:40,734 INFO] Step 3200/100000; acc: 77.6; ppl:  10.4; xent: 2.3; lr: 0.00061; sents:  192996; bsz: 6041/7523/241; 110499/137607 tok/s;   1461 sec;
[2024-04-30 09:29:24,316 INFO] Step 3300/100000; acc: 77.6; ppl:  10.3; xent: 2.3; lr: 0.00063; sents:  194055; bsz: 5963/7486/243; 109452/137416 tok/s;   1504 sec;
[2024-04-30 09:30:07,948 INFO] Step 3400/100000; acc: 77.9; ppl:  10.2; xent: 2.3; lr: 0.00065; sents:  202468; bsz: 5956/7499/253; 109204/137495 tok/s;   1548 sec;
[2024-04-30 09:30:51,929 INFO] Step 3500/100000; acc: 78.0; ppl:  10.1; xent: 2.3; lr: 0.00067; sents:  201227; bsz: 6097/7473/252; 110897/135937 tok/s;   1592 sec;
[2024-04-30 09:31:36,138 INFO] Step 3600/100000; acc: 78.2; ppl:  10.0; xent: 2.3; lr: 0.00068; sents:  207933; bsz: 6020/7542/260; 108934/136475 tok/s;   1636 sec;
[2024-04-30 09:32:19,894 INFO] Step 3700/100000; acc: 78.4; ppl:   9.9; xent: 2.3; lr: 0.00070; sents:  194469; bsz: 6027/7513/243; 110184/137369 tok/s;   1680 sec;
[2024-04-30 09:33:03,583 INFO] Step 3800/100000; acc: 78.2; ppl:  10.0; xent: 2.3; lr: 0.00072; sents:  202157; bsz: 5969/7471/253; 109302/136799 tok/s;   1724 sec;
[2024-04-30 09:33:47,371 INFO] Step 3900/100000; acc: 78.8; ppl:   9.7; xent: 2.3; lr: 0.00074; sents:  207481; bsz: 6073/7560/259; 110949/138127 tok/s;   1767 sec;
[2024-04-30 09:34:31,312 INFO] Step 4000/100000; acc: 78.8; ppl:   9.6; xent: 2.3; lr: 0.00076; sents:  190034; bsz: 6063/7495/238; 110390/136447 tok/s;   1811 sec;
[2024-04-30 09:35:15,534 INFO] Step 4100/100000; acc: 79.0; ppl:   9.5; xent: 2.3; lr: 0.00078; sents:  204370; bsz: 6035/7513/255; 109175/135917 tok/s;   1856 sec;
[2024-04-30 09:35:59,198 INFO] Step 4200/100000; acc: 79.0; ppl:   9.5; xent: 2.3; lr: 0.00080; sents:  207459; bsz: 5992/7521/259; 109792/137808 tok/s;   1899 sec;
[2024-04-30 09:36:43,036 INFO] Step 4300/100000; acc: 79.3; ppl:   9.4; xent: 2.2; lr: 0.00082; sents:  195268; bsz: 6163/7537/244; 112472/137547 tok/s;   1943 sec;
[2024-04-30 09:37:26,657 INFO] Step 4400/100000; acc: 79.0; ppl:   9.5; xent: 2.3; lr: 0.00084; sents:  199151; bsz: 5984/7504/249; 109749/137630 tok/s;   1987 sec;
[2024-04-30 09:38:10,335 INFO] Step 4500/100000; acc: 79.3; ppl:   9.3; xent: 2.2; lr: 0.00086; sents:  200332; bsz: 5969/7525/250; 109335/137822 tok/s;   2030 sec;
[2024-04-30 09:38:54,641 INFO] Step 4600/100000; acc: 79.4; ppl:   9.3; xent: 2.2; lr: 0.00088; sents:  214139; bsz: 5933/7453/268; 107126/134566 tok/s;   2075 sec;
[2024-04-30 09:39:39,102 INFO] Step 4700/100000; acc: 79.6; ppl:   9.2; xent: 2.2; lr: 0.00089; sents:  198081; bsz: 6090/7522/248; 109582/135354 tok/s;   2119 sec;
[2024-04-30 09:40:23,001 INFO] Step 4800/100000; acc: 79.7; ppl:   9.1; xent: 2.2; lr: 0.00091; sents:  206503; bsz: 6073/7567/258; 110681/137893 tok/s;   2163 sec;
[2024-04-30 09:41:06,761 INFO] Step 4900/100000; acc: 79.5; ppl:   9.2; xent: 2.2; lr: 0.00093; sents:  194923; bsz: 5990/7534/244; 109505/137729 tok/s;   2207 sec;
[2024-04-30 09:41:50,359 INFO] Step 5000/100000; acc: 79.7; ppl:   9.1; xent: 2.2; lr: 0.00095; sents:  191346; bsz: 5993/7461/239; 109967/136904 tok/s;   2250 sec;
[2024-04-30 09:42:34,563 INFO] Step 5100/100000; acc: 79.9; ppl:   9.0; xent: 2.2; lr: 0.00097; sents:  200868; bsz: 6048/7541/251; 109453/136480 tok/s;   2295 sec;
[2024-04-30 09:43:18,667 INFO] Step 5200/100000; acc: 80.0; ppl:   9.0; xent: 2.2; lr: 0.00099; sents:  204524; bsz: 6067/7481/256; 110045/135707 tok/s;   2339 sec;
[2024-04-30 09:44:02,334 INFO] Step 5300/100000; acc: 80.2; ppl:   8.9; xent: 2.2; lr: 0.00101; sents:  213585; bsz: 5999/7451/267; 109910/136501 tok/s;   2382 sec;
[2024-04-30 09:44:46,013 INFO] Step 5400/100000; acc: 80.1; ppl:   8.9; xent: 2.2; lr: 0.00103; sents:  201466; bsz: 6001/7536/252; 109902/138016 tok/s;   2426 sec;
[2024-04-30 09:45:29,602 INFO] Step 5500/100000; acc: 80.3; ppl:   8.8; xent: 2.2; lr: 0.00105; sents:  197495; bsz: 6089/7493/247; 111749/137522 tok/s;   2470 sec;
[2024-04-30 09:46:13,587 INFO] Step 5600/100000; acc: 80.2; ppl:   8.8; xent: 2.2; lr: 0.00107; sents:  207307; bsz: 6049/7535/259; 110023/137050 tok/s;   2514 sec;
[2024-04-30 09:46:58,030 INFO] Step 5700/100000; acc: 80.1; ppl:   8.9; xent: 2.2; lr: 0.00108; sents:  198250; bsz: 5997/7481/248; 107945/134667 tok/s;   2558 sec;
[2024-04-30 09:47:41,678 INFO] Step 5800/100000; acc: 80.3; ppl:   8.8; xent: 2.2; lr: 0.00110; sents:  192640; bsz: 6045/7551/241; 110790/138390 tok/s;   2602 sec;
[2024-04-30 09:48:25,176 INFO] Step 5900/100000; acc: 80.5; ppl:   8.7; xent: 2.2; lr: 0.00112; sents:  202917; bsz: 5964/7489/254; 109692/137741 tok/s;   2645 sec;
[2024-04-30 09:49:08,740 INFO] Step 6000/100000; acc: 80.4; ppl:   8.7; xent: 2.2; lr: 0.00114; sents:  211222; bsz: 6060/7575/264; 111285/139118 tok/s;   2689 sec;
[2024-04-30 09:49:52,542 INFO] Step 6100/100000; acc: 80.5; ppl:   8.7; xent: 2.2; lr: 0.00113; sents:  204155; bsz: 5972/7402/255; 109075/135188 tok/s;   2733 sec;
[2024-04-30 09:50:36,849 INFO] Step 6200/100000; acc: 80.6; ppl:   8.6; xent: 2.2; lr: 0.00112; sents:  197101; bsz: 6053/7556/246; 109295/136428 tok/s;   2777 sec;
[2024-04-30 09:51:20,592 INFO] Step 6300/100000; acc: 80.7; ppl:   8.6; xent: 2.2; lr: 0.00111; sents:  197635; bsz: 6018/7504/247; 110062/137240 tok/s;   2821 sec;
[2024-04-30 09:52:04,563 INFO] Step 6400/100000; acc: 80.9; ppl:   8.5; xent: 2.1; lr: 0.00110; sents:  201422; bsz: 6136/7556/252; 111641/137473 tok/s;   2865 sec;
[2024-04-30 09:52:48,412 INFO] Step 6500/100000; acc: 80.8; ppl:   8.6; xent: 2.1; lr: 0.00110; sents:  197942; bsz: 5961/7485/247; 108750/136556 tok/s;   2908 sec;
[2024-04-30 09:53:32,444 INFO] Step 6600/100000; acc: 81.0; ppl:   8.5; xent: 2.1; lr: 0.00109; sents:  200575; bsz: 6047/7492/251; 109875/136127 tok/s;   2952 sec;
[2024-04-30 09:54:16,743 INFO] Step 6700/100000; acc: 81.0; ppl:   8.5; xent: 2.1; lr: 0.00108; sents:  201314; bsz: 6002/7489/252; 108395/135239 tok/s;   2997 sec;
[2024-04-30 09:55:00,801 INFO] Step 6800/100000; acc: 81.1; ppl:   8.4; xent: 2.1; lr: 0.00107; sents:  204925; bsz: 6026/7552/256; 109418/137127 tok/s;   3041 sec;
[2024-04-30 09:55:44,533 INFO] Step 6900/100000; acc: 81.2; ppl:   8.4; xent: 2.1; lr: 0.00106; sents:  204869; bsz: 6039/7520/256; 110481/137573 tok/s;   3085 sec;
[2024-04-30 09:56:28,194 INFO] Step 7000/100000; acc: 81.1; ppl:   8.4; xent: 2.1; lr: 0.00106; sents:  191226; bsz: 5913/7411/239; 108347/135789 tok/s;   3128 sec;
[2024-04-30 09:57:11,933 INFO] Step 7100/100000; acc: 81.3; ppl:   8.3; xent: 2.1; lr: 0.00105; sents:  206385; bsz: 5986/7425/258; 109497/135803 tok/s;   3172 sec;
[2024-04-30 09:57:56,532 INFO] Step 7200/100000; acc: 81.5; ppl:   8.2; xent: 2.1; lr: 0.00104; sents:  204492; bsz: 6101/7596/256; 109445/136259 tok/s;   3217 sec;
[2024-04-30 09:58:40,715 INFO] Step 7300/100000; acc: 81.4; ppl:   8.2; xent: 2.1; lr: 0.00103; sents:  201533; bsz: 6044/7545/252; 109444/136607 tok/s;   3261 sec;
[2024-04-30 09:59:24,497 INFO] Step 7400/100000; acc: 81.6; ppl:   8.2; xent: 2.1; lr: 0.00103; sents:  197938; bsz: 6042/7527/247; 110400/137535 tok/s;   3305 sec;
[2024-04-30 10:00:08,455 INFO] Step 7500/100000; acc: 81.2; ppl:   8.3; xent: 2.1; lr: 0.00102; sents:  202339; bsz: 6017/7513/253; 109503/136722 tok/s;   3348 sec;
[2024-04-30 10:00:52,242 INFO] Step 7600/100000; acc: 81.5; ppl:   8.2; xent: 2.1; lr: 0.00101; sents:  191350; bsz: 5974/7519/239; 109152/137384 tok/s;   3392 sec;
[2024-04-30 10:01:36,396 INFO] Step 7700/100000; acc: 81.7; ppl:   8.1; xent: 2.1; lr: 0.00101; sents:  210811; bsz: 6088/7542/264; 110297/136657 tok/s;   3436 sec;
[2024-04-30 10:02:20,581 INFO] Step 7800/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00100; sents:  205768; bsz: 6062/7488/257; 109757/135578 tok/s;   3481 sec;
[2024-04-30 10:03:04,009 INFO] Step 7900/100000; acc: 81.6; ppl:   8.2; xent: 2.1; lr: 0.00099; sents:  210300; bsz: 5997/7505/263; 110471/138246 tok/s;   3524 sec;
[2024-04-30 10:03:47,598 INFO] Step 8000/100000; acc: 81.7; ppl:   8.1; xent: 2.1; lr: 0.00099; sents:  208598; bsz: 6051/7530/261; 111053/138193 tok/s;   3568 sec;
[2024-04-30 10:04:31,349 INFO] Step 8100/100000; acc: 81.9; ppl:   8.0; xent: 2.1; lr: 0.00098; sents:  192413; bsz: 6077/7517/241; 111124/137442 tok/s;   3611 sec;
[2024-04-30 10:05:15,413 INFO] Step 8200/100000; acc: 81.9; ppl:   8.0; xent: 2.1; lr: 0.00098; sents:  194882; bsz: 6003/7447/244; 108983/135202 tok/s;   3655 sec;
[2024-04-30 10:05:59,927 INFO] Step 8300/100000; acc: 81.8; ppl:   8.1; xent: 2.1; lr: 0.00097; sents:  204864; bsz: 5987/7518/256; 107603/135119 tok/s;   3700 sec;
[2024-04-30 10:06:43,426 INFO] Step 8400/100000; acc: 81.8; ppl:   8.0; xent: 2.1; lr: 0.00096; sents:  193079; bsz: 5916/7466/241; 108800/137313 tok/s;   3743 sec;
[2024-04-30 10:07:27,107 INFO] Step 8500/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00096; sents:  209459; bsz: 6036/7490/262; 110542/137177 tok/s;   3787 sec;
[2024-04-30 10:08:10,718 INFO] Step 8600/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00095; sents:  204029; bsz: 6020/7510/255; 110430/137773 tok/s;   3831 sec;
[2024-04-30 10:08:54,758 INFO] Step 8700/100000; acc: 82.1; ppl:   7.9; xent: 2.1; lr: 0.00095; sents:  211731; bsz: 6073/7561/265; 110320/137346 tok/s;   3875 sec;
[2024-04-30 10:09:39,376 INFO] Step 8800/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00094; sents:  200945; bsz: 6007/7506/251; 107711/134578 tok/s;   3919 sec;
[2024-04-30 10:10:23,035 INFO] Step 8900/100000; acc: 82.0; ppl:   8.0; xent: 2.1; lr: 0.00094; sents:  182456; bsz: 6029/7477/228; 110480/137010 tok/s;   3963 sec;
[2024-04-30 10:11:06,704 INFO] Step 9000/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00093; sents:  203504; bsz: 5999/7442/254; 109907/136333 tok/s;   4007 sec;
[2024-04-30 10:11:50,355 INFO] Step 9100/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00093; sents:  199770; bsz: 6029/7522/250; 110490/137863 tok/s;   4050 sec;
[2024-04-30 10:12:34,284 INFO] Step 9200/100000; acc: 82.1; ppl:   7.9; xent: 2.1; lr: 0.00092; sents:  197339; bsz: 6054/7555/247; 110253/137596 tok/s;   4094 sec;
[2024-04-30 10:13:18,494 INFO] Step 9300/100000; acc: 82.1; ppl:   7.9; xent: 2.1; lr: 0.00092; sents:  206232; bsz: 6000/7512/258; 108564/135932 tok/s;   4139 sec;
[2024-04-30 10:14:02,573 INFO] Step 9400/100000; acc: 82.3; ppl:   7.9; xent: 2.1; lr: 0.00091; sents:  206444; bsz: 6054/7507/258; 109878/136249 tok/s;   4183 sec;
[2024-04-30 10:14:46,470 INFO] Step 9500/100000; acc: 82.3; ppl:   7.9; xent: 2.1; lr: 0.00091; sents:  194757; bsz: 6074/7464/243; 110703/136031 tok/s;   4226 sec;
[2024-04-30 10:15:30,139 INFO] Step 9600/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00090; sents:  188490; bsz: 5990/7550/236; 109743/138316 tok/s;   4270 sec;
[2024-04-30 10:16:13,929 INFO] Step 9700/100000; acc: 82.4; ppl:   7.8; xent: 2.1; lr: 0.00090; sents:  210488; bsz: 6027/7477/263; 110103/136597 tok/s;   4314 sec;
[2024-04-30 10:16:58,422 INFO] Step 9800/100000; acc: 82.2; ppl:   7.9; xent: 2.1; lr: 0.00089; sents:  210374; bsz: 6029/7548/263; 108405/135715 tok/s;   4358 sec;
[2024-04-30 10:17:42,753 INFO] Step 9900/100000; acc: 82.4; ppl:   7.8; xent: 2.1; lr: 0.00089; sents:  200842; bsz: 5976/7504/251; 107850/135410 tok/s;   4403 sec;
[2024-04-30 10:18:26,627 INFO] Step 10000/100000; acc: 82.5; ppl:   7.7; xent: 2.0; lr: 0.00088; sents:  206167; bsz: 6074/7558/258; 110746/137817 tok/s;   4447 sec;
[2024-04-30 10:18:38,336 INFO] valid stats calculation
                           took: 11.707928657531738 s.
[2024-04-30 10:18:42,194 INFO] The translation of the valid dataset for dynamic scoring
                               took : 3.858090877532959 s.
[2024-04-30 10:18:42,194 INFO] UPDATING VALIDATION BLEU
[2024-04-30 10:18:42,361 INFO] validation BLEU: 26.232149964931843
[2024-04-30 10:18:42,362 INFO] Train perplexity: 14.9452
[2024-04-30 10:18:42,362 INFO] Train accuracy: 73.329
[2024-04-30 10:18:42,362 INFO] Sentences processed: 2.01002e+07
[2024-04-30 10:18:42,362 INFO] Average bsz: 6026/7510/251
[2024-04-30 10:18:42,362 INFO] Validation perplexity: 9.93693
[2024-04-30 10:18:42,362 INFO] Validation accuracy: 76.7652
[2024-04-30 10:18:42,365 INFO] Saving checkpoint /media/vincent/Crucial X6/NMT_work/en-de/runs/6-6-8-512-2048/6-6-8-512-2048-gelu_step_10000.pt
[2024-04-30 10:19:26,833 INFO] Step 10100/100000; acc: 82.5; ppl:   7.8; xent: 2.0; lr: 0.00088; sents:  204932; bsz: 6039/7497/256; 80241/99622 tok/s;   4507 sec;
[2024-04-30 10:20:10,011 INFO] Step 10200/100000; acc: 82.5; ppl:   7.7; xent: 2.0; lr: 0.00088; sents:  197642; bsz: 5992/7521/247; 111022/139355 tok/s;   4550 sec;
[2024-04-30 10:20:53,811 INFO] Step 10300/100000; acc: 82.6; ppl:   7.7; xent: 2.0; lr: 0.00087; sents:  199151; bsz: 6054/7474/249; 110574/136505 tok/s;   4594 sec;
[2024-04-30 10:21:37,359 INFO] Step 10400/100000; acc: 82.5; ppl:   7.8; xent: 2.0; lr: 0.00087; sents:  200405; bsz: 5945/7488/251; 109210/137553 tok/s;   4637 sec;
[2024-04-30 10:22:20,147 INFO] Step 10500/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00086; sents:  190713; bsz: 6122/7568/238; 114498/141552 tok/s;   4680 sec;
[2024-04-30 10:23:02,767 INFO] Step 10600/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00086; sents:  200841; bsz: 6017/7572/251; 112943/142136 tok/s;   4723 sec;
[2024-04-30 10:23:45,789 INFO] Step 10700/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00085; sents:  193629; bsz: 6076/7531/242; 112991/140034 tok/s;   4766 sec;
[2024-04-30 10:24:28,632 INFO] Step 10800/100000; acc: 82.6; ppl:   7.7; xent: 2.0; lr: 0.00085; sents:  196285; bsz: 6014/7481/245; 112297/139699 tok/s;   4809 sec;
[2024-04-30 10:25:11,536 INFO] Step 10900/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00085; sents:  208474; bsz: 5998/7496/261; 111838/139776 tok/s;   4852 sec;
[2024-04-30 10:25:54,661 INFO] Step 11000/100000; acc: 82.7; ppl:   7.7; xent: 2.0; lr: 0.00084; sents:  202466; bsz: 5958/7451/253; 110526/138224 tok/s;   4895 sec;
[2024-04-30 10:26:37,721 INFO] Step 11100/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00084; sents:  200782; bsz: 6047/7549/251; 112353/140259 tok/s;   4938 sec;
[2024-04-30 10:27:20,309 INFO] Step 11200/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00084; sents:  195963; bsz: 6089/7555/245; 114378/141928 tok/s;   4980 sec;
[2024-04-30 10:28:02,700 INFO] Step 11300/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00083; sents:  206312; bsz: 6072/7539/258; 114586/142269 tok/s;   5023 sec;
[2024-04-30 10:28:45,420 INFO] Step 11400/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00083; sents:  203784; bsz: 5947/7405/255; 111371/138679 tok/s;   5065 sec;
[2024-04-30 10:30:02,740 INFO] Step 11500/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00082; sents:  193775; bsz: 6016/7502/242; 62246/77625 tok/s;   5143 sec;
[2024-04-30 10:30:45,323 INFO] Step 11600/100000; acc: 83.0; ppl:   7.5; xent: 2.0; lr: 0.00082; sents:  205746; bsz: 6095/7520/257; 114503/141285 tok/s;   5185 sec;
[2024-04-30 10:31:28,528 INFO] Step 11700/100000; acc: 82.8; ppl:   7.6; xent: 2.0; lr: 0.00082; sents:  212557; bsz: 5975/7498/266; 110639/138833 tok/s;   5229 sec;
[2024-04-30 10:32:11,221 INFO] Step 11800/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00081; sents:  193935; bsz: 6052/7558/242; 113397/141629 tok/s;   5271 sec;
[2024-04-30 10:33:03,661 INFO] Step 11900/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00081; sents:  197399; bsz: 5989/7470/247; 91359/113961 tok/s;   5324 sec;
[2024-04-30 10:33:47,600 INFO] Step 12000/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00081; sents:  196812; bsz: 6060/7548/246; 110343/137423 tok/s;   5368 sec;
[2024-04-30 10:34:30,150 INFO] Step 12100/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00080; sents:  209173; bsz: 6133/7572/261; 115316/142369 tok/s;   5410 sec;
[2024-04-30 10:35:13,342 INFO] Step 12200/100000; acc: 83.0; ppl:   7.5; xent: 2.0; lr: 0.00080; sents:  201492; bsz: 5961/7481/252; 110411/138561 tok/s;   5453 sec;
[2024-04-30 10:35:57,385 INFO] Step 12300/100000; acc: 82.9; ppl:   7.6; xent: 2.0; lr: 0.00080; sents:  195067; bsz: 5973/7527/244; 108502/136716 tok/s;   5497 sec;
[2024-04-30 10:36:41,125 INFO] Step 12400/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00079; sents:  210660; bsz: 6045/7523/263; 110555/137603 tok/s;   5541 sec;
[2024-04-30 10:37:25,887 INFO] Step 12500/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00079; sents:  194107; bsz: 6023/7461/243; 107640/133350 tok/s;   5586 sec;
[2024-04-30 10:38:09,319 INFO] Step 12600/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00079; sents:  192615; bsz: 5964/7474/241; 109852/137667 tok/s;   5629 sec;
[2024-04-30 10:38:51,963 INFO] Step 12700/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00078; sents:  199159; bsz: 6093/7479/249; 114304/140316 tok/s;   5672 sec;
[2024-04-30 10:39:34,593 INFO] Step 12800/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00078; sents:  206690; bsz: 6068/7539/258; 113873/141489 tok/s;   5715 sec;
[2024-04-30 10:40:17,689 INFO] Step 12900/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00078; sents:  190689; bsz: 6032/7566/238; 111970/140448 tok/s;   5758 sec;
[2024-04-30 10:41:00,486 INFO] Step 13000/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00078; sents:  201053; bsz: 5918/7459/251; 110621/139437 tok/s;   5801 sec;
[2024-04-30 10:42:12,972 INFO] Step 13100/100000; acc: 83.2; ppl:   7.4; xent: 2.0; lr: 0.00077; sents:  206031; bsz: 6106/7553/258; 67395/83357 tok/s;   5873 sec;
[2024-04-30 10:42:58,487 INFO] Step 13200/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00077; sents:  218555; bsz: 6019/7465/273; 105789/131204 tok/s;   5919 sec;
[2024-04-30 10:43:41,403 INFO] Step 13300/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00077; sents:  189343; bsz: 6050/7594/237; 112783/141554 tok/s;   5961 sec;
[2024-04-30 10:44:31,293 INFO] Step 13400/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00076; sents:  193520; bsz: 5966/7453/242; 112153/140099 tok/s;   6011 sec;
[2024-04-30 10:45:16,334 INFO] Step 13500/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  198386; bsz: 6012/7488/248; 106788/133001 tok/s;   6056 sec;
[2024-04-30 10:45:59,806 INFO] Step 13600/100000; acc: 83.1; ppl:   7.5; xent: 2.0; lr: 0.00076; sents:  199194; bsz: 6049/7593/249; 111311/139725 tok/s;   6100 sec;
[2024-04-30 10:46:44,603 INFO] Step 13700/100000; acc: 83.2; ppl:   7.4; xent: 2.0; lr: 0.00076; sents:  201305; bsz: 6023/7459/252; 107566/133212 tok/s;   6145 sec;
[2024-04-30 10:47:28,010 INFO] Step 13800/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00075; sents:  191024; bsz: 6014/7514/239; 110841/138484 tok/s;   6188 sec;
[2024-04-30 10:48:11,540 INFO] Step 13900/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00075; sents:  195971; bsz: 5977/7502/245; 109841/137866 tok/s;   6232 sec;
[2024-04-30 10:48:55,525 INFO] Step 14000/100000; acc: 83.5; ppl:   7.4; xent: 2.0; lr: 0.00075; sents:  208361; bsz: 6106/7521/260; 111063/136788 tok/s;   6276 sec;
[2024-04-30 10:49:39,032 INFO] Step 14100/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00074; sents:  208694; bsz: 6012/7529/261; 110551/138438 tok/s;   6319 sec;
[2024-04-30 10:50:22,407 INFO] Step 14200/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00074; sents:  196704; bsz: 5943/7500/246; 109619/138331 tok/s;   6362 sec;
[2024-04-30 10:51:06,021 INFO] Step 14300/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00074; sents:  198244; bsz: 6135/7504/248; 112531/137637 tok/s;   6406 sec;
[2024-04-30 10:51:49,559 INFO] Step 14400/100000; acc: 83.2; ppl:   7.5; xent: 2.0; lr: 0.00074; sents:  195187; bsz: 5994/7526/244; 110144/138296 tok/s;   6450 sec;
[2024-04-30 10:52:33,449 INFO] Step 14500/100000; acc: 83.5; ppl:   7.4; xent: 2.0; lr: 0.00073; sents:  214335; bsz: 5983/7487/268; 109050/136461 tok/s;   6493 sec;
[2024-04-30 10:53:17,493 INFO] Step 14600/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00073; sents:  201405; bsz: 6082/7521/252; 110478/136615 tok/s;   6538 sec;
[2024-04-30 10:54:00,830 INFO] Step 14700/100000; acc: 83.5; ppl:   7.4; xent: 2.0; lr: 0.00073; sents:  202122; bsz: 6049/7512/253; 111670/138672 tok/s;   6581 sec;
[2024-04-30 10:54:44,168 INFO] Step 14800/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00073; sents:  199888; bsz: 5971/7562/250; 110227/139585 tok/s;   6624 sec;
[2024-04-30 10:55:27,690 INFO] Step 14900/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00072; sents:  193799; bsz: 6107/7509/242; 112258/138029 tok/s;   6668 sec;
[2024-04-30 10:56:11,682 INFO] Step 15000/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00072; sents:  196345; bsz: 6083/7462/245; 110626/135692 tok/s;   6712 sec;
[2024-04-30 10:56:11,685 INFO] Updated dropout/attn dropout to 0.100000 0.000000 at step 15001
[2024-04-30 10:56:55,683 INFO] Step 15100/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00072; sents:  200738; bsz: 5938/7505/251; 107970/136447 tok/s;   6756 sec;
[2024-04-30 10:57:39,040 INFO] Step 15200/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00072; sents:  205004; bsz: 5975/7480/256; 110256/138016 tok/s;   6799 sec;
[2024-04-30 10:58:22,026 INFO] Step 15300/100000; acc: 83.4; ppl:   7.4; xent: 2.0; lr: 0.00071; sents:  192753; bsz: 5988/7466/241; 111443/138943 tok/s;   6842 sec;
[2024-04-30 10:59:05,110 INFO] Step 15400/100000; acc: 83.3; ppl:   7.4; xent: 2.0; lr: 0.00071; sents:  206119; bsz: 5955/7496/258; 110569/139189 tok/s;   6885 sec;
[2024-04-30 10:59:48,507 INFO] Step 15500/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  208709; bsz: 6022/7510/261; 111021/138453 tok/s;   6929 sec;
[2024-04-30 11:00:32,786 INFO] Step 15600/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  197870; bsz: 6114/7564/247; 110457/136657 tok/s;   6973 sec;
[2024-04-30 11:01:15,945 INFO] Step 15700/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00071; sents:  203647; bsz: 6035/7548/255; 111866/139910 tok/s;   7016 sec;
[2024-04-30 11:01:59,244 INFO] Step 15800/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  201460; bsz: 6086/7494/252; 112439/138466 tok/s;   7059 sec;
[2024-04-30 11:02:42,358 INFO] Step 15900/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  200151; bsz: 5995/7524/250; 111236/139618 tok/s;   7102 sec;
[2024-04-30 11:03:25,854 INFO] Step 16000/100000; acc: 83.4; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  202760; bsz: 6032/7480/253; 110938/137569 tok/s;   7146 sec;
[2024-04-30 11:04:09,838 INFO] Step 16100/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00070; sents:  202618; bsz: 6010/7520/253; 109310/136777 tok/s;   7190 sec;
[2024-04-30 11:04:53,124 INFO] Step 16200/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  198067; bsz: 6032/7537/248; 111477/139297 tok/s;   7233 sec;
[2024-04-30 11:05:36,372 INFO] Step 16300/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  197980; bsz: 6047/7516/247; 111856/139028 tok/s;   7276 sec;
[2024-04-30 11:06:19,939 INFO] Step 16400/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  207932; bsz: 6065/7514/260; 111361/137982 tok/s;   7320 sec;
[2024-04-30 11:07:03,628 INFO] Step 16500/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  202317; bsz: 6076/7577/253; 111263/138756 tok/s;   7364 sec;
[2024-04-30 11:07:47,490 INFO] Step 16600/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00069; sents:  196965; bsz: 5958/7458/246; 108666/136026 tok/s;   7408 sec;
[2024-04-30 11:08:31,319 INFO] Step 16700/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  207064; bsz: 5981/7506/259; 109169/137011 tok/s;   7451 sec;
[2024-04-30 11:09:14,818 INFO] Step 16800/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  191348; bsz: 6067/7483/239; 111588/137620 tok/s;   7495 sec;
[2024-04-30 11:09:57,953 INFO] Step 16900/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  212024; bsz: 6018/7531/265; 111611/139668 tok/s;   7538 sec;
[2024-04-30 11:10:41,130 INFO] Step 17000/100000; acc: 83.5; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  200320; bsz: 6046/7516/250; 112018/139258 tok/s;   7581 sec;
[2024-04-30 11:11:24,538 INFO] Step 17100/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00068; sents:  200510; bsz: 5989/7525/251; 110378/138696 tok/s;   7625 sec;
[2024-04-30 11:12:08,355 INFO] Step 17200/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00067; sents:  199841; bsz: 6032/7485/250; 110138/136657 tok/s;   7668 sec;
[2024-04-30 11:12:51,555 INFO] Step 17300/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00067; sents:  208980; bsz: 6020/7599/261; 111477/140730 tok/s;   7712 sec;
[2024-04-30 11:13:34,804 INFO] Step 17400/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00067; sents:  203835; bsz: 5955/7438/255; 110149/137586 tok/s;   7755 sec;
[2024-04-30 11:14:18,016 INFO] Step 17500/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00067; sents:  198740; bsz: 6014/7481/248; 111348/138506 tok/s;   7798 sec;
[2024-04-30 11:15:01,284 INFO] Step 17600/100000; acc: 83.6; ppl:   7.3; xent: 2.0; lr: 0.00067; sents:  202580; bsz: 5961/7520/253; 110224/139041 tok/s;   7841 sec;
[2024-04-30 11:15:45,738 INFO] Step 17700/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  197938; bsz: 6132/7512/247; 110355/135180 tok/s;   7886 sec;
[2024-04-30 11:16:28,934 INFO] Step 17800/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  198167; bsz: 6059/7541/248; 112205/139655 tok/s;   7929 sec;
[2024-04-30 11:17:11,941 INFO] Step 17900/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  209727; bsz: 6011/7432/262; 111818/138245 tok/s;   7972 sec;
[2024-04-30 11:17:55,006 INFO] Step 18000/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  199678; bsz: 6025/7523/250; 111917/139756 tok/s;   8015 sec;
[2024-04-30 11:18:38,454 INFO] Step 18100/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00066; sents:  205470; bsz: 6044/7512/257; 111287/138318 tok/s;   8058 sec;
[2024-04-30 11:19:22,106 INFO] Step 18200/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00066; sents:  198623; bsz: 6031/7559/248; 110520/138525 tok/s;   8102 sec;
[2024-04-30 11:20:05,242 INFO] Step 18300/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  198913; bsz: 5988/7465/249; 111048/138443 tok/s;   8145 sec;
[2024-04-30 11:20:48,298 INFO] Step 18400/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  200702; bsz: 6066/7464/251; 112700/138686 tok/s;   8188 sec;
[2024-04-30 11:21:31,416 INFO] Step 18500/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00065; sents:  198832; bsz: 6007/7521/249; 111457/139537 tok/s;   8231 sec;
[2024-04-30 11:22:14,761 INFO] Step 18600/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00065; sents:  204525; bsz: 5991/7522/256; 110571/138819 tok/s;   8275 sec;
[2024-04-30 11:22:58,542 INFO] Step 18700/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00065; sents:  196480; bsz: 5998/7494/246; 109604/136942 tok/s;   8319 sec;
[2024-04-30 11:23:42,129 INFO] Step 18800/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  200451; bsz: 6049/7551/251; 111025/138596 tok/s;   8362 sec;
[2024-04-30 11:24:25,516 INFO] Step 18900/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  189173; bsz: 6010/7468/236; 110826/137703 tok/s;   8406 sec;
[2024-04-30 11:25:08,772 INFO] Step 19000/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00064; sents:  206220; bsz: 5944/7404/258; 109940/136942 tok/s;   8449 sec;
[2024-04-30 11:25:52,256 INFO] Step 19100/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  210848; bsz: 6053/7517/264; 111370/138290 tok/s;   8492 sec;
[2024-04-30 11:26:36,329 INFO] Step 19200/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  203448; bsz: 6063/7568/254; 110059/137368 tok/s;   8536 sec;
[2024-04-30 11:27:19,941 INFO] Step 19300/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00064; sents:  201114; bsz: 6039/7542/251; 110777/138357 tok/s;   8580 sec;
[2024-04-30 11:28:03,212 INFO] Step 19400/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  196814; bsz: 6038/7537/246; 111628/139341 tok/s;   8623 sec;
[2024-04-30 11:28:46,471 INFO] Step 19500/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  202763; bsz: 6045/7519/253; 111784/139053 tok/s;   8666 sec;
[2024-04-30 11:29:29,655 INFO] Step 19600/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  201899; bsz: 5994/7463/252; 111045/138256 tok/s;   8710 sec;
[2024-04-30 11:30:13,366 INFO] Step 19700/100000; acc: 83.7; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  197130; bsz: 5975/7496/246; 109361/137188 tok/s;   8753 sec;
[2024-04-30 11:30:57,312 INFO] Step 19800/100000; acc: 83.9; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  210978; bsz: 6065/7508/264; 110404/136674 tok/s;   8797 sec;
[2024-04-30 11:31:40,677 INFO] Step 19900/100000; acc: 83.8; ppl:   7.2; xent: 2.0; lr: 0.00063; sents:  200050; bsz: 6048/7556/250; 111573/139399 tok/s;   8841 sec;
[2024-04-30 11:32:24,085 INFO] Step 20000/100000; acc: 83.7; ppl:   7.3; xent: 2.0; lr: 0.00062; sents:  195977; bsz: 6002/7526/245; 110616/138705 tok/s;   8884 sec;
[2024-04-30 11:32:36,898 INFO] valid stats calculation
                           took: 12.812361478805542 s.
[2024-04-30 11:32:40,889 INFO] The translation of the valid dataset for dynamic scoring
                               took : 3.991027355194092 s.
[2024-04-30 11:32:40,890 INFO] UPDATING VALIDATION BLEU
[2024-04-30 11:32:41,061 INFO] validation BLEU: 28.418102236234134
[2024-04-30 11:32:41,062 INFO] Train perplexity: 10.5139
[2024-04-30 11:32:41,062 INFO] Train accuracy: 78.3409
[2024-04-30 11:32:41,062 INFO] Sentences processed: 4.01872e+07
[2024-04-30 11:32:41,062 INFO] Average bsz: 6026/7510/251
[2024-04-30 11:32:41,062 INFO] Validation perplexity: 8.86539
[2024-04-30 11:32:41,062 INFO] Validation accuracy: 78.5833


LynxPDA commented Apr 30, 2024

Yes, I think you're right! The models on which I saw such a difference differ quite significantly from the standard Transformer base.
Here's a general description:

  1. Effective batch size - 140k (1500 * 2 * 47) (batch size × world size × multiplier)
  2. Dataset size - 158,794,207 sentence pairs.
  3. Model size:
  • Encoder layers - 20
  • Decoder layers - 20
  • D_model - 512
  • FF - 6144 (9216) (for gated and non-gated activations, respectively).
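The two FF sizes above reflect the usual rule that a gated FFN uses three weight matrices instead of two, so its hidden size is scaled by 2/3 to keep the parameter count comparable. A minimal sketch checking that arithmetic, and the effective-batch-size figure (the exact factor ordering in "1500 * 2 * 47" is an assumption here):

```python
d_model = 512
ff_gated = 6144   # hidden size with a gated activation (e.g. gated GELU)
ff_plain = 9216   # hidden size with a plain (non-gated) activation

# Gated FFN: three projections (W_in, W_gate, W_out) vs. two for plain,
# so equal parameter counts require ff_gated = (2/3) * ff_plain.
params_gated = 3 * d_model * ff_gated
params_plain = 2 * d_model * ff_plain
assert params_gated == params_plain  # both 9,437,184 weights

# Effective batch size in tokens: per-GPU batch x multiplier x world size.
effective_bs = 1500 * 2 * 47
print(effective_bs)  # 141000, i.e. ~140k tokens per update
```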

In my own experiments, I found that deepening the model while simultaneously widening the FF layer improves translation quality (there are many preprints on arXiv supporting this observation).

For the Transformer base, the best I could get was about 28 BLEU for the EN-RU pair at 150,000 steps, while the new deep model with a wide gated-GELU FF layer already reaches about 31.5 BLEU at 18,000 steps.

A few explanations about the BLEU chart:
Base - 512 (8 heads)
Medium - 768 (12 heads)
4k, 9k - FF layer size (effective, reduced to the non-gated equivalent)
20/20, 20/6 - number of encoder and decoder layers
BS - effective batch size

Models Mar-12 and Apr-17 have 330M parameters each; the rest have 457M. The Apr-17 model was first trained at BS 70k (up to 13,000 steps, after which I increased the BS to the standard 140k). It clearly outperforms the Mar-12 model despite having the same parameter count, although the comparison is probably not entirely fair given the change in BS during training.

[Image: BLEU comparison chart]
