Out-of-memory when using target factors #1106

Open
AmitMY opened this issue Mar 2, 2024 · 4 comments

Comments

@AmitMY

AmitMY commented Mar 2, 2024

I have an experiment where I can factorize the source and target tokens such that the context length becomes very small.
To test how effective this is, I tried running four configurations on an A100 GPU with 80 GB of VRAM:

  • no factors
  • source factors, no target factors
  • target factors, no source factors
  • source and target factors

Surprisingly, the third configuration fails with an OOM error. I would have assumed that if anything were going to be a problem, it would be the "no factors" configuration (which has larger sequences).

While I could reduce the batch size, I don't understand why training fails with target factors only, but not with "no factors" or with "source and target factors".
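For a sense of scale, here is a rough back-of-envelope sketch (my own assumption, not Sockeye's actual memory model): if each target factor gets its own output projection and loss, the logits alone grow with the sum of the factor vocabulary sizes.

```python
def logits_bytes(batch, seq_len, vocab_sizes, bytes_per_float=4):
    """Rough float32 memory for the output logits alone:
    one (batch, seq_len, vocab) tensor per output stream."""
    return sum(batch * seq_len * v * bytes_per_float for v in vocab_sizes)

# Vocabulary sizes from the training log below: main target 416,
# factors 16, 24, 272, 328. Batch and seq_len values are illustrative,
# taken from one of the (232, 232) buckets.
main_only = logits_bytes(152, 232, [416])
all_streams = logits_bytes(152, 232, [416, 16, 24, 272, 328])
print(f"{main_only / 2**20:.1f} MiB vs {all_streams / 2**20:.1f} MiB")
```

This ignores everything except the logits tensors, so it is only a lower bound on the difference, but it shows the extra output streams are not free.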

Here are the commands I run:

mkdir -p /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors

python -m sockeye.train \
-d /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data \
--weight-tying-type none --batch-size 1028 --num-layers 4:4 \
--source-factors-combine sum --target-factors-combine sum \
--validation-source /scratch/amoryo/transcription/parallel/dev/source.txt \
--validation-target /scratch/amoryo/transcription/parallel/dev/target_0.txt --validation-target-factors /scratch/amoryo/transcription/parallel/dev/target_1.txt /scratch/amoryo/transcription/parallel/dev/target_2.txt /scratch/amoryo/transcription/parallel/dev/target_3.txt /scratch/amoryo/transcription/parallel/dev/target_4.txt \
--optimized-metric signwriting-similarity --decode-and-evaluate 500 --checkpoint-interval 500 --max-num-checkpoint-not-improved 20 \
--output /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model
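As a stopgap I could trade per-step batch size for gradient accumulation via `--update-interval`, keeping the effective batch constant; a tiny check of the arithmetic, with assumed replacement values:

```python
# Hypothetical mitigation: halve --batch-size and double --update-interval
# (gradient accumulation) so the effective batch in target words is unchanged.
original_batch = 1028   # --batch-size in the command above
reduced_batch = 514     # assumed smaller per-step batch
update_interval = 2     # assumed --update-interval value
effective_batch = reduced_batch * update_interval
print(effective_batch)  # → 1028, same as the original
```

But that only lowers peak memory; it doesn't explain the asymmetry between the configurations.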
Output
[INFO:sockeye.utils] Sockeye: 3.1.38, commit 93099c7ba7695a0f39f9d3e3a7b035664ae94fca, path /data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/__init__.py
[INFO:sockeye.utils] PyTorch: 1.13.1+cu117 (/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/__init__.py)
[INFO:sockeye.utils] Command: /data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py -d /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data --weight-tying-type none --batch-size 1028 --num-layers 4:4 --source-factors-combine sum --target-factors-combine sum --validation-source /scratch/amoryo/transcription/parallel/dev/source.txt --validation-target /scratch/amoryo/transcription/parallel/dev/target_0.txt --validation-target-factors /scratch/amoryo/transcription/parallel/dev/target_1.txt /scratch/amoryo/transcription/parallel/dev/target_2.txt /scratch/amoryo/transcription/parallel/dev/target_3.txt /scratch/amoryo/transcription/parallel/dev/target_4.txt --optimized-metric signwriting-similarity --decode-and-evaluate 500 --checkpoint-interval 500 --max-num-checkpoint-not-improved 20 --output /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model
[INFO:sockeye.utils] Arguments: Namespace(config=None, source=None, source_factors=[], source_factors_use_source_vocab=[], target_factors=[], target_factors_use_target_vocab=[], target=None, end_of_prepending_tag=None, prepared_data='/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data', validation_source='/scratch/amoryo/transcription/parallel/dev/source.txt', validation_source_factors=[], validation_target='/scratch/amoryo/transcription/parallel/dev/target_0.txt', validation_target_factors=['/scratch/amoryo/transcription/parallel/dev/target_1.txt', '/scratch/amoryo/transcription/parallel/dev/target_2.txt', '/scratch/amoryo/transcription/parallel/dev/target_3.txt', '/scratch/amoryo/transcription/parallel/dev/target_4.txt'], no_bucketing=False, bucket_width=8, bucket_scaling=False, max_seq_len=(95, 95), source_vocab=None, target_vocab=None, source_factor_vocabs=[], target_factor_vocabs=[], shared_vocab=False, num_words=(0, 0), word_min_count=(1, 1), pad_vocab_to_multiple_of=8, output='/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model', overwrite_output=False, params=None, allow_missing_params=False, ignore_extra_params=False, encoder='transformer', decoder='transformer', num_layers=(4, 4), transformer_model_size=(512, 512), transformer_attention_heads=(8, 8), transformer_feed_forward_num_hidden=(2048, 2048), transformer_feed_forward_use_glu=False, transformer_activation_type=('relu', 'relu'), transformer_positional_embedding_type='fixed', transformer_block_prepended_cross_attention=False, transformer_preprocess=('n', 'n'), transformer_postprocess=('dr', 'dr'), lhuc=None, num_embed=(None, None), source_factors_num_embed=[], target_factors_num_embed=[], source_factors_combine=[], target_factors_combine=['sum', 'sum', 'sum', 'sum'], source_factors_share_embedding=[], target_factors_share_embedding=[False, False, False, False], weight_tying_type='none', dtype='float32', clamp_to_dtype=False, amp=False, apex_amp=False, 
neural_vocab_selection=None, neural_vocab_selection_block_loss=False, batch_size=1028, batch_type='word', batch_sentences_multiple_of=8, update_interval=1, label_smoothing=0.1, label_smoothing_impl='mxnet', length_task=None, length_task_weight=1.0, length_task_layers=1, bow_task_weight=1.0, bow_task_pos_weight=10, target_factors_weight=[1.0], optimized_metric='signwriting-similarity', checkpoint_interval=500, min_samples=None, max_samples=None, min_updates=None, max_updates=None, max_seconds=None, max_checkpoints=None, max_num_checkpoint_not_improved=20, checkpoint_improvement_threshold=0.0, min_num_epochs=None, max_num_epochs=None, embed_dropout=(0.0, 0.0), transformer_dropout_attention=(0.1, 0.1), transformer_dropout_act=(0.1, 0.1), transformer_dropout_prepost=(0.1, 0.1), optimizer='adam', optimizer_betas=(0.9, 0.999), optimizer_eps=1e-08, dist=False, initial_learning_rate=0.0002, weight_decay=0.0, momentum=0.0, gradient_clipping_threshold=1.0, gradient_clipping_type='none', learning_rate_scheduler_type='plateau-reduce', learning_rate_reduce_factor=0.9, learning_rate_reduce_num_not_improved=8, learning_rate_warmup=0, no_reload_on_learning_rate_reduce=False, fixed_param_strategy=None, fixed_param_names=[], local_rank=None, deepspeed_fp16=False, deepspeed_bf16=False, decode_and_evaluate=500, stop_training_on_decoder_failure=False, seed=1, keep_last_params=-1, keep_initializations=False, cache_last_best_params=0, cache_strategy='best', cache_metric='perplexity', dry_run=False, device_id=0, use_cpu=False, env=None, tf32=True, quiet=False, quiet_secondary_workers=False, no_logfile=False, loglevel='INFO', loglevel_secondary_workers='INFO')
[INFO:__main__] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (96, 96)
[INFO:sockeye.utils] CUDA: allow tf32 (float32 but with 10 bits precision)
[INFO:__main__] Training Device: cuda:0
[INFO:sockeye.utils] Random seed: 1
[INFO:sockeye.utils] PyTorch seed: 1
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.data_io] Creating training data iterator
[INFO:sockeye.data_io] ===============================
[INFO:sockeye.vocab] Vocabulary (1008 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary (416 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.trg.0.json"
[INFO:sockeye.vocab] Vocabulary (16 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.trg.1.json"
[INFO:sockeye.vocab] Vocabulary (24 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.trg.2.json"
[INFO:sockeye.vocab] Vocabulary (272 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.trg.3.json"
[INFO:sockeye.vocab] Vocabulary (328 words) loaded from "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/vocab.trg.4.json"
[INFO:sockeye.data_io] Tokens: source 4492534 target 91511
[INFO:sockeye.data_io] Number of <unk> tokens: source 0 target 0
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 9482 sequences across 129 buckets
[INFO:sockeye.data_io] 20 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (176, 176): 3 samples in 1 batches of 192, ~1024.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (184, 184): 6 samples in 1 batches of 192, ~1024.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (192, 192): 11 samples in 1 batches of 184, ~1037.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (200, 200): 24 samples in 1 batches of 176, ~1012.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (208, 208): 44 samples in 1 batches of 160, ~1014.5 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (216, 216): 50 samples in 1 batches of 168, ~1018.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (224, 224): 97 samples in 1 batches of 152, ~1024.8 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (232, 232): 157 samples in 2 batches of 152, ~1032.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (240, 240): 131 samples in 1 batches of 152, ~1004.8 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (248, 248): 231 samples in 2 batches of 152, ~1031.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (256, 256): 243 samples in 2 batches of 152, ~1030.2 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (264, 264): 284 samples in 2 batches of 152, ~1039.4 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (272, 272): 304 samples in 3 batches of 144, ~1006.6 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (280, 280): 348 samples in 3 batches of 144, ~1017.9 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (288, 288): 381 samples in 3 batches of 144, ~1035.6 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (296, 296): 311 samples in 3 batches of 144, ~1056.6 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (304, 304): 339 samples in 3 batches of 136, ~1009.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (312, 312): 235 samples in 2 batches of 136, ~1009.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (320, 320): 190 samples in 2 batches of 136, ~1038.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (328, 328): 166 samples in 2 batches of 136, ~1052.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (336, 336): 142 samples in 2 batches of 128, ~1007.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (344, 344): 106 samples in 1 batches of 120, ~1008.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (352, 352): 99 samples in 1 batches of 120, ~1023.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (360, 360): 62 samples in 1 batches of 112, ~1015.2 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (368, 368): 72 samples in 1 batches of 104, ~1053.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (376, 376): 46 samples in 1 batches of 112, ~998.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (384, 384): 40 samples in 1 batches of 112, ~1064.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (392, 392): 40 samples in 1 batches of 112, ~1058.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (400, 400): 39 samples in 1 batches of 104, ~1021.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (408, 408): 37 samples in 1 batches of 112, ~1001.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (416, 416): 33 samples in 1 batches of 104, ~1055.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (424, 424): 32 samples in 1 batches of 104, ~1007.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (432, 432): 29 samples in 1 batches of 96, ~996.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (440, 440): 46 samples in 1 batches of 104, ~1021.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (448, 448): 48 samples in 1 batches of 112, ~1010.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (456, 456): 52 samples in 1 batches of 104, ~1018.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (464, 464): 59 samples in 1 batches of 104, ~1015.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (472, 472): 68 samples in 1 batches of 104, ~1017.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (480, 480): 80 samples in 1 batches of 104, ~1033.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (488, 488): 80 samples in 1 batches of 104, ~1019.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (496, 496): 88 samples in 1 batches of 104, ~1021.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (504, 504): 131 samples in 2 batches of 104, ~1040.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (512, 512): 106 samples in 2 batches of 104, ~1047.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (520, 520): 114 samples in 2 batches of 104, ~1003.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (528, 528): 127 samples in 2 batches of 96, ~1012.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (536, 536): 114 samples in 2 batches of 96, ~991.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (544, 544): 142 samples in 2 batches of 104, ~1025.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (552, 552): 141 samples in 2 batches of 96, ~1028.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (560, 560): 137 samples in 2 batches of 104, ~1033.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (568, 568): 136 samples in 2 batches of 104, ~1065.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (576, 576): 148 samples in 2 batches of 88, ~987.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (584, 584): 177 samples in 2 batches of 96, ~1062.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (592, 592): 126 samples in 2 batches of 88, ~1045.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (600, 600): 143 samples in 2 batches of 88, ~1004.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (608, 608): 163 samples in 2 batches of 88, ~997.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (616, 616): 125 samples in 2 batches of 96, ~1058.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (624, 624): 130 samples in 2 batches of 88, ~1033.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (632, 632): 129 samples in 2 batches of 88, ~988.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (640, 640): 123 samples in 2 batches of 88, ~1051.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (648, 648): 139 samples in 2 batches of 88, ~1022.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (656, 656): 120 samples in 2 batches of 88, ~1036.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (664, 664): 143 samples in 2 batches of 88, ~993.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (672, 672): 112 samples in 2 batches of 88, ~995.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (680, 680): 128 samples in 2 batches of 80, ~988.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (688, 688): 99 samples in 2 batches of 88, ~1034.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (696, 696): 116 samples in 2 batches of 88, ~1068.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (704, 704): 101 samples in 2 batches of 88, ~1069.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (712, 712): 94 samples in 2 batches of 88, ~1036.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (720, 720): 84 samples in 2 batches of 80, ~990.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (728, 728): 93 samples in 2 batches of 80, ~1005.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (736, 736): 62 samples in 1 batches of 80, ~1000.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (744, 744): 72 samples in 1 batches of 80, ~1056.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (752, 752): 86 samples in 2 batches of 80, ~1033.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (760, 760): 89 samples in 2 batches of 72, ~977.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (768, 768): 67 samples in 1 batches of 80, ~1037.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (776, 776): 65 samples in 1 batches of 72, ~995.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (784, 784): 47 samples in 1 batches of 72, ~983.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (792, 792): 52 samples in 1 batches of 72, ~991.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (800, 800): 41 samples in 1 batches of 80, ~1067.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (808, 808): 41 samples in 1 batches of 72, ~995.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (816, 816): 45 samples in 1 batches of 80, ~1050.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (824, 824): 36 samples in 1 batches of 80, ~1057.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (832, 832): 35 samples in 1 batches of 72, ~979.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (840, 840): 21 samples in 1 batches of 72, ~1035.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (848, 848): 29 samples in 1 batches of 72, ~975.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (856, 856): 11 samples in 1 batches of 72, ~1040.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (864, 864): 22 samples in 1 batches of 80, ~1040.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (872, 872): 29 samples in 1 batches of 64, ~1050.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (880, 880): 15 samples in 1 batches of 72, ~1075.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (888, 888): 21 samples in 1 batches of 72, ~1076.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (896, 896): 13 samples in 1 batches of 56, ~964.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (904, 904): 6 samples in 1 batches of 64, ~992.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (912, 912): 10 samples in 1 batches of 64, ~1043.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (920, 920): 9 samples in 1 batches of 64, ~1038.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (928, 928): 10 samples in 1 batches of 72, ~1065.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (936, 936): 5 samples in 1 batches of 80, ~1056.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.01)
[INFO:sockeye.data_io] Bucket (944, 944): 2 samples in 1 batches of 112, ~1064.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (952, 952): 7 samples in 1 batches of 72, ~1028.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (960, 960): 10 samples in 1 batches of 64, ~1011.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (968, 968): 5 samples in 1 batches of 56, ~1008.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (976, 976): 4 samples in 1 batches of 48, ~1008.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (984, 984): 6 samples in 1 batches of 56, ~961.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (992, 992): 2 samples in 1 batches of 104, ~1040.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (1000, 1000): 1 samples in 1 batches of 72, ~1008.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (1008, 1008): 4 samples in 1 batches of 72, ~1044.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (1016, 1016): 6 samples in 1 batches of 56, ~998.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (1024, 1024): 1 samples in 1 batches of 104, ~1040.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (1025, 1025): 1 samples in 1 batches of 72, ~1080.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Loading shard /shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/train_data/shard.00000.
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] Creating validation data iterator
[INFO:sockeye.data_io] =================================
[INFO:sockeye.data_io] 210 sequences of maximum length (1025, 1025) in '/scratch/amoryo/transcription/parallel/dev/source.txt' and '/scratch/amoryo/transcription/parallel/dev/target_0.txt'.
[INFO:sockeye.data_io] Mean training target/source length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Tokens: source 103986 target 2077
[INFO:sockeye.data_io] Number of <unk> tokens: source 0 target 1
[INFO:sockeye.data_io] Vocabulary coverage: source 100% target 100%
[INFO:sockeye.data_io] 210 sequences across 129 buckets
[INFO:sockeye.data_io] 0 sequences did not fit into buckets and were discarded
[INFO:sockeye.data_io] Bucket (168, 168): 1 samples in 1 batches of 8, ~1344.0 target tokens/batch, trg/src length ratio: 0.04 (+-0.00)
[INFO:sockeye.data_io] Bucket (216, 216): 1 samples in 1 batches of 168, ~1018.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (224, 224): 1 samples in 1 batches of 152, ~1024.8 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (232, 232): 2 samples in 1 batches of 152, ~1032.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (240, 240): 4 samples in 1 batches of 152, ~1004.8 target tokens/batch, trg/src length ratio: 0.04 (+-0.01)
[INFO:sockeye.data_io] Bucket (248, 248): 3 samples in 1 batches of 152, ~1031.1 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (256, 256): 5 samples in 1 batches of 152, ~1030.2 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (264, 264): 5 samples in 1 batches of 152, ~1039.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (272, 272): 6 samples in 1 batches of 144, ~1006.6 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (280, 280): 8 samples in 1 batches of 144, ~1017.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (288, 288): 8 samples in 1 batches of 144, ~1035.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (296, 296): 6 samples in 1 batches of 144, ~1056.6 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (304, 304): 5 samples in 1 batches of 136, ~1009.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (312, 312): 4 samples in 1 batches of 136, ~1009.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (320, 320): 4 samples in 1 batches of 136, ~1038.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (328, 328): 5 samples in 1 batches of 136, ~1052.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (336, 336): 1 samples in 1 batches of 128, ~1007.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (344, 344): 3 samples in 1 batches of 120, ~1008.7 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (352, 352): 5 samples in 1 batches of 120, ~1023.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (360, 360): 1 samples in 1 batches of 112, ~1015.2 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (376, 376): 4 samples in 1 batches of 112, ~998.3 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (384, 384): 2 samples in 1 batches of 112, ~1064.0 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (392, 392): 4 samples in 1 batches of 112, ~1058.4 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (400, 400): 1 samples in 1 batches of 104, ~1021.3 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (408, 408): 1 samples in 1 batches of 112, ~1001.9 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (424, 424): 1 samples in 1 batches of 104, ~1007.5 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (432, 432): 1 samples in 1 batches of 96, ~996.4 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (440, 440): 1 samples in 1 batches of 104, ~1021.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (448, 448): 1 samples in 1 batches of 112, ~1010.3 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (456, 456): 2 samples in 1 batches of 104, ~1018.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (464, 464): 1 samples in 1 batches of 104, ~1015.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (472, 472): 1 samples in 1 batches of 104, ~1017.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (480, 480): 1 samples in 1 batches of 104, ~1033.5 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (488, 488): 3 samples in 1 batches of 104, ~1019.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (496, 496): 2 samples in 1 batches of 104, ~1021.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (504, 504): 2 samples in 1 batches of 104, ~1040.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (512, 512): 1 samples in 1 batches of 104, ~1047.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (520, 520): 1 samples in 1 batches of 104, ~1003.5 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (528, 528): 3 samples in 1 batches of 96, ~1012.2 target tokens/batch, trg/src length ratio: 0.03 (+-0.01)
[INFO:sockeye.data_io] Bucket (536, 536): 2 samples in 1 batches of 96, ~991.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (544, 544): 3 samples in 1 batches of 104, ~1025.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (552, 552): 2 samples in 1 batches of 96, ~1028.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (560, 560): 4 samples in 1 batches of 104, ~1033.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (568, 568): 5 samples in 1 batches of 104, ~1065.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (576, 576): 1 samples in 1 batches of 88, ~987.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (584, 584): 6 samples in 1 batches of 96, ~1062.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (592, 592): 3 samples in 1 batches of 88, ~1045.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (600, 600): 3 samples in 1 batches of 88, ~1004.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (608, 608): 4 samples in 1 batches of 88, ~997.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (616, 616): 2 samples in 1 batches of 96, ~1058.3 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (624, 624): 3 samples in 1 batches of 88, ~1033.7 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (632, 632): 4 samples in 1 batches of 88, ~988.5 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (640, 640): 2 samples in 1 batches of 88, ~1051.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (648, 648): 3 samples in 1 batches of 88, ~1022.4 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (656, 656): 2 samples in 1 batches of 88, ~1036.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (664, 664): 2 samples in 1 batches of 88, ~993.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (680, 680): 3 samples in 1 batches of 80, ~988.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (688, 688): 2 samples in 1 batches of 88, ~1034.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (696, 696): 4 samples in 1 batches of 88, ~1068.1 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (704, 704): 3 samples in 1 batches of 88, ~1069.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (712, 712): 3 samples in 1 batches of 88, ~1036.3 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (728, 728): 1 samples in 1 batches of 80, ~1005.6 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (736, 736): 4 samples in 1 batches of 80, ~1000.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (744, 744): 1 samples in 1 batches of 80, ~1056.7 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (752, 752): 4 samples in 1 batches of 80, ~1033.5 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (760, 760): 2 samples in 1 batches of 72, ~977.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (768, 768): 4 samples in 1 batches of 80, ~1037.6 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (776, 776): 1 samples in 1 batches of 72, ~995.8 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (784, 784): 1 samples in 1 batches of 72, ~983.5 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (792, 792): 1 samples in 1 batches of 72, ~991.4 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (800, 800): 1 samples in 1 batches of 80, ~1067.3 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (824, 824): 1 samples in 1 batches of 80, ~1057.8 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (832, 832): 2 samples in 1 batches of 72, ~979.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (856, 856): 1 samples in 1 batches of 72, ~1040.7 target tokens/batch, trg/src length ratio: 0.03 (+-0.00)
[INFO:sockeye.data_io] Bucket (872, 872): 1 samples in 1 batches of 64, ~1050.5 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (880, 880): 2 samples in 1 batches of 72, ~1075.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.01)
[INFO:sockeye.data_io] Bucket (896, 896): 1 samples in 1 batches of 56, ~964.9 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (920, 920): 1 samples in 1 batches of 64, ~1038.2 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (936, 936): 2 samples in 1 batches of 80, ~1056.0 target tokens/batch, trg/src length ratio: 0.01 (+-0.00)
[INFO:sockeye.data_io] Bucket (984, 984): 1 samples in 1 batches of 56, ~961.3 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Bucket (1000, 1000): 1 samples in 1 batches of 72, ~1008.0 target tokens/batch, trg/src length ratio: 0.02 (+-0.00)
[INFO:sockeye.data_io] Created bucketed parallel data set. Introduced padding: source=1.0% target=98.0%)
[INFO:__main__] Maximum source length determined by prepared data. Using 1025 instead of 96
[INFO:__main__] Maximum target length determined by prepared data. Using 1025 instead of 96
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.src.0.json"
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.trg.0.json"
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.trg.1.json"
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.trg.2.json"
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.trg.3.json"
[INFO:sockeye.vocab] Vocabulary saved to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/vocab.trg.4.json"
[INFO:__main__] Vocabulary sizes: source=[1008] target=[416|16|24|272|328]
[INFO:__main__] Source embedding size was not set it will automatically be adjusted to match the Transformer source model size (512).
[INFO:__main__] Target embedding size was not set it will automatically be adjusted to match the Transformer target model size (512).
[INFO:__main__] Setting all target factor embedding sizes to `num_embed` ('512')
[INFO:__main__] OptimizerConfig(name='adam', running_on_gpu=True, lr=0.0002, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, momentum=0.0, gradient_clipping_type='none', gradient_clipping_threshold=1.0, update_interval=1)
[INFO:__main__] Gradient accumulation over 1 batch(es) by 1 worker(s). Effective batch size: 1028
[INFO:sockeye.model] ModelConfig(config_data=DataConfig(data_statistics=DataStatistics(num_sents=9482, num_discarded=20, num_tokens_source=4492534, num_tokens_target=91511, num_unks_source=0, num_unks_target=0, max_observed_len_source=1025, max_observed_len_target=35, size_vocab_source=1008, size_vocab_target=416, length_ratio_mean=0.021893625473090785, length_ratio_std=0.008001311166046098, buckets=[(8, 8), (16, 16), (24, 24), (32, 32), (40, 40), (48, 48), (56, 56), (64, 64), (72, 72), (80, 80), (88, 88), (96, 96), (104, 104), (112, 112), (120, 120), (128, 128), (136, 136), (144, 144), (152, 152), (160, 160), (168, 168), (176, 176), (184, 184), (192, 192), (200, 200), (208, 208), (216, 216), (224, 224), (232, 232), (240, 240), (248, 248), (256, 256), (264, 264), (272, 272), (280, 280), (288, 288), (296, 296), (304, 304), (312, 312), (320, 320), (328, 328), (336, 336), (344, 344), (352, 352), (360, 360), (368, 368), (376, 376), (384, 384), (392, 392), (400, 400), (408, 408), (416, 416), (424, 424), (432, 432), (440, 440), (448, 448), (456, 456), (464, 464), (472, 472), (480, 480), (488, 488), (496, 496), (504, 504), (512, 512), (520, 520), (528, 528), (536, 536), (544, 544), (552, 552), (560, 560), (568, 568), (576, 576), (584, 584), (592, 592), (600, 600), (608, 608), (616, 616), (624, 624), (632, 632), (640, 640), (648, 648), (656, 656), (664, 664), (672, 672), (680, 680), (688, 688), (696, 696), (704, 704), (712, 712), (720, 720), (728, 728), (736, 736), (744, 744), (752, 752), (760, 760), (768, 768), (776, 776), (784, 784), (792, 792), (800, 800), (808, 808), (816, 816), (824, 824), (832, 832), (840, 840), (848, 848), (856, 856), (864, 864), (872, 872), (880, 880), (888, 888), (896, 896), (904, 904), (912, 912), (920, 920), (928, 928), (936, 936), (944, 944), (952, 952), (960, 960), (968, 968), (976, 976), (984, 984), (992, 992), (1000, 1000), (1008, 1008), (1016, 1016), (1024, 1024), (1025, 1025)], num_sents_per_bucket=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 3, 6, 11, 24, 44, 50, 97, 157, 131, 231, 243, 284, 304, 348, 381, 311, 339, 235, 190, 166, 142, 106, 99, 62, 72, 46, 40, 40, 39, 37, 33, 32, 29, 46, 48, 52, 59, 68, 80, 80, 88, 131, 106, 114, 127, 114, 142, 141, 137, 136, 148, 177, 126, 143, 163, 125, 130, 129, 123, 139, 120, 143, 112, 128, 99, 116, 101, 94, 84, 93, 62, 72, 86, 89, 67, 65, 47, 52, 41, 41, 45, 36, 35, 21, 29, 11, 22, 29, 15, 21, 13, 6, 10, 9, 10, 5, 2, 7, 10, 5, 4, 6, 2, 1, 4, 6, 1, 1], average_len_target_per_bucket=[None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 5.333333333333333, 5.333333333333334, 5.636363636363637, 5.75, 6.340909090909091, 6.0600000000000005, 6.742268041237113, 6.78980891719745, 6.610687022900765, 6.783549783549782, 6.777777777777778, 6.838028169014084, 6.990131578947372, 7.068965517241379, 7.191601049868762, 7.337620578778132, 7.418879056047195, 7.421276595744679, 7.636842105263154, 7.734939759036145, 7.873239436619716, 8.405660377358496, 8.525252525252531, 9.064516129032258, 10.125000000000004, 8.913043478260871, 9.5, 9.450000000000001, 9.820512820512818, 8.945945945945944, 10.151515151515154, 9.687499999999998, 10.379310344827585, 9.82608695652174, 9.020833333333332, 9.788461538461537, 9.762711864406777, 9.779411764705882, 9.937499999999998, 9.8, 9.81818181818182, 9.999999999999995, 10.075471698113208, 9.649122807017545, 10.543307086614172, 10.324561403508772, 9.859154929577462, 10.709219858156027, 9.941605839416063, 10.242647058823524, 11.216216216216218, 11.06779661016949, 11.88095238095238, 11.419580419580424, 11.331288343558281, 11.024000000000003, 11.746153846153847, 11.232558139534893, 11.951219512195125, 11.618705035971228, 11.783333333333331, 11.293706293706292, 11.312499999999998, 12.359374999999996, 11.757575757575756, 12.13793103448276, 12.15841584158416, 11.77659574468085, 12.380952380952378, 12.56989247311828, 12.5, 13.20833333333334, 12.918604651162795, 
13.573033707865175, 12.970149253731341, 13.830769230769231, 13.659574468085106, 13.769230769230766, 13.341463414634147, 13.82926829268293, 13.133333333333333, 13.222222222222221, 13.600000000000001, 14.380952380952383, 13.551724137931034, 14.454545454545455, 13.0, 16.41379310344827, 14.933333333333334, 14.95238095238095, 17.230769230769234, 15.5, 16.3, 16.22222222222222, 14.8, 13.2, 9.5, 14.285714285714285, 15.799999999999997, 18.0, 21.0, 17.166666666666668, 10.0, 14.0, 14.5, 17.833333333333332, 10.0, 15.0], length_ratio_stats_per_bucket=[(None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (None, None), (0.03155818540433925, 0.002789375862668828), (0.03004858965154873, 0.010713522189797191), (0.030466830466830463, 0.007018602878174791), (0.029385076668157067, 0.007490577176879539), (0.03136589778380823, 0.009774881286582243), (0.02874542309679448, 0.008731015415518701), (0.030787460158144254, 0.009132183518014423), (0.029919308232661774, 0.009241641556816232), (0.028064055175380645, 0.007581470711878169), (0.02792579876160196, 0.008849891753366111), (0.027017224871443674, 0.007728143693459796), (0.026396088413840977, 0.007714048273315171), (0.02616558624941393, 0.00716640153267519), (0.025705624859116708, 0.007558270587444818), (0.02541289381344657, 0.007832129830004466), (0.025243418844455526, 0.00796077209243033), (0.02480463748214313, 0.007999469185269052), (0.024167160513058436, 0.0070210019880264625), (0.024248633698617503, 0.00843360657213571), (0.0239434806082706, 0.0069385242522507565), (0.023810745117307894, 0.008259083105856728), (0.024834248705505418, 0.009452672429358518), (0.02457295852088428, 0.008395954733238337), (0.02555769634912359, 0.008421126194934937), (0.02786284083195252, 0.00907566242203976), (0.024035182652794015, 
0.010065977524509332), (0.02507362309154327, 0.01135396063266525), (0.024439288218208528, 0.008428155030378921), (0.024872884468708112, 0.007709245995001897), (0.022231874434700705, 0.006688320745447228), (0.024714845609130967, 0.007546056614798257), (0.023090435015408108, 0.005237443308604424), (0.024309186418719892, 0.007527009724737724), (0.02261032245938324, 0.005885791683940805), (0.0203517924023542, 0.005272381466622201), (0.021714354761302346, 0.007621951693880687), (0.02128376614438217, 0.005109757449722549), (0.020942522410986545, 0.007280239225033139), (0.02090912415067747, 0.005328360580920803), (0.02029448957284009, 0.006350960606465536), (0.01998870975587958, 0.005412158346319601), (0.020023724495653657, 0.00547067978569564), (0.0198786131346945, 0.0057326910571210525), (0.018732739072221602, 0.004819852632142076), (0.020160099949118646, 0.006079250551215679), (0.019433005493123285, 0.006210803255790242), (0.01829404955429871, 0.005531928345289418), (0.019578929079317877, 0.006380829652581713), (0.017911803671931823, 0.006202259576693886), (0.01819548177985879, 0.006009829897814049), (0.019642439852416024, 0.005862780568951808), (0.0191080771611631, 0.0066905203854310735), (0.020245287421291953, 0.006851499278575219), (0.019208630960476196, 0.006665470992119595), (0.018784811907818468, 0.006305040464709724), (0.018038460611223172, 0.006308932056650274), (0.018975210202044117, 0.005904501307209767), (0.017919404493412706, 0.0059235619605382565), (0.01882342437521856, 0.006838067878334744), (0.018082681480255706, 0.00526530106679493), (0.018097933680512126, 0.005604052152138559), (0.017134096569572082, 0.00604667863094754), (0.016966483312702, 0.005501871780554419), (0.018311580513189693, 0.006168102867649186), (0.017212241605308917, 0.006245410960264675), (0.017574704627404687, 0.005412331057715377), (0.017398371416950827, 0.005182476540281316), (0.01666403819226513, 0.005231327195660922), (0.017322207402427105, 0.00551649449549581), 
(0.017386246700371345, 0.004847962479602189), (0.017095222132213532, 0.005332590267736059), (0.017879334159774066, 0.005369322784540216), (0.017295231371395155, 0.005397760132031553), (0.017979310007285556, 0.005990114224346006), (0.01700879539844259, 0.005025165766049693), (0.017936014698329294, 0.005065458363005738), (0.017542865010650444, 0.004657199412094616), (0.017503659144694726, 0.00547237104502398), (0.016794093095316646, 0.005587353401491216), (0.017209324581207285, 0.005347280491125719), (0.016196462747486973, 0.0045567100064617914), (0.016146102256273307, 0.004039317892285893), (0.01643939655037363, 0.004415944021680709), (0.017216786014627734, 0.0050210255368584675), (0.016072868236990143, 0.004976540199793936), (0.016984705442404106, 0.004641734389157963), (0.015137654748321567, 0.003550976889556747), (0.018948692970941372, 0.004460115088557372), (0.017058048303272767, 0.005314501071962786), (0.016949335766550413, 0.005176542365405662), (0.01932405533467159, 0.004368012394471667), (0.017231153625498076, 0.004921579397475034), (0.017961453603923927, 0.0036520827359816576), (0.017732474460148515, 0.006616382284357205), (0.016038970566657866, 0.00537634813679545), (0.01420051986889059, 0.005512423081595203), (0.010095642933049948, 0.0005313496280552597), (0.015070643642072213, 0.001922146485234748), (0.016543259420561587, 0.004998949753024343), (0.018685630792622105, 0.004554655505181765), (0.021608188902217457, 0.0036554436873190883), (0.017535320527066463, 0.003933219162198703), (0.010111223458038422, 0.0), (0.014042126379137413, 0.0), (0.01445966968355028, 0.001680069385991536), (0.017643609619381474, 0.006426306022980464), (0.00983284169124877, 0.0), (0.014634146341463415, 0.0)]), max_seq_len_source=1025, max_seq_len_target=1025, num_source_factors=1, num_target_factors=5, eop_id=-1), vocab_source_size=1008, vocab_target_size=416, config_embed_source=EmbeddingConfig(vocab_size=1008, num_embed=512, dropout=0.0, num_factors=1, factor_configs=None, 
allow_sparse_grad=False), config_embed_target=EmbeddingConfig(vocab_size=416, num_embed=512, dropout=0.0, num_factors=5, factor_configs=[FactorConfig(vocab_size=16, num_embed=512, combine='sum', share_embedding=False), FactorConfig(vocab_size=24, num_embed=512, combine='sum', share_embedding=False), FactorConfig(vocab_size=272, num_embed=512, combine='sum', share_embedding=False), FactorConfig(vocab_size=328, num_embed=512, combine='sum', share_embedding=False)], allow_sparse_grad=False), config_encoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=4, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=1025, max_seq_len_target=1025, decoder_type='transformer', block_prepended_cross_attention=False, use_lhuc=False, depth_key_value=512, use_glu=False), config_decoder=TransformerConfig(model_size=512, attention_heads=8, feed_forward_num_hidden=2048, act_type='relu', num_layers=4, dropout_attention=0.1, dropout_act=0.1, dropout_prepost=0.1, positional_embedding_type='fixed', preprocess_sequence='n', postprocess_sequence='dr', max_seq_len_source=1025, max_seq_len_target=1025, decoder_type='transformer', block_prepended_cross_attention=False, use_lhuc=False, depth_key_value=512, use_glu=False), config_length_task=None, weight_tying_type='none', lhuc=False, dtype='float32', neural_vocab_selection=None, neural_vocab_selection_block_loss=False)
[INFO:sockeye.utils] # of parameters: 32051232 | trainable: 31001632 (96.73%) | shared: 213408 (0.67%) | fixed: 1049600 (3.27%)
[INFO:sockeye.utils] Trainable parameters: 
['embedding_source.embedding [(1008, 512), float32]',
 'embedding_target.embedding [(416, 512), float32]',
 'embedding_target.factor_embeds.0 [(16, 512), float32]',
 'embedding_target.factor_embeds.1 [(24, 512), float32]',
 'embedding_target.factor_embeds.2 [(272, 512), float32]',
 'embedding_target.factor_embeds.3 [(328, 512), float32]',
 'encoder.layers.0.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.0.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.0.self_attention.ff_out [(512, 512), float32]',
 'encoder.layers.0.self_attention.ff_in [(1536, 512), float32]',
 'encoder.layers.0.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.0.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.0.ff.ff1 [(2048, 512), float32]',
 'encoder.layers.0.ff.ff1 [(2048,), float32]',
 'encoder.layers.0.ff.ff2 [(512, 2048), float32]',
 'encoder.layers.0.ff.ff2 [(512,), float32]',
 'encoder.layers.1.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.1.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.1.self_attention.ff_out [(512, 512), float32]',
 'encoder.layers.1.self_attention.ff_in [(1536, 512), float32]',
 'encoder.layers.1.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.1.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.1.ff.ff1 [(2048, 512), float32]',
 'encoder.layers.1.ff.ff1 [(2048,), float32]',
 'encoder.layers.1.ff.ff2 [(512, 2048), float32]',
 'encoder.layers.1.ff.ff2 [(512,), float32]',
 'encoder.layers.2.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.2.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.2.self_attention.ff_out [(512, 512), float32]',
 'encoder.layers.2.self_attention.ff_in [(1536, 512), float32]',
 'encoder.layers.2.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.2.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.2.ff.ff1 [(2048, 512), float32]',
 'encoder.layers.2.ff.ff1 [(2048,), float32]',
 'encoder.layers.2.ff.ff2 [(512, 2048), float32]',
 'encoder.layers.2.ff.ff2 [(512,), float32]',
 'encoder.layers.3.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.3.pre_self_attention.layer_norm [(512,), float32]',
 'encoder.layers.3.self_attention.ff_out [(512, 512), float32]',
 'encoder.layers.3.self_attention.ff_in [(1536, 512), float32]',
 'encoder.layers.3.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.3.pre_ff.layer_norm [(512,), float32]',
 'encoder.layers.3.ff.ff1 [(2048, 512), float32]',
 'encoder.layers.3.ff.ff1 [(2048,), float32]',
 'encoder.layers.3.ff.ff2 [(512, 2048), float32]',
 'encoder.layers.3.ff.ff2 [(512,), float32]',
 'encoder.final_process.layer_norm [(512,), float32]',
 'encoder.final_process.layer_norm [(512,), float32]',
 'decoder.layers.0.autoregr_layer.ff_out [(512, 512), float32]',
 'decoder.layers.0.autoregr_layer.ff_in [(1536, 512), float32]',
 'decoder.layers.0.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.0.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.0.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.0.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.0.enc_attention.ff_out [(512, 512), float32]',
 'decoder.layers.0.enc_attention.ff_q [(512, 512), float32]',
 'decoder.layers.0.enc_attention.ff_kv [(1024, 512), float32]',
 'decoder.layers.0.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.0.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.0.ff.ff1 [(2048, 512), float32]',
 'decoder.layers.0.ff.ff1 [(2048,), float32]',
 'decoder.layers.0.ff.ff2 [(512, 2048), float32]',
 'decoder.layers.0.ff.ff2 [(512,), float32]',
 'decoder.layers.1.autoregr_layer.ff_out [(512, 512), float32]',
 'decoder.layers.1.autoregr_layer.ff_in [(1536, 512), float32]',
 'decoder.layers.1.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.1.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.1.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.1.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.1.enc_attention.ff_out [(512, 512), float32]',
 'decoder.layers.1.enc_attention.ff_q [(512, 512), float32]',
 'decoder.layers.1.enc_attention.ff_kv [(1024, 512), float32]',
 'decoder.layers.1.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.1.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.1.ff.ff1 [(2048, 512), float32]',
 'decoder.layers.1.ff.ff1 [(2048,), float32]',
 'decoder.layers.1.ff.ff2 [(512, 2048), float32]',
 'decoder.layers.1.ff.ff2 [(512,), float32]',
 'decoder.layers.2.autoregr_layer.ff_out [(512, 512), float32]',
 'decoder.layers.2.autoregr_layer.ff_in [(1536, 512), float32]',
 'decoder.layers.2.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.2.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.2.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.2.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.2.enc_attention.ff_out [(512, 512), float32]',
 'decoder.layers.2.enc_attention.ff_q [(512, 512), float32]',
 'decoder.layers.2.enc_attention.ff_kv [(1024, 512), float32]',
 'decoder.layers.2.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.2.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.2.ff.ff1 [(2048, 512), float32]',
 'decoder.layers.2.ff.ff1 [(2048,), float32]',
 'decoder.layers.2.ff.ff2 [(512, 2048), float32]',
 'decoder.layers.2.ff.ff2 [(512,), float32]',
 'decoder.layers.3.autoregr_layer.ff_out [(512, 512), float32]',
 'decoder.layers.3.autoregr_layer.ff_in [(1536, 512), float32]',
 'decoder.layers.3.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.3.pre_autoregr_layer.layer_norm [(512,), float32]',
 'decoder.layers.3.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.3.pre_enc_attention.layer_norm [(512,), float32]',
 'decoder.layers.3.enc_attention.ff_out [(512, 512), float32]',
 'decoder.layers.3.enc_attention.ff_q [(512, 512), float32]',
 'decoder.layers.3.enc_attention.ff_kv [(1024, 512), float32]',
 'decoder.layers.3.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.3.pre_ff.layer_norm [(512,), float32]',
 'decoder.layers.3.ff.ff1 [(2048, 512), float32]',
 'decoder.layers.3.ff.ff1 [(2048,), float32]',
 'decoder.layers.3.ff.ff2 [(512, 2048), float32]',
 'decoder.layers.3.ff.ff2 [(512,), float32]',
 'decoder.final_process.layer_norm [(512,), float32]',
 'decoder.final_process.layer_norm [(512,), float32]',
 'output_layer [(416, 512), float32]',
 'output_layer [(416,), float32]',
 'output_layer_module_cached [(416, 512), float32]',
 'output_layer_module_cached [(416,), float32]',
 'output_layer_script_cached [(416, 512), float32]',
 'output_layer_script_cached [(416,), float32]',
 'factor_output_layers.0 [(16, 512), float32]',
 'factor_output_layers.0 [(16,), float32]',
 'factor_output_layers.1 [(24, 512), float32]',
 'factor_output_layers.1 [(24,), float32]',
 'factor_output_layers.2 [(272, 512), float32]',
 'factor_output_layers.2 [(272,), float32]',
 'factor_output_layers.3 [(328, 512), float32]',
 'factor_output_layers.3 [(328,), float32]']
[INFO:sockeye.utils] Shared parameters: 
['output_layer.weight = output_layer_module_cached.weight = output_layer_script_cached.weight',
 'output_layer.bias = output_layer_module_cached.bias = output_layer_script_cached.bias']
[INFO:sockeye.utils] Fixed parameters:
['encoder.pos_embedding [(1025, 512), float32]',
 'decoder.pos_embedding [(1025, 512), float32]']
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: perplexity (ppl) | output_name: 'logits' | label_name: 'target_label'
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: f1-perplexity (f1-ppl) | output_name: 'factor1_logits' | label_name: 'target_factor1_label'
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: f2-perplexity (f2-ppl) | output_name: 'factor2_logits' | label_name: 'target_factor2_label'
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: f3-perplexity (f3-ppl) | output_name: 'factor3_logits' | label_name: 'target_factor3_label'
[INFO:sockeye.loss] Loss: cross-entropy | weight=1.00 | metric: f4-perplexity (f4-ppl) | output_name: 'factor4_logits' | label_name: 'target_factor4_label'
[WARNING:sockeye.optimizers] Cannot import NVIDIA Apex optimizers (FusedAdam, FusedSGD). Consider installing Apex for faster GPU training: https://github.com/NVIDIA/apex
[INFO:sockeye.lr_scheduler] Will reduce the learning rate by a factor of 0.90 whenever the validation score doesn't improve 8 times.
[INFO:__main__] Tracing SockeyeModel on a validation batch
[INFO:sockeye.training] Logging training events for Tensorboard at '/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/tensorboard'
[INFO:sockeye.inference] Translator (1 model(s) beam_size=5 algorithm=BeamSearch, beam_search_stop=all max_input_length=1024 nbest_size=1 ensemble_mode=None max_batch_size=16 dtype=torch.float32 skip_nvs=False nvs_thresh=0.5)
[INFO:sockeye.checkpoint_decoder] Created CheckpointDecoder(max_input_len=-1, beam_size=5, num_sentences=210)
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/jit/_trace.py:976: TracerWarning: Encountering a list at the output of the tracer might cause the trace to be incorrect, this is only valid if the container structure does not change based on the module's inputs. Consider using a constant container instead (e.g. for `list`, use a `tuple` instead. for `dict`, use a `NamedTuple` instead). If you absolutely need this and know the side effects, pass strict=False to trace() to allow this behavior.
  module._c._create_method_from_trace(
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py:1194: UserWarning: FALLBACK path has been taken inside: runCudaFusionGroup. This is an indication that codegen Failed for some reason.
To debug try disable codegen fallback path via setting the env variable `export PYTORCH_NVFUSER_DISABLE=fallback`
 (Triggered internally at ../torch/csrc/jit/codegen/cuda/manager.cpp:331.)
  return forward_call(*input, **kwargs)
[INFO:sockeye.training] Early stopping by optimizing 'signwriting-similarity'
[INFO:sockeye.model] Saved model config to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/config"
[INFO:root] Saved params/state_dict to "/shares/volk.cl.uzh/amoryo/checkpoints/sockeye-vq/target-factors/model/params.00000"
[INFO:sockeye.training] Training started.
[ERROR:root] Uncaught exception
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py", line 1225, in <module>
    main()
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py", line 943, in main
    train(args)
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py", line 1207, in train
    training_state = trainer.fit(train_iter=train_iter, validation_iter=eval_iter,
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/training.py", line 275, in fit
    did_grad_step = self._step(batch=train_iter.next())
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/training.py", line 404, in _step
    loss_values, num_samples = self._forward_backward(batch, is_update_batch)
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/training.py", line 361, in _forward_backward
    sum_losses, loss_values, num_samples = self.model_object(batch.source, batch.source_length,
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/training.py", line 78, in forward
    model_outputs = self.model(source, source_length, target, target_length)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/linear.py(114): forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1194): _call_impl
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/transformer.py(344): forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1194): _call_impl
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/transformer.py(254): forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1194): _call_impl
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/decoder.py(297): forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/decoder.py(264): decode_seq
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/model.py(344): forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1182): _slow_forward
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/nn/modules/module.py(1194): _call_impl
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/jit/_trace.py(976): trace_module
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/jit/_trace.py(759): trace
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py(1153): train
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py(346): wrapper
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py(943): main
/data/amoryo/conda/envs/sockeye/lib/python3.11/site-packages/sockeye/train.py(1225): <module>
<frozen runpy>(88): _run_code
<frozen runpy>(198): _run_module_as_main
RuntimeError: CUDA out of memory. Tried to allocate 474.00 MiB (GPU 0; 79.15 GiB total capacity; 77.70 GiB already allocated; 341.62 MiB free; 78.24 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
@AmitMY (Author) commented Mar 22, 2024

Any ideas on this one? I'm still encountering it.

@mjdenkowski (Contributor) commented:

The only unusual things I notice about this training setup are the large number of buckets and the very long sequences (by machine translation standards). You could try increasing the bucket width and/or splitting examples into shorter sequences. A more typical scenario might have 8 buckets in total, covering lengths 1-128.
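As a sketch of the first suggestion: `--bucket-width` and `--max-seq-len` are existing `sockeye.train` options, but the values below are illustrative assumptions, not tuned recommendations for this dataset.

```shell
# Same data/model arguments as in the original command, plus coarser
# buckets and a hard length cap (illustrative values; tune to your data).
python -m sockeye.train \
    --bucket-width 16 \
    --max-seq-len 128 \
    ...
```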

@AmitMY (Author) commented Mar 23, 2024

Thanks! Could you please point me to an explanation of what "buckets" are? I have a vague sense, but an exact understanding might help here.

@mjdenkowski (Contributor) commented:

During training, examples are grouped into "buckets" of similar-length sequences. The default bucket width is 8, meaning that examples with source length <=8 and target length <=8 go into the first bucket, the remaining examples with lengths <=16 go into the second, and so on, up to the maximum sequence length. Bucketing trades off efficiency (grouping similar lengths minimizes padding) against variation (some length differences within batches, plus the ability to shuffle the entire bucket for each epoch).
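The assignment described above can be sketched in a few lines. This is a toy illustration, not Sockeye's actual `data_io` code; the width and maximum length are assumptions (Sockeye takes them from `--bucket-width` and the prepared data).

```python
# Minimal sketch of length-based bucketing, assuming bucket width 8 and
# a maximum sequence length of 32 (illustrative values only).

def bucket_for(src_len, trg_len, width=8, max_len=32):
    """Return the symmetric (source, target) bucket that fits both
    lengths, or None if the example exceeds the maximum length."""
    longest = max(src_len, trg_len)
    if longest > max_len:
        return None  # too long: discarded (or truncated) during preparation
    # Round up to the next multiple of `width`, capped at `max_len`
    # (which is why the log above ends with a (1025, 1025) bucket).
    upper = min(-(-longest // width) * width, max_len)
    return (upper, upper)

# Every example is padded up to its bucket's length, so a (9, 2) pair
# occupies 16 source and 16 target positions in its batch.
print(bucket_for(3, 5))    # -> (8, 8)
print(bucket_for(20, 17))  # -> (24, 24)
```

With very different source and target lengths, as in this issue (source up to 1025 tokens, target averaging ~10), almost every target sequence is padded out to its bucket's full length, which matches the "target=98.0%" padding reported in the log.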
