
sockeye.utils.SockeyeError: Target sequences are not token-parallel: #1098

Open
jingshu-liu opened this issue Oct 1, 2023 · 1 comment

@jingshu-liu

Hi, I got an error during sockeye.prepare_data saying that the target sequences are not token-parallel (i.e., they do not have the same length): [[2, 1960], [2, 4, 4, 4, 4]]. For reference, 2 is the <s> token and 4 is the , token. Here's the log:

[2023-09-28:15:10:02:INFO:sockeye.utils:log_sockeye_version] Sockeye: 3.1.29, commit 4dba5a39b3bde, path /home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/__init__.py
[2023-09-28:15:10:02:INFO:sockeye.utils:log_torch_version] PyTorch: 1.10.0 (/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/torch/__init__.py)
[2023-09-28:15:10:02:INFO:sockeye.utils:log_basic_info] Command: /home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py -s /data/mlmt//final.src.mask -t /data/mlmt//final.tgt --source-factors /data/mlmt//final.src.cf --target-factors /data/mlmt//final.tgt.cf --shared-vocab --num-words 120000 --word-min-count 2 --max-seq-len 200 --num-samples-per-shard 15000000 --max-processes 4 -o /home/share/research/mlmd/data_bin
[2023-09-28:15:10:02:INFO:sockeye.utils:log_basic_info] Arguments: Namespace(bucket_scaling=False, bucket_width=8, config=None, loglevel='INFO', loglevel_secondary_workers='INFO', max_processes=4, max_seq_len=(200, 200), min_num_shards=1, no_bucketing=False, no_logfile=False, num_samples_per_shard=15000000, num_words=(120000, 120000), output='/home/share/research/mlmd/data_bin', pad_vocab_to_multiple_of=8, quiet=False, quiet_secondary_workers=False, seed=13, shared_vocab=True, source='/data/mlmt//final.src.mask', source_factor_vocabs=[], source_factors=['/data/mlmt//final.src.cf'], source_factors_use_source_vocab=[], source_vocab=None, target='/data/mlmt//final.tgt', target_factor_vocabs=[], target_factors=['/data/mlmt//final.tgt.cf'], target_factors_use_target_vocab=[], target_vocab=None, word_min_count=(2, 2))
[2023-09-28:15:10:02:INFO:sockeye.utils:seed_rngs] Random seed: 13
[2023-09-28:15:10:02:INFO:sockeye.utils:seed_rngs] PyTorch seed: 13
[2023-09-28:15:10:02:INFO:__main__:prepare_data] Adjusting maximum length to reserve space for a BOS/EOS marker. New maximum length: (201, 201)
[2023-09-28:15:40:05:INFO:__main__:prepare_data] 1997912086 samples will be split into 134 shard(s) (requested samples/shard=15000000, min_num_shards=1).
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] =============================
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] Loading/creating vocabularies
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] =============================
[2023-09-28:22:37:39:INFO:sockeye.vocab:load_or_create_vocabs] (1) Surface form vocabularies (source & target)
[2023-09-28:22:37:39:INFO:sockeye.vocab:build_from_paths] Building vocabulary from dataset(s):

...

Traceback (most recent call last):
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 121, in
main()
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 32, in main
prepare_data(args)
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/prepare_data.py", line 117, in prepare_data
keep_tmp_shard_files=keep_tmp_shard_files)
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/site-packages/sockeye/data_io.py", line 609, in prepare_data
length_stats = pool.starmap(analyze_sequence_lengths, stats_args)
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/multiprocessing/pool.py", line 274, in starmap
return self._map_async(func, iterable, starmapstar, chunksize).get()
File "/home/jingshu.liu/anaconda3/envs/dev_jingshu/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
sockeye.utils.SockeyeError: Target sequences are not token-parallel: [[2, 1960], [2, 4, 4, 4, 4]]

@mjdenkowski (Contributor)

It looks like the error message is reporting that the files for target sequences (factor 0) and target factor sequences (factors 1+) are not token parallel. If you run word counts for each pair of lines from /data/mlmt//final.tgt and /data/mlmt//final.tgt.cf, are the lengths the same?

assert len(t_line.split()) == len(tf_line.split()), (t_line, tf_line)
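The assertion above can be run over the two files to locate the offending lines. The following is a minimal sketch of that check (the helper name `find_mismatches` is hypothetical; the file paths are taken verbatim from the log above):

```python
def find_mismatches(target_lines, factor_lines):
    """Return (line_number, n_target_tokens, n_factor_tokens) for each
    pair of lines whose whitespace token counts differ."""
    mismatches = []
    for i, (t_line, tf_line) in enumerate(zip(target_lines, factor_lines), start=1):
        n_t, n_tf = len(t_line.split()), len(tf_line.split())
        if n_t != n_tf:
            mismatches.append((i, n_t, n_tf))
    return mismatches

if __name__ == "__main__":
    # Compare the target file against its factor file line by line.
    with open("/data/mlmt//final.tgt") as t, open("/data/mlmt//final.tgt.cf") as tf:
        for line_no, n_t, n_tf in find_mismatches(t, tf):
            print(f"line {line_no}: {n_t} target tokens vs {n_tf} factor tokens")
```

Any line it prints is a place where the factor file and the target file fell out of sync, which is exactly what the SockeyeError is reporting.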
