OLMoThreadError #552

Open
juripapay opened this issue Apr 19, 2024 · 3 comments
Labels: type/question (An issue that's a question)

Comments

@juripapay

❓ The question

Please advise where this error might come from:
[2024-04-18 19:06:17] INFO [olmo.train:816, rank=0] [step=75/739328]
train/CrossEntropyLoss=7.417
train/Perplexity=1,664
throughput/total_tokens=314,572,800
throughput/device/tokens_per_second=9,407
throughput/device/batches_per_second=0.0022
[2024-04-18 19:10:41] CRITICAL [olmo.util:158, rank=0] Uncaught OLMoThreadError: generator thread data thread 3 failed

@prakamya-mishra

@juripapay, can you give more details on the model size, batch size, GPU type (AMD/Nvidia), and whether you used flash attention? I'd like to know more about the setting in which you are getting a throughput of 9k tokens/GPU/sec.

@dumitrac
Contributor

@juripapay - is there a traceback logged after the last line you pasted?
I would expect it to log the traceback info, based on this.
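For reference, a minimal sketch (assumed, not the repo's actual code; only the OLMoThreadError name and the raise-from pattern are taken from the traceback below) of why the full log should contain the real failure: exception chaining attaches the worker's original exception as __cause__, and Python prints it above the "generator thread ... failed" message.

# Minimal sketch (assumption): exception chaining keeps the real failure
# attached to the wrapper error.
class OLMoThreadError(Exception):
    pass

def worker():
    raise FileNotFoundError("missing data shard")  # stand-in for the real cause

try:
    try:
        worker()
    except Exception as x:
        # Same pattern as the raise in the traceback: the original error becomes __cause__.
        raise OLMoThreadError("generator thread data thread 3 failed") from x
except OLMoThreadError as err:
    print(type(err.__cause__), err.__cause__)  # the underlying error to look for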

@lecifire

Hi, I encountered the same problem and would need some assistance on how to resolve it.

I tried training the OLMo 1B model.
I didn't change much in the config YAML:

global_train_batch_size: 2048
device_train_microbatch_size: 8

My GPUs were A100s on an Azure NC96ads cluster, using 2 nodes with 4 GPUs each, and I didn't use flash attention. (A quick sanity check of what those batch settings imply is sketched below.)
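As a side note, a quick sanity check of what those two settings imply on this topology; a sketch assuming the usual relationship global batch = micro-batch × gradient-accumulation × world size, not output from the OLMo trainer itself:

# Hypothetical sanity check of the settings above; the formula is an assumption.
global_train_batch_size = 2048
device_train_microbatch_size = 8
world_size = 2 * 4  # 2 nodes x 4 A100 GPUs each

device_train_batch_size = global_train_batch_size // world_size             # 256 instances per GPU per step
grad_accum_steps = device_train_batch_size // device_train_microbatch_size  # 32 micro-batches per step
print(device_train_batch_size, grad_accum_steps)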

Traceback (most recent call last):
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in
main(cfg)
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 272, in main
trainer.fit()
File "/workspace/OLMo/olmo/train.py", line 1053, in fit
for batch in self.train_loader:
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/data/iterable_dataset.py", line 177, in
return (x for x in roundrobin(*thread_generators))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/util.py", line 695, in roundrobin
yield next()
^^^^^^
File "/workspace/OLMo/olmo/util.py", line 679, in threaded_generator
raise OLMoThreadError(f"generator thread {thread_name} failed") from x
olmo.exceptions.OLMoThreadError: generator thread data thread 3 failed
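
To help surface what actually failed, here is a small self-contained sketch of the pattern the traceback goes through. The names threaded_generator, roundrobin, and OLMoThreadError come from olmo/util.py and olmo/exceptions.py in the traceback; the bodies below are simplified assumptions for illustration, not the repository's code. The point is that the real error happens inside a data-loading thread and only reaches the main thread re-wrapped as OLMoThreadError, so catching it and inspecting err.__cause__ (or checking the log for "The above exception was the direct cause of the following exception") should expose the underlying problem, e.g. an unreadable data file or a failed network read.

# Simplified, assumed reconstruction of the thread-generator + round-robin
# pattern from olmo/util.py, for illustration only.
import threading
from itertools import cycle, islice
from queue import Queue


class OLMoThreadError(Exception):
    pass


def threaded_generator(g, thread_name, maxsize=16):
    # Run generator `g` in a background thread and yield its items in the caller.
    q = Queue(maxsize=maxsize)
    sentinel = object()

    def fill():
        try:
            for item in g:
                q.put(item)
        except Exception as e:
            q.put(e)  # hand the worker's exception to the consumer
        finally:
            q.put(sentinel)

    threading.Thread(target=fill, name=thread_name, daemon=True).start()
    while (item := q.get()) is not sentinel:
        if isinstance(item, Exception):
            x = item
            # Mirrors the raise in the traceback: the original error becomes __cause__.
            raise OLMoThreadError(f"generator thread {thread_name} failed") from x
        yield item


def roundrobin(*iterables):
    # Standard itertools round-robin recipe: take one item from each iterable in turn.
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))


def bad_worker():
    # Stand-in for a data-loading thread that fails mid-stream.
    yield 1
    raise OSError("simulated unreadable data shard")


try:
    for _ in roundrobin(threaded_generator(bad_worker(), "data thread 3")):
        pass
except OLMoThreadError as err:
    print("underlying cause:", repr(err.__cause__))  # this is the error to chase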
