OLMoThreadError #552

Open
juripapay opened this issue Apr 19, 2024 · 3 comments
Labels: type/question (An issue that's a question)

Comments

@juripapay

❓ The question

Please advise where this error might come from:
[2024-04-18 19:06:17] INFO [olmo.train:816, rank=0] [step=75/739328]
train/CrossEntropyLoss=7.417
train/Perplexity=1,664
throughput/total_tokens=314,572,800
throughput/device/tokens_per_second=9,407
throughput/device/batches_per_second=0.0022
[2024-04-18 19:10:41] CRITICAL [olmo.util:158, rank=0] Uncaught OLMoThreadError: generator thread data thread 3 failed

@prakamya-mishra

@juripapay, can you give more details on the model size, batch size, GPU type (AMD/Nvidia), and whether you used flash attention? I'd like to know more about the setting in which you are getting a throughput of 9k tokens/GPU/sec.

@dumitrac
Contributor

@juripapay - is there a traceback logged after the last line you pasted?
I would expect it to log the traceback info, based on this.
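For reference, a minimal sketch (assumed, not the repo's actual code; only the OLMoThreadError name and the raise-from pattern are taken from the traceback below) of why the full log should contain the real failure: exception chaining attaches the worker's original exception as __cause__, and Python prints it above the "generator thread ... failed" message.

# Minimal sketch (assumption): exception chaining keeps the real failure
# attached to the wrapper error.
class OLMoThreadError(Exception):
    pass

def worker():
    raise FileNotFoundError("missing data shard")  # stand-in for the real cause

try:
    try:
        worker()
    except Exception as x:
        # Same pattern as the raise in the traceback: the original error becomes __cause__.
        raise OLMoThreadError("generator thread data thread 3 failed") from x
except OLMoThreadError as err:
    print(type(err.__cause__), err.__cause__)  # the underlying error to look for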

@lecifire

Hi, I encountered the same problem and would need some assistance on how to resolve it.

I tried training the OLMo 1B model.
I didn't change much in the config YAML:

global_train_batch_size: 2048
device_train_microbatch_size: 8

My GPUs were A100s on an Azure NC96ads cluster, using 2 nodes with 4 GPUs each, and I didn't use flash attention. (A quick sanity check of what those batch settings imply is sketched below.)
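As a side note, a quick sanity check of what those two settings imply on this topology; a sketch assuming the usual relationship global batch = micro-batch × gradient-accumulation × world size, not output from the OLMo trainer itself:

# Hypothetical sanity check of the settings above; the formula is an assumption.
global_train_batch_size = 2048
device_train_microbatch_size = 8
world_size = 2 * 4  # 2 nodes x 4 A100 GPUs each

device_train_batch_size = global_train_batch_size // world_size             # 256 instances per GPU per step
grad_accum_steps = device_train_batch_size // device_train_microbatch_size  # 32 micro-batches per step
print(device_train_batch_size, grad_accum_steps)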

Traceback (most recent call last):
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 300, in
main(cfg)
File "/mnt/azureml/cr/j/e83d0122ec494c7dbf7572c30a51c53b/exe/wd/scripts/train.py", line 272, in main
trainer.fit()
File "/workspace/OLMo/olmo/train.py", line 1053, in fit
for batch in self.train_loader:
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 631, in next
data = self._next_data()
^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/dataloader.py", line 675, in _next_data
data = self._dataset_fetcher.fetch(index) # may raise StopIteration
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/lib/python3.11/site-packages/torch/utils/data/_utils/fetch.py", line 32, in fetch
data.append(next(self.dataset_iter))
^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/data/iterable_dataset.py", line 177, in
return (x for x in roundrobin(*thread_generators))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/workspace/OLMo/olmo/util.py", line 695, in roundrobin
yield next()
^^^^^^
File "/workspace/OLMo/olmo/util.py", line 679, in threaded_generator
raise OLMoThreadError(f"generator thread {thread_name} failed") from x
olmo.exceptions.OLMoThreadError: generator thread data thread 3 failed
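
To help surface what actually failed, here is a small self-contained sketch of the pattern the traceback goes through. The names threaded_generator, roundrobin, and OLMoThreadError come from olmo/util.py and olmo/exceptions.py in the traceback; the bodies below are simplified assumptions for illustration, not the repository's code. The point is that the real error happens inside a data-loading thread and only reaches the main thread re-wrapped as OLMoThreadError, so catching it and inspecting err.__cause__ (or checking the log for "The above exception was the direct cause of the following exception") should expose the underlying problem, e.g. an unreadable data file or a failed network read.

# Simplified, assumed reconstruction of the thread-generator + round-robin
# pattern from olmo/util.py, for illustration only.
import threading
from itertools import cycle, islice
from queue import Queue


class OLMoThreadError(Exception):
    pass


def threaded_generator(g, thread_name, maxsize=16):
    # Run generator `g` in a background thread and yield its items in the caller.
    q = Queue(maxsize=maxsize)
    sentinel = object()

    def fill():
        try:
            for item in g:
                q.put(item)
        except Exception as e:
            q.put(e)  # hand the worker's exception to the consumer
        finally:
            q.put(sentinel)

    threading.Thread(target=fill, name=thread_name, daemon=True).start()
    while (item := q.get()) is not sentinel:
        if isinstance(item, Exception):
            x = item
            # Mirrors the raise in the traceback: the original error becomes __cause__.
            raise OLMoThreadError(f"generator thread {thread_name} failed") from x
        yield item


def roundrobin(*iterables):
    # Standard itertools round-robin recipe: take one item from each iterable in turn.
    num_active = len(iterables)
    nexts = cycle(iter(it).__next__ for it in iterables)
    while num_active:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            num_active -= 1
            nexts = cycle(islice(nexts, num_active))


def bad_worker():
    # Stand-in for a data-loading thread that fails mid-stream.
    yield 1
    raise OSError("simulated unreadable data shard")


try:
    for _ in roundrobin(threaded_generator(bad_worker(), "data thread 3")):
        pass
except OLMoThreadError as err:
    print("underlying cause:", repr(err.__cause__))  # this is the error to chase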
