Issue training on multiple nodes #550

edwardsp · 2024-04-18T16:01:50Z

❓ The question

I am trying to run training and I get this error when starting up:

HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/glue/paths-info/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c
[2024-04-18 15:55:06] CRITICAL [olmo.util:158, rank=6] Uncaught HfHubHTTPError: 429 Client Error: Too Many Requests for url: https://huggingface.co/api/datasets/glue/paths-info/bcdcba79d07bc864c1c254ccfcedcce55bcc9a8c

I am running on 2 nodes each with 8 GPUs, using the main branch and pytorch 2.2.2+cu121.

This works with just 1 node using 8 GPUs.


xijiu9 · 2024-04-18T17:46:12Z

I have exactly the same problem. 1 node works, but 2 nodes fail. I think this is a problem on the Hugging Face side.

2015aroras · 2024-04-19T18:02:36Z

We run into issues like that too. We don't have a robust solution yet, but one workaround is to cache the datasets locally (once per file system, e.g. once per node if each node has its own local disk) as follows, and then stop HF from calling the Hub by setting the environment variable HF_DATASETS_OFFLINE=1.

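from olmo.eval.downstream import *

tokenizer = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
# Instantiate every downstream eval task once so its dataset lands in the local HF cache.
for x in label_to_task_map.values():
    kwargs = {}
    # Some entries are (task_class, kwargs) tuples rather than a bare task class.
    if isinstance(x, tuple):
        x, kwargs = x
    x(tokenizer=tokenizer, **kwargs)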
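
For the "once per node" part, here is a rough sketch of how the caching step could be limited to one process per node and how the offline flag could be passed to the training job. This is not part of OLMo: the helper names `is_node_leader` and `offline_env` are made up for illustration, and it assumes a torchrun-style launcher that exports `LOCAL_RANK`.

```python
import os


def is_node_leader(env=None):
    """Return True for the single process per node that should fill the cache.

    Assumption: the launcher (e.g. torchrun) exports LOCAL_RANK. A process
    without that variable is treated as the only process on its node.
    """
    env = os.environ if env is None else env
    return int(env.get("LOCAL_RANK", "0")) == 0


def offline_env(env=None):
    """Return a copy of the environment with HF_DATASETS_OFFLINE=1 set.

    The copy can be handed to the training process so that the datasets
    library uses the local cache instead of calling the Hub.
    """
    new_env = dict(os.environ if env is None else env)
    new_env["HF_DATASETS_OFFLINE"] = "1"
    return new_env
```

One way to use it: run the caching loop above only where `is_node_leader()` is true, then launch training with `offline_env()` as the process environment (or simply export `HF_DATASETS_OFFLINE=1` in the job script). The variable most likely needs to be set before the `datasets` library is imported, since I believe it is read at import time.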

edwardsp added the type/question label on Apr 18, 2024
