eval.py error while benchmarking T5 #460

Open
sigjhl opened this issue Jul 14, 2023 · 1 comment
Labels: bug (Something isn't working)

sigjhl commented Jul 14, 2023

Console

[Eval batch=1/1289] Eval on lambada_openai/0-shot data
[Eval batch=130/1289] Eval on lambada_openai/0-shot data
[Eval batch=259/1289] Eval on lambada_openai/0-shot data
[Eval batch=387/1289] Eval on lambada_openai/0-shot data
[Eval batch=516/1289] Eval on lambada_openai/0-shot data
[Eval batch=645/1289] Eval on lambada_openai/0-shot data
[Eval batch=774/1289] Eval on lambada_openai/0-shot data
[Eval batch=903/1289] Eval on lambada_openai/0-shot data
[Eval batch=1031/1289] Eval on lambada_openai/0-shot data
[Eval batch=1160/1289] Eval on lambada_openai/0-shot data
/home/codeless/Desktop/llm-foundry/mosaic/lib/python3.10/site-packages/composer/core/data_spec.py:35: UserWarning: Cannot split tensor of length 1 into batches of size 4. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
warnings.warn(f'Cannot split tensor of length {len(t)} into batches of size {microbatch_size}. '
/home/codeless/Desktop/llm-foundry/mosaic/lib/python3.10/site-packages/composer/core/data_spec.py:26: UserWarning: Cannot split list of length 1 into batches of size 4. As it is smaller, no splitting will be done. This may happen on the last batch of a dataset if it is a smaller size than the microbatch size.
warnings.warn(f'Cannot split list of length {len(l)} into batches of size {microbatch_size}. '
[Eval batch=1289/1289] Eval on lambada_openai/0-shot data
[Eval batch=1/919] Eval on piqa/10-shot data
[Eval batch=93/919] Eval on piqa/10-shot data
[Eval batch=185/919] Eval on piqa/10-shot data
[Eval batch=276/919] Eval on piqa/10-shot data
[Eval batch=368/919] Eval on piqa/10-shot data
[Eval batch=460/919] Eval on piqa/10-shot data
[Eval batch=552/919] Eval on piqa/10-shot data
[Eval batch=644/919] Eval on piqa/10-shot data
[Eval batch=735/919] Eval on piqa/10-shot data
[Eval batch=827/919] Eval on piqa/10-shot data
[Eval batch=919/919] Eval on piqa/10-shot data
[Eval batch=1/10042] Eval on hellaswag/10-shot data
[Eval batch=1005/10042] Eval on hellaswag/10-shot data
[Eval batch=2009/10042] Eval on hellaswag/10-shot data
[Eval batch=3013/10042] Eval on hellaswag/10-shot data
[Eval batch=4017/10042] Eval on hellaswag/10-shot data
[Eval batch=5022/10042] Eval on hellaswag/10-shot data
[Eval batch=6026/10042] Eval on hellaswag/10-shot data
[Eval batch=7030/10042] Eval on hellaswag/10-shot data
[Eval batch=8034/10042] Eval on hellaswag/10-shot data
[Eval batch=9038/10042] Eval on hellaswag/10-shot data
[Eval batch=10042/10042] Eval on hellaswag/10-shot data
[Eval batch=1/2376] Eval on arc_easy/10-shot data
[Eval batch=238/2376] Eval on arc_easy/10-shot data
[Eval batch=476/2376] Eval on arc_easy/10-shot data
[Eval batch=714/2376] Eval on arc_easy/10-shot data
[Eval batch=951/2376] Eval on arc_easy/10-shot data
[Eval batch=1188/2376] Eval on arc_easy/10-shot data
[Eval batch=1426/2376] Eval on arc_easy/10-shot data
[Eval batch=1664/2376] Eval on arc_easy/10-shot data
[Eval batch=1901/2376] Eval on arc_easy/10-shot data
[Eval batch=2138/2376] Eval on arc_easy/10-shot data
[Eval batch=2376/2376] Eval on arc_easy/10-shot data
[Eval batch=1/1172] Eval on arc_challenge/10-shot data
[Eval batch=118/1172] Eval on arc_challenge/10-shot data
[Eval batch=235/1172] Eval on arc_challenge/10-shot data
[Eval batch=352/1172] Eval on arc_challenge/10-shot data
[Eval batch=469/1172] Eval on arc_challenge/10-shot data
[Eval batch=586/1172] Eval on arc_challenge/10-shot data
[Eval batch=704/1172] Eval on arc_challenge/10-shot data
[Eval batch=821/1172] Eval on arc_challenge/10-shot data
[Eval batch=938/1172] Eval on arc_challenge/10-shot data
[Eval batch=1055/1172] Eval on arc_challenge/10-shot data
[Eval batch=1172/1172] Eval on arc_challenge/10-shot data
[Eval batch=1/50] Eval on copa/0-shot data
[Eval batch=6/50] Eval on copa/0-shot data
[Eval batch=11/50] Eval on copa/0-shot data
[Eval batch=16/50] Eval on copa/0-shot data
[Eval batch=21/50] Eval on copa/0-shot data
[Eval batch=26/50] Eval on copa/0-shot data
[Eval batch=30/50] Eval on copa/0-shot data
[Eval batch=35/50] Eval on copa/0-shot data
[Eval batch=40/50] Eval on copa/0-shot data
[Eval batch=45/50] Eval on copa/0-shot data
[Eval batch=50/50] Eval on copa/0-shot data
[Eval batch=1/1635] Eval on boolq/10-shot data
[Eval batch=164/1635] Eval on boolq/10-shot data
[Eval batch=328/1635] Eval on boolq/10-shot data
[Eval batch=491/1635] Eval on boolq/10-shot data
[Eval batch=655/1635] Eval on boolq/10-shot data
[Eval batch=818/1635] Eval on boolq/10-shot data
[Eval batch=981/1635] Eval on boolq/10-shot data
[Eval batch=1145/1635] Eval on boolq/10-shot data
[Eval batch=1308/1635] Eval on boolq/10-shot data
[Eval batch=1472/1635] Eval on boolq/10-shot data
[Eval batch=1635/1635] Eval on boolq/10-shot data
Ran google/flan-t5-xl eval in: 13817.477584123611 seconds
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /home/codeless/Desktop/llm-foundry/scripts/eval/eval.py:252 in │
│ │
│ 249 │ │ yaml_cfg = om.load(f) │
│ 250 │ cli_cfg = om.from_cli(args_list) │
│ 251 │ cfg = om.merge(yaml_cfg, cli_cfg) │
│ ❱ 252 │ main(cfg) │
│ 253 │
│ │
│ /home/codeless/Desktop/llm-foundry/scripts/eval/eval.py:126 in main │
│ │
│ 123 │ │ │ │ │ │ │ │ │ │ │ model_gauntlet_df) │
│ 124 │ │ │
│ 125 │ │ if model_gauntlet_callback is not None: │
│ ❱ 126 │ │ │ composite_scores = model_gauntlet_callback.eval_end( │
│ 127 │ │ │ │ None, in_memory_logger) │
│ 128 │ │ │
│ 129 │ │ benchmark_to_taxonomy = {} │
│ │
│ /home/codeless/Desktop/llm-foundry/llmfoundry/callbacks/model_gauntlet_callback.py:112 in │
│ eval_end │
│ │
│ 109 │ │ return {k: sum(v) / len(v) for k, v in results.items()} │
│ 110 │ │
│ 111 │ def eval_end(self, state: State, logger: Logger): │
│ ❱ 112 │ │ new_metrics = self.compute_averages(logger) │
│ 113 │ │ composite_scores = {} │
│ 114 │ │ for category in self.categories: │
│ 115 │ │ │ composite_scores[category['name']] = [] │
│ │
│ /home/codeless/Desktop/llm-foundry/llmfoundry/callbacks/model_gauntlet_callback.py:92 in │
│ compute_averages │
│ │
│ 89 │ │ │ 'metrics/(.*?)/(\d+)-shot(/.*?)?/InContextLearning(.*)') │
│ 90 │ │ for key in self.logger_keys: │
│ 91 │ │ │ match = pat.match(key) │
│ ❱ 92 │ │ │ val = logger_data.data[key][0][1].item() │
│ 93 │ │ │ │
│ 94 │ │ │ if match: │
│ 95 │ │ │ │ eval_name = match.group(1) │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
KeyError: 'metrics/lambada_openai/0-shot/InContextLearningLMAccuracy'
ERROR:composer.cli.launcher:Rank 0 crashed with exit code 1.
Waiting up to 30 seconds for all training processes to terminate. Press Ctrl-C to exit immediately.
Global rank 0 (PID 11800) exited with code 1
ERROR:composer.cli.launcher:Global rank 0 (PID 11800) exited with code 1

To reproduce

I pip-installed mosaicml and the llm-foundry requirements yesterday, and ran the eval.py script on a flan-t5-xl model according to the quickstart guide.
The only changes were max_seq_len and icl_seq_len set to 512, model_name_or_path set to google/flan-t5-xl, and the model name changed to hf_t5, in hf_eval.yaml and tasks_light.yaml.
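
Roughly, the hf_eval.yaml overrides looked like the sketch below (reconstructed against the quickstart layout, so field names in the shipped config may differ slightly):

# Sketch (not the exact file) of the hf_eval.yaml changes described above;
# field names are assumed from the quickstart example.
max_seq_len: 512
model_name_or_path: google/flan-t5-xl

models:
  - model_name: ${model_name_or_path}
    model:
      name: hf_t5                          # changed from the default hf_causal_lm
      pretrained_model_name_or_path: ${model_name_or_path}
      init_device: cpu
      pretrained: true
    tokenizer:
      name: ${model_name_or_path}
      kwargs:
        model_max_length: ${max_seq_len}

# icl_seq_len was set to 512 in tasks_light.yaml as well
icl_tasks: 'eval/yamls/tasks_light.yaml'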

Expected behavior

Successful benchmarking.

Additional context

I can't figure out why it couldn't find the key in the logger. I lack the experience to dig into it more, so I hope this info is enough for you guys to figure out what's wrong.
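
From the traceback, the failing pattern looks like the minimal sketch below (a simplified stand-in for compute_averages in model_gauntlet_callback.py, not the actual code): each expected key is looked up in the in-memory logger's data before the regex match is checked, so a metric that this T5 run never logged raises the KeyError.

# Minimal sketch of the lookup that fails, simplified from the traceback above;
# the data below is hypothetical.
import re

# Keys the gauntlet callback expects to average ...
logger_keys = ['metrics/lambada_openai/0-shot/InContextLearningLMAccuracy']
# ... versus what the in-memory logger actually recorded for this run
# (assumption: the metric was logged under a different key, or not at all).
logger_data = {}

pat = re.compile(r'metrics/(.*?)/(\d+)-shot(/.*?)?/InContextLearning(.*)')
for key in logger_keys:
    match = pat.match(key)
    val = logger_data[key]  # KeyError here when the metric was never logged
    if match:
        print(match.group(1), val)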

By the way, where are the benchmark results saved?

sigjhl added the bug (Something isn't working) label on Jul 14, 2023
hanlint (Collaborator) commented Jul 23, 2023

cc: @bmosaicml, who worked on the evaluation code, to take a look.
