Title: CUDA RuntimeError: Unspecified Launch Failure during Training #30913

Open · 2 of 4 tasks
Hongjie1Chu opened this issue May 20, 2024 · 8 comments

@Hongjie1Chu

System Info

  • transformers version: 4.41.0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when passing a custom device_map? The map I wrote only differs from the auto-generated map in device order. Why does this cause an error? Does the device order affect the execution results?
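
For reference, a minimal sketch (an editorial addition, not part of the original report) of one way to inspect the auto-generated placement for the same checkpoint before writing a custom map:

from transformers import AutoModelForCausalLM

# Let Accelerate pick the placement, then print the resulting module-to-device map.
model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map='auto')
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ...}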

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset


def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help='the model name', default='Llama2')
    parser.add_argument('--bs', type=int, help='the per-device batch size', default=4)

    args = parser.parse_args()

    # Step 1: Define the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Custom device_map: same modules as the auto-generated map, only the device order differs
    device_map = {
        'model.embed_tokens': 6,
        'model.layers.0': 6,
        'model.layers.1': 4,
        'model.layers.2': 1,
        'model.layers.3': 1,
        'model.layers.4': 1,
        'model.layers.5': 0,
        'model.layers.6': 0,
        'model.layers.7': 0,
        'model.layers.8': 0,
        'model.layers.9': 0,
        'model.layers.10': 6,
        'model.layers.11': 5,
        'model.layers.12': 5,
        'model.layers.13': 5,
        'model.layers.14': 5,
        'model.layers.15': 5,
        'model.layers.16': 4,
        'model.layers.17': 4,
        'model.layers.18': 4,
        'model.layers.19': 4,
        'model.layers.20': 3,
        'model.layers.21': 3,
        'model.layers.22': 3,
        'model.layers.23': 3,
        'model.layers.24': 3,
        'model.layers.25': 2,
        'model.layers.26': 2,
        'model.layers.27': 2,
        'model.layers.28': 2,
        'model.layers.29': 2,
        'model.layers.30': 1,
        'model.layers.31': 1,
        "model.norm.weight": 1,
        "lm_head": 6,
    }

    model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

    print(model)
    print(model.hf_device_map)

    print("gpt start train")

    # Step 4: Load the dataset
    data_files = {
        'train': '/mnt/glue_mrpc/train.jsonl',
        'test': '/mnt/glue_mrpc/test.jsonl',
        'validation': '/mnt/glue_mrpc/validation.jsonl'
    }
    raw_datasets = load_dataset('json', data_files=data_files)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Step 5: Train the model
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=args.bs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    print('start train')
    trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.

@Hongjie1Chu (Author)

And when I set:
device_map["model.embed_tokens"] = 0
device_map["model.norm.weight"] = 0

it does not error at the start, but it errors later during training:
[screenshot of the error trace]
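
A hedged sketch of one common arrangement (an assumption for illustration, not a confirmed fix from this thread): keep model.embed_tokens on the same device as the first decoder layer and model.norm / lm_head on the same device as the last one, so the inputs, final hidden states, and output head are co-located. The layer split below is illustrative for a 32-layer model on 8 GPUs:

from transformers import AutoModelForCausalLM

# Illustrative custom map: 4 layers per device, embedding with the first layer,
# final norm and lm_head with the last layer.
device_map = {f'model.layers.{i}': i // 4 for i in range(32)}
device_map['model.embed_tokens'] = 0   # same device as model.layers.0
device_map['model.norm'] = 7           # same device as model.layers.31
device_map['lm_head'] = 7

model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map)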

@younesbelkada (Contributor)

Hi @Hongjie1Chu!
In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or python xxx.py?
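
A minimal sketch of one way to do this (assuming the reproduction script above is saved as train.py, a hypothetical filename): either prefix the launch command, e.g. CUDA_LAUNCH_BLOCKING=1 python train.py --bs 4, or set the variable at the very top of the script before torch is imported:

import os

# Make CUDA kernel launches synchronous so the failing kernel is reported
# at its call site instead of at a later, unrelated CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # must be imported after the environment variable is set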

@Sharan1712

I am facing a similar issue too.
I haven't made any changes to my code, but all of a sudden it gives this error after training for about 30 steps.

@Sharan1712

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0, and it is working fine now.

@Hongjie1Chu (Author)

Thanks for your answer!

@Sharan1712

Has there been a solution for this yet? I tried using the latest version of transformers and it still gave this issue. I want to use some of the new quantization methods.


@younesbelkada (Contributor)

Hi!
It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?
