Title: CUDA RuntimeError: Unspecified Launch Failure during Training #30913

Open · 2 of 4 tasks
Hongjie1Chu opened this issue May 20, 2024 · 8 comments

@Hongjie1Chu

System Info

  • transformers version: 4.41.0
  • Platform: Linux-5.15.0-88-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.27.2
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @younesbelkada @muellerzr

Why does this error occur when passing a custom device_map? The map I wrote only differs from the auto-generated map in device order. Why does this cause an error? Does the device order affect the execution results?
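
For reference, a minimal sketch (an editorial addition, not part of the original report) of one way to inspect the auto-generated placement for the same checkpoint before writing a custom map:

from transformers import AutoModelForCausalLM

# Let Accelerate pick the placement, then print the resulting module-to-device map.
model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map='auto')
print(model.hf_device_map)  # e.g. {'model.embed_tokens': 0, 'model.layers.0': 0, ...}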

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

import torch
from torch import nn
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, Trainer, LlamaForCausalLM
from transformers import DataCollatorForLanguageModeling, DataCollatorWithPadding
from transformers.utils.fx import symbolic_trace
import argparse
import numpy as np
from datasets import load_metric, load_dataset


def compute_metrics(eval_preds):
    metric = load_metric("glue", "mrpc")
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)


def tokenize_function(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument('--gpus', type=int, help='the number of gpus', default=8)
    parser.add_argument('--modelName', type=str, help='the model name', default='Llama2')
    parser.add_argument('--bs', type=int, help='the per-device batch size', default=4)

    args = parser.parse_args()

    # Step 1: Define the tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained('FlagAlpha/Atom-7B-Chat')
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    # Custom device_map: same modules as the auto-generated map, only the device order differs
    device_map = {
        'model.embed_tokens': 6,
        'model.layers.0': 6,
        'model.layers.1': 4,
        'model.layers.2': 1,
        'model.layers.3': 1,
        'model.layers.4': 1,
        'model.layers.5': 0,
        'model.layers.6': 0,
        'model.layers.7': 0,
        'model.layers.8': 0,
        'model.layers.9': 0,
        'model.layers.10': 6,
        'model.layers.11': 5,
        'model.layers.12': 5,
        'model.layers.13': 5,
        'model.layers.14': 5,
        'model.layers.15': 5,
        'model.layers.16': 4,
        'model.layers.17': 4,
        'model.layers.18': 4,
        'model.layers.19': 4,
        'model.layers.20': 3,
        'model.layers.21': 3,
        'model.layers.22': 3,
        'model.layers.23': 3,
        'model.layers.24': 3,
        'model.layers.25': 2,
        'model.layers.26': 2,
        'model.layers.27': 2,
        'model.layers.28': 2,
        'model.layers.29': 2,
        'model.layers.30': 1,
        'model.layers.31': 1,
        "model.norm.weight": 1,
        "lm_head": 6,
    }

    model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map, num_labels=2)

    print(model)
    print(model.hf_device_map)

    print("gpt start train")

    # Step 4: Load the dataset
    data_files = {
        'train': '/mnt/glue_mrpc/train.jsonl',
        'test': '/mnt/glue_mrpc/test.jsonl',
        'validation': '/mnt/glue_mrpc/validation.jsonl'
    }
    raw_datasets = load_dataset('json', data_files=data_files)

    tokenized_datasets = raw_datasets.map(tokenize_function, batched=True)
    tokenized_datasets = tokenized_datasets.rename_column("label", 'labels')
    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    # Step 5: Train the model
    training_args = TrainingArguments(
        output_dir='./results',
        num_train_epochs=5,
        per_device_train_batch_size=args.bs,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        data_collator=data_collator,
        tokenizer=tokenizer,
        compute_metrics=compute_metrics,
    )

    print('start train')
    trainer.train()

Expected behavior

I want to know if the device order in the device_map affects the results.

@Hongjie1Chu (Author)

And when I set:
device_map["model.embed_tokens"] = 0
device_map["model.norm.weight"] = 0

it does not error at the start, but it errors later during training:
[screenshot of the error trace]
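
A hedged sketch of one common arrangement (an assumption for illustration, not a confirmed fix from this thread): keep model.embed_tokens on the same device as the first decoder layer and model.norm / lm_head on the same device as the last one, so the inputs, final hidden states, and output head are co-located. The layer split below is illustrative for a 32-layer model on 8 GPUs:

from transformers import AutoModelForCausalLM

# Illustrative custom map: 4 layers per device, embedding with the first layer,
# final norm and lm_head with the last layer.
device_map = {f'model.layers.{i}': i // 4 for i in range(32)}
device_map['model.embed_tokens'] = 0   # same device as model.layers.0
device_map['model.norm'] = 7           # same device as model.layers.31
device_map['lm_head'] = 7

model = AutoModelForCausalLM.from_pretrained('FlagAlpha/Atom-7B-Chat', device_map=device_map)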

@younesbelkada (Contributor)

Hi @Hongjie1Chu!
In principle the device order shouldn't affect the training behaviour. Can you let us know what happens when you run the training script with CUDA_LAUNCH_BLOCKING=1? Also, do you run your training script with accelerate launch xxx or python xxx.py?
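
A minimal sketch of one way to do this (assuming the reproduction script above is saved as train.py, a hypothetical filename): either prefix the launch command, e.g. CUDA_LAUNCH_BLOCKING=1 python train.py --bs 4, or set the variable at the very top of the script before torch is imported:

import os

# Make CUDA kernel launches synchronous so the failing kernel is reported
# at its call site instead of at a later, unrelated CUDA call.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # must be imported after the environment variable is set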

@Sharan1712

I am facing a similar issue too.
I haven't made any changes to my code, but all of a sudden it gives this error after training for about 30 steps.

@Sharan1712

Update: I downgraded PEFT to 0.10.0 and Transformers to 4.39.0, and it is working fine now.

@Hongjie1Chu (Author)

Thanks for your answer!

@Sharan1712

Has there been a solution for this yet? I tried using the latest version of transformers and it still gave this issue. I want to use some of the new quantization methods.


@younesbelkada (Contributor)

Hi!
It is hard for us to debug without a proper error trace. Can you re-run the training script with CUDA_LAUNCH_BLOCKING=1 and paste the error trace here?
