[BUG] Mismatch between dtype settings in model and ds_config results in NaN loss #5509

Open
Taiki-azrs opened this issue May 8, 2024 · 0 comments
Labels: bug, training

Describe the bug
When the dtype settings of the model and ds_config do not match, training starts without any error and the loss becomes NaN (this occurs mainly with ZeRO stage 0).

I suggest adding a dtype check between the model and the config during deepspeed.initialize and raising an assertion error if they do not match. What do you think?
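
For illustration, here is a rough sketch of the kind of check I have in mind. The helper name and the way the expected dtype is derived from the fp16/bf16 sections of ds_config are just assumptions for this example, not existing DeepSpeed code:

    import torch

    def assert_model_matches_config_dtype(model, ds_config):
        # Hypothetical helper: derive the dtype implied by the DeepSpeed config
        # from its fp16/bf16 sections, defaulting to fp32.
        if ds_config.get("fp16", {}).get("enabled", False):
            expected = torch.float16
        elif ds_config.get("bf16", {}).get("enabled", False):
            expected = torch.bfloat16
        else:
            expected = torch.float32

        # Fail fast if any parameter disagrees with the configured dtype.
        for name, param in model.named_parameters():
            assert param.dtype == expected, (
                f"dtype mismatch: parameter '{name}' is {param.dtype}, "
                f"but ds_config expects {expected}"
            )

A check along these lines could run at the start of deepspeed.initialize, so the mismatch is reported immediately instead of surfacing later as a NaN loss.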

To Reproduce

  1. Use the cifar example from DeepSpeedExamples.
  2. Edit cifar10_deepspeed.py as follows (added lines are marked with +):
+    # Convert the model to fp16, while ds_config (--dtype fp32) stays at fp32.
+    net = net.half()
    model_engine, optimizer, trainloader, __ = deepspeed.initialize(
        args=args,
        model=net,
        model_parameters=parameters,
        training_data=trainset,
        config=ds_config,
    )

    # Get the local device name (str) and local rank (int).
    local_device = get_accelerator().device_name(model_engine.local_rank)
    local_rank = model_engine.local_rank

    # For float32, target_dtype will be None so no datatype conversion needed.
    target_dtype = None
    if model_engine.bfloat16_enabled():
        target_dtype = torch.bfloat16
    elif model_engine.fp16_enabled():
        target_dtype = torch.half
+    # Force fp16 inputs so they match the fp16 model parameters, even though
+    # the engine is running in fp32 mode.
+    target_dtype = torch.half
  3. Execute the following:
$ deepspeed --bind_cores_to_rank cifar10_deepspeed.py --dtype fp32 --stage 0
  4. Observe that the loss becomes NaN:
[ 1,  2000] loss:  nan
[ 2,  2000] loss:  nan
[ 3,  2000] loss:  nan
[ 4,  2000] loss:  nan
[ 5,  2000] loss:  nan
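
In this run the model parameters are fp16 while the engine is configured for fp32, so nothing fails early and the loss silently degrades to NaN. Until such a check exists inside deepspeed.initialize, a rough workaround is to verify the dtypes right after initialization; the sketch below only relies on the bfloat16_enabled/fp16_enabled flags already used above and assumes the model_engine from the snippet:

    import torch

    # Determine which dtype the engine thinks it is running in.
    if model_engine.bfloat16_enabled():
        engine_dtype = torch.bfloat16
    elif model_engine.fp16_enabled():
        engine_dtype = torch.half
    else:
        engine_dtype = torch.float32

    # Compare against the actual parameter dtypes of the wrapped model.
    param_dtypes = {p.dtype for p in model_engine.module.parameters()}
    assert param_dtypes == {engine_dtype}, (
        f"model parameters are {param_dtypes}, but the DeepSpeed engine "
        f"is configured for {engine_dtype}"
    )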