ValueError: math domain error #40

Open
hayoung-jeremy opened this issue Apr 17, 2024 · 5 comments

@hayoung-jeremy

summary

  • the error happens during training
  • tested on Runpod with 4x A100 SXM 80GB GPUs, 128 vCPU, 1006 GB RAM
  • runpod/pytorch:2.2.0-py3.10-cuda12.1.1-devel-ubuntu22.04

reproduction of the error

  1. installation of OpenLRM was successful

  2. data preparation using blender_script.py was successful; it generated 100 samples, each containing rgba, pose, and intrinsics.npy.

  3. configuration of training_sample.yaml and accelerate_training.yaml as follows:

        
    # training_sample.yaml
    experiment:
        type: lrm
        seed: 42
        parent: lrm-objaverse
        child: small-dummyrun
    
    model:
        camera_embed_dim: 1024
        rendering_samples_per_ray: 96
        transformer_dim: 512
        transformer_layers: 12
        transformer_heads: 8
        triplane_low_res: 32
        triplane_high_res: 64
        triplane_dim: 32
        encoder_type: dinov2
        encoder_model_name: dinov2_vits14_reg
        encoder_feat_dim: 384
        encoder_freeze: false
    
    dataset:
        subsets:
            -   name: objaverse
                root_dirs:
                    - "/root/OpenLRM/views" # modified this value
                meta_path:
                    train: "/root/OpenLRM/train_uids.json" # modified this value
                    val: "/root/OpenLRM/val_uids.json" # modified this value
                sample_rate: 1.0
        sample_side_views: 3
        source_image_res: 224
        render_image:
            low: 64
            high: 192
            region: 64
        normalize_camera: true
        normed_dist_to_center: auto
        num_train_workers: 4
        num_val_workers: 2
        pin_mem: true
    
    train:
        mixed_precision: bf16  # REPLACE THIS BASED ON GPU TYPE
        find_unused_parameters: false
        loss:
            pixel_weight: 1.0
            perceptual_weight: 1.0
            tv_weight: 5e-4
        optim:
            lr: 4e-4
            weight_decay: 0.05
            beta1: 0.9
            beta2: 0.95
            clip_grad_norm: 1.0
        scheduler:
            type: cosine
            warmup_real_iters: 3000
        batch_size: 16  # REPLACE THIS (PER GPU)
        accum_steps: 1  # REPLACE THIS
        epochs: 60  # REPLACE THIS
        debug_global_steps: null
    
    val:
        batch_size: 4
        global_step_period: 1000
        debug_batches: null
    
    saver:
        auto_resume: true
        load_model: null
        checkpoint_root: ./exps/checkpoints
        checkpoint_global_steps: 1000
        checkpoint_keep_level: 5
    
    logger:
        stream_level: WARNING
        log_level: INFO
        log_root: ./exps/logs
        tracker_root: ./exps/trackers
        enable_profiler: false
        trackers:
            - tensorboard
        image_monitor:
            train_global_steps: 100
            samples_per_log: 4
    
    compile:
        suppress_errors: true
        print_specializations: true
        disable: true

    # accelerate_training.yaml
    compute_environment: LOCAL_MACHINE
    debug: false
    distributed_type: MULTI_GPU
    downcast_bf16: 'no'
    gpu_ids: all
    machine_rank: 0
    main_training_function: main
    mixed_precision: bf16
    num_machines: 1
    num_processes: 4 # only modified this value
    rdzv_backend: static
    same_network: true
    tpu_env: []
    tpu_use_cluster: false
    tpu_use_sudo: false
    use_cpu: false
  4. the error message:

    [TRAIN STEP]loss=0.624, loss_pixel=0.0577, loss_perceptual=0.566, loss_tv=0.698, lr=8.13e-6: 100%|███████████████████████████████████████████████| 60/60 [04:55<00:00,  4.92s/it]
    Traceback (most recent call last):
      File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
        exec(code, run_globals)
      File "/root/OpenLRM/openlrm/launch.py", line 36, in <module>
        main()
      File "/root/OpenLRM/openlrm/launch.py", line 32, in main
        runner.run()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 338, in run
        self.train()
      File "/root/OpenLRM/openlrm/runners/train/lrm.py", line 343, in train
        self.save_checkpoint()
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 118, in wrapper
        result = accelerated_func(self, *args, **kwargs)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/accelerator.py", line 669, in _inner
        return PartialState().on_main_process(function)(*args, **kwargs)
      File "/root/OpenLRM/openlrm/runners/train/base_trainer.py", line 246, in save_checkpoint
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    ValueError: math domain error
    [2024-04-17 08:24:09,179] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65932 closing signal SIGTERM
    [2024-04-17 08:24:09,183] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65933 closing signal SIGTERM
    [2024-04-17 08:24:09,186] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 65934 closing signal SIGTERM
    [2024-04-17 08:24:09,301] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 65931) of binary: /usr/bin/python
    Traceback (most recent call last):
      File "/usr/local/bin/accelerate", line 8, in <module>
        sys.exit(main())
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/accelerate_cli.py", line 46, in main
        args.func(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 1066, in launch_command
        multi_gpu_launcher(args)
      File "/usr/local/lib/python3.10/dist-packages/accelerate/commands/launch.py", line 711, in multi_gpu_launcher
        distrib_run.run(args)
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/run.py", line 803, in run
        elastic_launch(
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 135, in __call__
        return launch_agent(self._config, self._entrypoint, list(args))
      File "/usr/local/lib/python3.10/dist-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
        raise ChildFailedError(
    torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
    ============================================================
    openlrm.launch FAILED
    ------------------------------------------------------------
    Failures:
      <NO_OTHER_FAILURES>
    ------------------------------------------------------------
    Root Cause (first observed failure):
    [0]:
      time      : 2024-04-17_08:24:09
      host      : dcf76dfb9908
      rank      : 0 (local_rank: 0)
      exitcode  : 1 (pid: 65931)
      error_file: <N/A>
      traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
    ============================================================
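
For reference, the failing line in base_trainer.py takes math.log(max_ckpt // ckpt_period, ckpt_base). A "math domain error" from math.log means its first argument was not positive, so whatever max_ckpt and ckpt_period hold at that point, their integer quotient must have been 0, which fits a run that only reached 60 global steps against the 1000-step period settings in the config. A minimal sketch of the failure with illustrative values (the variable names mirror the traceback; the numbers are my assumptions based on the log above):

    import math

    ckpt_base = 5       # illustrative; the real value comes from the trainer config
    ckpt_period = 1000  # illustrative; one of the 1000-step period settings above
    max_ckpt = 60       # illustrative; the run above only reached 60 global steps

    try:
        # same expression as base_trainer.py line 246 in the traceback
        cur_order = ckpt_base ** math.floor(math.log(max_ckpt // ckpt_period, ckpt_base))
    except ValueError as e:
        print(e)  # "math domain error" -- 60 // 1000 == 0 and log(0) is undefined
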
@kunalkathare
Copy link

Hey @hayoung-jeremy, try reducing the value of global_step_period under val: in the train sample yaml file until it stops giving the error. That worked for me when I was training with 350 objects.
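
A rough way to see why the period matters (my own sketch, with numbers assumed from this thread: ~100 objects, 4 GPUs, batch_size 16 per GPU, accum_steps 1): the run only produces a handful of global steps, far below the 1000-step periods in the sample config, so nothing periodic ever triggers and the checkpoint bookkeeping ends up taking log of 0.

    # back-of-the-envelope step count; all numbers are assumptions from this thread
    num_samples = 100            # rendered objects
    global_batch = 4 * 16 * 1    # num_gpus * batch_size_per_gpu * accum_steps
    epochs = 60

    # the 60/60 progress bar in the report suggests the last partial batch is
    # dropped, i.e. one optimizer step per epoch with these numbers
    steps_per_epoch = max(1, num_samples // global_batch)
    total_global_steps = steps_per_epoch * epochs
    print(total_global_steps)    # 60

    # keep the periodic settings (val.global_step_period and
    # saver.checkpoint_global_steps) at or below total_global_steps so that
    # validation/checkpointing actually happens during the run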

@hayoung-jeremy
Copy link
Author

Wow, you're my savior, thank you so much! I'll try it!

@hayoung-jeremy
Copy link
Author

Thank you @kunalkathare, I've tried the following config, with epochs and global_step_period modified:

...

train:
    mixed_precision: bf16
    find_unused_parameters: false
    loss:
        pixel_weight: 1.0
        perceptual_weight: 1.0
        tv_weight: 5e-4
    optim:
        lr: 4e-4
        weight_decay: 0.05
        beta1: 0.9
        beta2: 0.95
        clip_grad_norm: 1.0
    scheduler:
        type: cosine
        warmup_real_iters: 3000
    batch_size: 16 
    accum_steps: 1
    epochs: 100  # MODIFIED : 60 -> 100
    debug_global_steps: null

val:
    batch_size: 4
    global_step_period: 100 # MODIFIED : 1000 -> 100
    debug_batches: null

...

and it successfully generated a checkpoint:

[TRAIN STEP]loss=0.642, loss_pixel=0.0695, loss_perceptual=0.572, loss_tv=0.7, lr=1.35e-5: 100%|███████████████████████████████████████████████| 100/100 [03:24<00:00,  5.10s/it]
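
As a side note, the displayed total appears to be just the weighted sum of the three components under the configured weights, so almost all of it is the perceptual term:

    pixel_weight, perceptual_weight, tv_weight = 1.0, 1.0, 5e-4   # from the config above
    loss_pixel, loss_perceptual, loss_tv = 0.0695, 0.572, 0.7     # from the log line above

    total = pixel_weight * loss_pixel + perceptual_weight * loss_perceptual + tv_weight * loss_tv
    print(round(total, 3))  # 0.642 -- matches the reported loss; the perceptual term dominates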

But the loss value seems too high. What should I modify to decrease it?
Should I increase the epochs to 1000?
And what loss value indicates a successfully trained checkpoint?
Could you share your numbers?
Thank you so much for your help.

@kunalkathare

The loss goes down when the dataset is larger, and I guess you can increase the epochs and see if that helps.
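
One more data point that may be relevant (my own observation, not from the repo docs): the lr values in both progress bars are roughly what a linear warmup to the configured peak of 4e-4 over warmup_real_iters: 3000 would give at steps 60 and 100, so both runs ended while the learning rate was still tiny. Running for many more total steps would at least let the schedule get past warmup before the cosine decay.

    # rough check that both runs ended deep inside lr warmup
    # (assumes a linear ramp to the peak lr over warmup_real_iters iterations)
    peak_lr = 4e-4
    warmup_iters = 3000

    for step in (60, 100):
        print(step, peak_lr * step / warmup_iters)
    # 60  -> ~8.0e-06   (the first log shows lr=8.13e-6)
    # 100 -> ~1.33e-05  (the second log shows lr=1.35e-5)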

@hayoung-jeremy
Copy link
Author

Thank you for the kind reply, @kunalkathare!

  • I don't have enough data for now; can I just duplicate the same samples to increase the dataset size?
  • I've also tried increasing the epochs to 1000, and it generated a checkpoint with a loss value of about 0.3.
    But the inference quality from that checkpoint is not that good, as you can see in this issue.
    So I'm going to try increasing the epochs to 10000, is that okay?
    If so, which values should I adjust in train_sample.yaml?

Thanks again for your great help.
