Training convergence on the gobjaverse dataset #35

Open
wenqsun opened this issue Mar 4, 2024 · 12 comments

wenqsun commented Mar 4, 2024

Thanks for your great work!

I have tried the code on Gobjaverse, but I found that the training is not very stable. For example, during the initial epochs, I achieved the result below:
[images: predicted renderings and ground-truth renderings after the initial epochs]
The top row is the prediction and the bottom row is the ground truth; the color and shape are learned relatively well. But after several epochs, the training suffers a severe collapse and the loss stays high:
[image: result after the collapse]
I adjusted the lr, the batch size, and the coefficient of the LPIPS loss, but most settings still run into this collapse. Have you encountered this situation, or can you suggest a potential solution?

Thanks!

ashawkey (Collaborator) commented Mar 5, 2024

@wenqsun This is strange... I find that turning off the LPIPS loss at the beginning stage of training can be helpful, but once the model has converged to roughly the quality in your first image, adding LPIPS back significantly improves the fidelity. Have you made any other changes?
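
(For reference, a minimal sketch of that schedule, assuming a plain PyTorch training loop and the lpips package; the step threshold and weight below are placeholders to tune, not values from the repo:)

    import torch.nn.functional as F
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net='vgg')  # move to the training device as needed

    LPIPS_START_STEP = 10_000  # placeholder: enable LPIPS only once the MSE stage has converged
    LAMBDA_LPIPS = 1.0         # placeholder coefficient

    def compute_loss(pred, gt, step):
        # pred, gt: [B, 3, H, W] renderings in [0, 1]
        loss = F.mse_loss(pred, gt)
        if step >= LPIPS_START_STEP:
            # lpips expects inputs scaled to [-1, 1]
            loss = loss + LAMBDA_LPIPS * lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
        return loss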

wenqsun (Author) commented Mar 8, 2024

Oh, sorry for the late reply.

Thanks for your suggestion! I have solved the problem and achieved high-quality generation results. I also found that the coefficient of the LPIPS loss is very important.

junwuzhang19 commented

@wenqsun Hi, I also tried to reproduce on Gobjaverse but got bad results after training for 10k steps. I tried decreasing the lr from 4e-4 to 1e-4, but it didn't help. Could you share your training configuration as well as your modifications?

Thanks!

wenqsun (Author) commented Mar 11, 2024

@junwuzhang19 Hi, sorry for the late reply. I think you can lower the lr further. I use 1e-5 to 5e-5, and the training process is very stable. Moreover, I recommend recording the MSE and LPIPS losses separately and adjusting the coefficient between the two terms based on the loss curves.
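
(A small sketch of that bookkeeping, assuming TensorBoard logging and the lpips package; the log directory and coefficient are placeholders:)

    import lpips
    import torch.nn.functional as F
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter('runs/lgm_gobjaverse')  # placeholder log dir
    lpips_fn = lpips.LPIPS(net='vgg')              # move to the training device as needed
    lambda_lpips = 1.0                             # placeholder coefficient to tune against the curves

    def training_loss(pred, gt, step):
        # log both terms separately so their relative scale stays visible
        mse = F.mse_loss(pred, gt)
        perceptual = lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()  # lpips expects [-1, 1]
        writer.add_scalar('loss/mse', mse.item(), step)
        writer.add_scalar('loss/lpips', perceptual.item(), step)
        return mse + lambda_lpips * perceptual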

junwuzhang19 commented

@wenqsun Thanks for your kind advice! I will try it later. Besides, I find that training LGM on 8 A100 GPUs for 30 epochs finishes in about 1 day, which is significantly faster than the 4 days on 32 A100 GPUs claimed in the paper. I wonder if you trained LGM for a similar amount of time and got results as good as the released LGM weights. Thanks!

wenqsun (Author) commented Mar 11, 2024

@junwuzhang19 Yes, I also find that 30 epochs take only about one day on 8 A100 GPUs. Based on my results, I think more epochs are required to reach competitive performance. Overall, with a similar amount of training time I got results similar to the original LGM.

chenguolin commented

Hi @ashawkey @wenqsun @junwuzhang19,

I'm also attempting to replicate LGM on GObjaverse. Have you experimented with training the tiny, small, and large versions of LGM and comparing their performance and loss/PSNR/LPIPS curves?

In my case, the tiny LGM has the best performance, which is surprising to me. I tried to adjust the learning rate to a smaller value for the small and large versions, but it didn't make a difference: the tiny model still performs the best. Do you have any insights?

Thank you for your time and attention.

ashawkey (Collaborator) commented

@chenguolin Hi, how do you evaluate the performance? The loss/PSNR curves may not fairly reflect the quality across these settings, as they are evaluated at different output resolutions.

chenguolin commented Mar 13, 2024

Thanks for your reply!

Yes, that's correct. However, for the tiny and small versions, the rendering image size remains the same at 256, and I have set the number of rendering views (opt.num_views) to 8 for both. The tiny model still outperforms the small version in both LPIPS and PSNR, even after adjusting the learning rate of the small version. The tiny model works perfectly for me, so the training process should be correct (?).

I'm interested to know whether others have encountered the same issue. I'm also curious whether this could be due to the change in training dataset size, and whether more careful hyperparameter selection is needed. Additionally, I plan to train the different versions of LGM on the 80k dataset (https://github.com/ashawkey/objaverse_filter) later.
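
(For what it's worth, a minimal sketch of a resolution-matched comparison, not the repo's evaluation code; the resize step and metric choices are assumptions:)

    import torch
    import torch.nn.functional as F
    import lpips

    lpips_fn = lpips.LPIPS(net='vgg')  # move to the evaluation device as needed

    @torch.no_grad()
    def metrics_at_fixed_resolution(pred, gt, size=256):
        # pred, gt: [B, 3, H, W] renderings in [0, 1]; resize both to a common
        # resolution so tiny/small/big models are compared on equal footing
        pred = F.interpolate(pred, size=(size, size), mode='bilinear', align_corners=False)
        gt = F.interpolate(gt, size=(size, size), mode='bilinear', align_corners=False)
        mse = F.mse_loss(pred, gt)
        psnr = -10.0 * torch.log10(mse)  # assumes a peak value of 1.0
        perceptual = lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
        return psnr.item(), perceptual.item()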

ashawkey (Collaborator) commented

This is interesting; actually I haven't done a lot of experiments with the small and tiny models. Have you tried testing and visualizing the generated Gaussians with ImageDream-generated multi-view images? The tiny model also produces a smaller number of Gaussians, which I found can potentially harm performance.

wenqsun (Author) commented Mar 14, 2024

@chenguolin Oh, that's amazing! Actually, I only trained LGM using the big config.

I am wondering whether you have compared the final performance of the tiny, small, and big configs on the test dataset. In my experiments, I found that sometimes, even though the training loss curve does not decrease for a long time, the test loss still decreases gradually (sometimes with quite large changes). Maybe you can keep training longer and see what happens for these three models.

This is a really interesting finding!

TianFangzheng commented Jun 1, 2024

@wenqsun @ashawkey Hi, sorry to bother you. I am trying to train a model on the gobjaverse dataset, but the training results are poor. I think there may be something wrong with the camera parameters, which I read from the json files. Is the way I load the camera parameters correct? Thank you for your time.

The following is the code related to the camera parameters:

        # default camera intrinsics
        self.tan_half_fov = np.tan(0.5 * np.deg2rad(self.opt.fovy))
        self.proj_matrix = torch.zeros(4, 4, dtype=torch.float32)
        self.proj_matrix[0, 0] = 1 / self.tan_half_fov
        self.proj_matrix[1, 1] = 1 / self.tan_half_fov
        self.proj_matrix[2, 2] = (self.opt.zfar + self.opt.znear) / (self.opt.zfar - self.opt.znear)
        self.proj_matrix[3, 2] = - (self.opt.zfar * self.opt.znear) / (self.opt.zfar - self.opt.znear)
        self.proj_matrix[2, 3] = 1

        ......

            with open(meta_path, 'r', encoding='utf-8') as file:
                meta = json.load(file)

            c2w = np.eye(4)
            c2w[:3, 0] = np.array(meta['x'])
            c2w[:3, 1] = np.array(meta['y'])
            c2w[:3, 2] = np.array(meta['z'])
            c2w[:3, 3] = np.array(meta['origin'])
            c2w = torch.tensor(c2w, dtype=torch.float32).reshape(4, 4)

            # blender world + opencv cam --> opengl world & cam
            c2w[1] *= -1
            c2w[[1, 2]] = c2w[[2, 1]]
            c2w[:3, 1:3] *= -1  # invert up and forward direction

            cam_poses.append(c2w)

            vid_cnt += 1
            if vid_cnt == self.opt.num_views:
                break

        if vid_cnt < self.opt.num_views:
            print(f'[WARN] dataset {uid}: not enough valid views, only {vid_cnt} views found!')
            n = self.opt.num_views - vid_cnt
            cam_poses = cam_poses + [cam_poses[-1]] * n

        cam_poses = torch.stack(cam_poses, dim=0)  # [V, 4, 4]

        # normalized camera feats as in paper (transform the first pose to a fixed position)
        radius = torch.norm(cam_poses[0, :3, 3])
        cam_poses[:, :3, 3] *= self.opt.cam_radius / radius
        transform = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, self.opt.cam_radius], [0, 0, 0, 1]], dtype=torch.float32) @ torch.inverse(cam_poses[0])
        cam_poses = transform.unsqueeze(0) @ cam_poses  # [V, 4, 4]

        # opengl to colmap camera for gaussian renderer
        cam_poses[:, :3, 1:3] *= -1
        cam_view = torch.inverse(cam_poses).transpose(1, 2)  # [V, 4, 4]
        cam_view_proj = cam_view @ self.proj_matrix  # [V, 4, 4]
        cam_pos = -cam_poses[:, :3, 3]  # [V, 3]

        results['cam_view'] = cam_view
        results['cam_view_proj'] = cam_view_proj
        results['cam_pos'] = cam_pos
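
(As a side note, a quick sanity check, not from the repo, that can be dropped in right after the block above: the normalization maps the first camera to [0, 0, cam_radius], and the OpenGL-to-colmap flip only negates rotation columns, so the camera positions and the orthonormality of the rotations should survive it, assuming the x/y/z axes in the json form a proper rotation.)

        # sanity check (not from the repo): after normalization the first camera
        # should sit at [0, 0, cam_radius]; the opengl->colmap flip above only
        # negates two rotation columns, so positions and orthonormality are unchanged
        expected_pos = torch.tensor([0.0, 0.0, self.opt.cam_radius])
        assert torch.allclose(cam_poses[0, :3, 3], expected_pos, atol=1e-4)
        R = cam_poses[:, :3, :3]
        assert torch.allclose(R @ R.transpose(1, 2), torch.eye(3).expand_as(R), atol=1e-4)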
