Training convergence on the gobjaverse dataset #35

Open
wenqsun opened this issue Mar 4, 2024 · 12 comments

wenqsun commented Mar 4, 2024

Thanks for your great work!

I have tried the code on Gobjaverse, but I found that the training is not very stable. For example, during the initial epochs, I achieved the result below:
[images: predicted renderings and ground-truth renderings after the initial epochs]
The top row is the prediction and the bottom row is the ground truth; the color and shape are learned relatively well. But after several epochs, the training suffers a severe collapse and the loss stays high:
[image: result after the collapse]
I adjusted the lr, the batch size, and the coefficient of the LPIPS loss, but most settings still run into this collapse. Have you encountered this situation, or can you suggest a potential solution?

Thanks!

ashawkey (Collaborator) commented Mar 5, 2024

@wenqsun This is strange... I find that turning off the LPIPS loss at the beginning stage of training can be helpful, but once the model has converged to roughly the quality in your first image, adding LPIPS back significantly improves the fidelity. Have you made any other changes?
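
(For reference, a minimal sketch of that schedule, assuming a plain PyTorch training loop and the lpips package; the step threshold and weight below are placeholders to tune, not values from the repo:)

    import torch.nn.functional as F
    import lpips  # pip install lpips

    lpips_fn = lpips.LPIPS(net='vgg')  # move to the training device as needed

    LPIPS_START_STEP = 10_000  # placeholder: enable LPIPS only once the MSE stage has converged
    LAMBDA_LPIPS = 1.0         # placeholder coefficient

    def compute_loss(pred, gt, step):
        # pred, gt: [B, 3, H, W] renderings in [0, 1]
        loss = F.mse_loss(pred, gt)
        if step >= LPIPS_START_STEP:
            # lpips expects inputs scaled to [-1, 1]
            loss = loss + LAMBDA_LPIPS * lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
        return loss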

wenqsun (Author) commented Mar 8, 2024

Oh, sorry for the late reply.

Thanks for your suggestion! I have solved the problem and achieved high-quality generation results. I also found that the coefficient of the LPIPS loss is very important.

junwuzhang19 commented

@wenqsun Hi, I also tried to reproduce on Gobjaverse but got bad results after training for 10k steps. I tried decreasing the lr from 4e-4 to 1e-4, but it didn't help. Could you share your training configuration as well as your modifications?

Thanks!

wenqsun (Author) commented Mar 11, 2024

@junwuzhang19 Hi, sorry for the late reply. I think you can lower the lr further. I use 1e-5 to 5e-5, and the training process is very stable. Moreover, I recommend recording the MSE and LPIPS losses separately and adjusting the coefficient between the two terms based on the loss curves.
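
(A small sketch of that bookkeeping, assuming TensorBoard logging and the lpips package; the log directory and coefficient are placeholders:)

    import lpips
    import torch.nn.functional as F
    from torch.utils.tensorboard import SummaryWriter

    writer = SummaryWriter('runs/lgm_gobjaverse')  # placeholder log dir
    lpips_fn = lpips.LPIPS(net='vgg')              # move to the training device as needed
    lambda_lpips = 1.0                             # placeholder coefficient to tune against the curves

    def training_loss(pred, gt, step):
        # log both terms separately so their relative scale stays visible
        mse = F.mse_loss(pred, gt)
        perceptual = lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()  # lpips expects [-1, 1]
        writer.add_scalar('loss/mse', mse.item(), step)
        writer.add_scalar('loss/lpips', perceptual.item(), step)
        return mse + lambda_lpips * perceptual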

junwuzhang19 commented

@wenqsun Thanks for your kind advice! I will try it later. Besides, I find that training LGM on 8 A100 GPUs for 30 epochs finishes in about 1 day, which is significantly faster than the 4 days on 32 A100 GPUs claimed in the paper. I wonder if you trained LGM for a similar amount of time and got results as good as the released LGM weights. Thanks!

wenqsun (Author) commented Mar 11, 2024

@junwuzhang19 Yes, I also find that 30 epochs take only about one day on 8 A100 GPUs. Based on my results, I think more epochs are required to reach competitive performance. Overall, with a similar amount of training time I got results similar to the original LGM.

chenguolin commented

Hi @ashawkey @wenqsun @junwuzhang19,

I'm also attempting to replicate LGM on GObjaverse. Have you experimented with training the tiny, small, and large versions of LGM and comparing their performance and loss/PSNR/LPIPS curves?

In my case, the tiny LGM has the best performance, which is surprising to me. I tried to adjust the learning rate to a smaller value for the small and large versions, but it didn't make a difference: the tiny model still performs the best. Do you have any insights?

Thank you for your time and attention.

ashawkey (Collaborator) commented

@chenguolin Hi, how do you evaluate the performance? The loss/PSNR curves may not fairly reflect the quality across these settings, as they are evaluated at different output resolutions.

chenguolin commented Mar 13, 2024

Thanks for your reply!

Yes, that's correct. However, for the tiny and small versions, the rendering image size remains the same at 256, and I have set the number of rendering views (opt.num_views) to 8 for both. The tiny model still outperforms the small version in both LPIPS and PSNR, even after adjusting the learning rate of the small version. The tiny model works perfectly for me, so the training process should be correct (?).

I'm interested to know whether others have encountered the same issue. I'm also curious whether this could be due to the change in training dataset size, and whether more careful hyperparameter selection is needed. Additionally, I plan to train the different versions of LGM on the 80k dataset (https://github.com/ashawkey/objaverse_filter) later.
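
(For what it's worth, a minimal sketch of a resolution-matched comparison, not the repo's evaluation code; the resize step and metric choices are assumptions:)

    import torch
    import torch.nn.functional as F
    import lpips

    lpips_fn = lpips.LPIPS(net='vgg')  # move to the evaluation device as needed

    @torch.no_grad()
    def metrics_at_fixed_resolution(pred, gt, size=256):
        # pred, gt: [B, 3, H, W] renderings in [0, 1]; resize both to a common
        # resolution so tiny/small/big models are compared on equal footing
        pred = F.interpolate(pred, size=(size, size), mode='bilinear', align_corners=False)
        gt = F.interpolate(gt, size=(size, size), mode='bilinear', align_corners=False)
        mse = F.mse_loss(pred, gt)
        psnr = -10.0 * torch.log10(mse)  # assumes a peak value of 1.0
        perceptual = lpips_fn(pred * 2 - 1, gt * 2 - 1).mean()
        return psnr.item(), perceptual.item()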

ashawkey (Collaborator) commented

This is interesting; actually I haven't done a lot of experiments with the small and tiny models. Have you tried testing and visualizing the generated Gaussians with ImageDream-generated multi-view images? The tiny model also produces a smaller number of Gaussians, which I found can potentially harm performance.

wenqsun (Author) commented Mar 14, 2024

@chenguolin Oh, that's amazing! Actually, I only trained LGM using the big config.

I am wondering whether you have compared the final performance of the tiny, small, and big configs on the test dataset. In my experiments, I found that sometimes, even though the training loss curve does not decrease for a long time, the test loss still decreases gradually (sometimes with quite large changes). Maybe you can keep training longer and see what happens for these three models.

This is a really interesting finding!

TianFangzheng commented Jun 1, 2024

@wenqsun @ashawkey Hi, sorry to bother you. I am trying to train a model on the gobjaverse dataset, but the training results are poor. I think there may be something wrong with the camera parameters, which I read from the json files. Is the way I load the camera parameters correct? Thank you for your time.

The following is the code related to the camera parameters:

        # default camera intrinsics
        self.tan_half_fov = np.tan(0.5 * np.deg2rad(self.opt.fovy))
        self.proj_matrix = torch.zeros(4, 4, dtype=torch.float32)
        self.proj_matrix[0, 0] = 1 / self.tan_half_fov
        self.proj_matrix[1, 1] = 1 / self.tan_half_fov
        self.proj_matrix[2, 2] = (self.opt.zfar + self.opt.znear) / (self.opt.zfar - self.opt.znear)
        self.proj_matrix[3, 2] = - (self.opt.zfar * self.opt.znear) / (self.opt.zfar - self.opt.znear)
        self.proj_matrix[2, 3] = 1

        ......

            with open(meta_path, 'r', encoding='utf-8') as file:
                meta = json.load(file)

            c2w = np.eye(4)
            c2w[:3, 0] = np.array(meta['x'])
            c2w[:3, 1] = np.array(meta['y'])
            c2w[:3, 2] = np.array(meta['z'])
            c2w[:3, 3] = np.array(meta['origin'])
            c2w = torch.tensor(c2w, dtype=torch.float32).reshape(4, 4)

            # blender world + opencv cam --> opengl world & cam
            c2w[1] *= -1
            c2w[[1, 2]] = c2w[[2, 1]]
            c2w[:3, 1:3] *= -1  # invert up and forward direction

            cam_poses.append(c2w)

            vid_cnt += 1
            if vid_cnt == self.opt.num_views:
                break

        if vid_cnt < self.opt.num_views:
            print(f'[WARN] dataset {uid}: not enough valid views, only {vid_cnt} views found!')
            n = self.opt.num_views - vid_cnt
            cam_poses = cam_poses + [cam_poses[-1]] * n

        cam_poses = torch.stack(cam_poses, dim=0)  # [V, 4, 4]

        # normalized camera feats as in paper (transform the first pose to a fixed position)
        radius = torch.norm(cam_poses[0, :3, 3])
        cam_poses[:, :3, 3] *= self.opt.cam_radius / radius
        transform = torch.tensor([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, self.opt.cam_radius], [0, 0, 0, 1]], dtype=torch.float32) @ torch.inverse(cam_poses[0])
        cam_poses = transform.unsqueeze(0) @ cam_poses  # [V, 4, 4]

        # opengl to colmap camera for gaussian renderer
        cam_poses[:, :3, 1:3] *= -1
        cam_view = torch.inverse(cam_poses).transpose(1, 2)  # [V, 4, 4]
        cam_view_proj = cam_view @ self.proj_matrix  # [V, 4, 4]
        cam_pos = -cam_poses[:, :3, 3]  # [V, 3]

        results['cam_view'] = cam_view
        results['cam_view_proj'] = cam_view_proj
        results['cam_pos'] = cam_pos
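
(As a side note, a quick sanity check, not from the repo, that can be dropped in right after the block above: the normalization maps the first camera to [0, 0, cam_radius], and the OpenGL-to-colmap flip only negates rotation columns, so the camera positions and the orthonormality of the rotations should survive it, assuming the x/y/z axes in the json form a proper rotation.)

        # sanity check (not from the repo): after normalization the first camera
        # should sit at [0, 0, cam_radius]; the opengl->colmap flip above only
        # negates two rotation columns, so positions and orthonormality are unchanged
        expected_pos = torch.tensor([0.0, 0.0, self.opt.cam_radius])
        assert torch.allclose(cam_poses[0, :3, 3], expected_pos, atol=1e-4)
        R = cam_poses[:, :3, :3]
        assert torch.allclose(R @ R.transpose(1, 2), torch.eye(3).expand_as(R), atol=1e-4)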
