VideoMAE might not be a useful model for video reconstruction? Or perhaps it only learns the most generic distribution within patches? #120

apptcom1123 opened this issue Apr 8, 2024 · 1 comment

@apptcom1123

I attempted to use VideoMAE for video reconstruction tasks, and while the reconstructed videos looked roughly correct at a larger scale, they were blurry at the patch level.

Initially, I assumed this was a normal consequence of the MSE loss, since it is well known that this loss tends to blur fine detail. However, when I set the patch size to 1, the model produced extremely good results, with almost perfect detail.
At first I took this as a sign of the model's strength, or that my training videos were sufficient, but I was mistaken: even when I reduced the training set to 20 videos, the model still achieved very good results.

Then I trained the model on a set of black-and-white, static videos. It also produced similarly good results on the test set. That seemed unreasonable, especially since the trained model was very small (around 500 KB). So I started printing out each layer's output to find what was producing these unreasonably good results, and eventually traced the problem to this line:

# un-normalize the prediction with the per-patch std/mean of the ORIGINAL frames
rec_img = rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6) + img_squeeze.mean(dim=-2, keepdim=True)

img_squeeze is a rearrangement of the original frames, so multiplying rec_img by the per-patch standard deviation and then adding the per-patch mean pulls the output, good or bad, very close to the original image. This is especially noticeable when patch_size = 1, and it also explains why the boundaries between patches in the original paper's visualizations look so mismatched.
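A quick way to see the leak (a toy sketch with made-up shapes, not the repo's exact code): feed the de-normalization an all-zero "prediction". With a spatial patch size of 1 and a static two-frame clip, the per-patch mean is just the original pixel value, so the output reproduces the input exactly even though the model contributed nothing.

```python
import torch
from einops import rearrange

def unnormalize(rec_img, img_squeeze):
    # same de-normalization as above: paste back the per-patch std/mean of the ORIGINAL clip
    return (rec_img * (img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt() + 1e-6)
            + img_squeeze.mean(dim=-2, keepdim=True))

frame = torch.rand(1, 3, 1, 16, 16)
video = frame.repeat(1, 1, 2, 1, 1)      # a "static" 2-frame clip, shape (B, C, T, H, W)

# tubelet size 2, spatial patch size 1 -> each "patch" holds 2 temporally adjacent pixels
img_squeeze = rearrange(video, 'b c (t p0) (h p1) (w p2) -> b (t h w) (p0 p1 p2) c',
                        p0=2, p1=1, p2=1)

rec_img = torch.zeros_like(img_squeeze)  # "model output": all zeros, i.e. no learning at all
out = unnormalize(rec_img, img_squeeze)

print((out - img_squeeze).abs().max().item())  # ~0: the input comes back from the statistics alone
```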

However, this objective only makes the model learn the distribution of normalized pixels within each patch, i.e. match each patch of videos_norm as closely as possible at every position.
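For reference, the training target is built by normalizing each patch of the original clip with its own mean and std, roughly as below (a sketch, not the repo's exact code), so hitting the right per-patch distribution is all the loss asks for:

```python
from einops import rearrange

# img_squeeze: (B, num_patches, pixels_per_patch, C), rearranged from the original clip
mean = img_squeeze.mean(dim=-2, keepdim=True)
std = img_squeeze.var(dim=-2, unbiased=True, keepdim=True).sqrt()
videos_norm = (img_squeeze - mean) / (std + 1e-6)               # per-patch normalized target
videos_patch = rearrange(videos_norm, 'b n p c -> b n (p c)')   # what the MSE loss compares against
```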

So the good-looking results are not evidence of a good model architecture: the de-normalization step pastes the original statistics back in, which keeps the output close to the original video no matter how small the training dataset is.

Am I misunderstanding something?

If not, VideoMAE might not be a useful model for video reconstruction.


@wanglimin
Contributor

Your understanding is right. The normalized pixel loss comes from the original image MAE paper. If you want to use VideoMAE for reconstruction, you could refer to the original image MAE repo, which tries other losses such as a GAN loss; that will give a more reasonable reconstruction result.
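Not from the maintainers, just a minimal sketch of the simplest change if pixel-level reconstruction is the actual goal: train against raw pixel patches instead of the per-patch normalized target, so no statistics of the original clip need to be pasted back in (accepting the usual MSE blurriness), and optionally add a perceptual or adversarial term on top as suggested above. Names and shapes below are hypothetical.

```python
import torch.nn.functional as F

# hypothetical shapes: pred and videos_patch_raw are (B, num_patches, pixels_per_patch * C);
# mask is a boolean (B, num_patches) tensor marking the masked patches
def raw_pixel_loss(pred, videos_patch_raw, mask):
    # regress raw pixels rather than videos_norm, so nothing of the original
    # clip's statistics has to be added back at visualization time
    return F.mse_loss(pred[mask], videos_patch_raw[mask])
```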
