
can it do super resolution? #3

Open
francqz31 opened this issue Apr 4, 2024 · 3 comments

Comments

@francqz31

Can VAR do super resolution, like GigaGAN's super resolution for example? GigaGAN is the most impressive super-resolution algorithm so far.
And if yes, would you be able to add support for it later, next month or so?

@keyu-tian
Collaborator

VAR supports zero-shot super resolution. Although it might not rival the GigaGAN upsampler, we're planning to release a demo for testing in the coming days. Stay tuned for updates!

@judywxy1122

Hi keyu,

I'm Bingyue's friend and I'm very impressed with this work!

I have a question regarding large images with super-high resolution.

First, let me try to understand the fundamental logic. Correct me if I'm wrong.

  1. The basic idea is to establish a self-supervised learning mechanism. In VAR, we follow the process
    raw img
    -> embedding f
    -> forward: (r_K -> ... -> r_1)
    -> backward: (r_1 -> ... -> r_K)
    -> recovered embedding f^
    -> reconstructed img

i.e. from fine to coarse and then inversely from coarse to fine.

  2. The learning is based on a probabilistic generative model with the conditional generation probabilities
    P(r_k | r_(k-1), ..., r_1) for k = 1, ..., K, with r_0 as a pre-defined start (i.e. guidance).
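The multi-scale residual structure in step 1 can be sketched as a toy example. This is only an illustration of the decompose/reconstruct idea under simplifying assumptions (continuous unquantized maps, average-pool downsampling, nearest-neighbor upsampling); the actual VAR tokenizer quantizes each map with a codebook, so none of this is VAR's implementation:

```python
import numpy as np

def down(x, s):
    # average-pool a square (H, H) map down to (s, s)
    H = x.shape[0]
    k = H // s
    return x.reshape(s, k, s, k).mean(axis=(1, 3))

def up(x, H):
    # nearest-neighbor upsample an (s, s) map back to (H, H)
    k = H // x.shape[0]
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def decompose(f, scales):
    # residual maps r_1 ... r_K of the feature map f, coarsest scale first
    H = f.shape[0]
    residual, maps = f.copy(), []
    for s in scales:                      # e.g. (1, 2, 4, 8)
        r = down(residual, s)
        maps.append(r)
        residual = residual - up(r, H)    # subtract what this scale explains
    return maps

def reconstruct(maps, H):
    # sum the upsampled residual maps to recover f_hat
    return sum(up(r, H) for r in maps)

f = np.random.rand(8, 8)
maps = decompose(f, (1, 2, 4, 8))
f_hat = reconstruct(maps, 8)              # exact here, since nothing is quantized
```

Because the last scale equals the full resolution and there is no quantization, `f_hat` recovers `f` exactly; in the quantized setting each r_k would instead be the codebook approximation of its residual.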

Based on this understanding, for a large image with super-high resolution, we could set the dimension of the embedding vector f higher for more representational capability.

Considering mainstream techniques like the one in the paper “Scalable Diffusion Models with Transformers”, one approach is to “patchify” the raw image into patches (i.e. tokens) and then find the “best” embedding of each patch via transformer-based learning. When each token embedding is decoded back into a “predicted” patch, all the “predicted” patches can be reassembled to recover the whole image.
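The patchify/unpatchify step can be sketched minimally in NumPy (function names are illustrative, not taken from the DiT codebase):

```python
import numpy as np

def patchify(img, p):
    # split an (H, W, C) image into (H//p * W//p) tokens of dimension p*p*C
    H, W, C = img.shape
    x = img.reshape(H // p, p, W // p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def unpatchify(tokens, H, W, C, p):
    # reassemble the patch tokens back into the (H, W, C) image
    x = tokens.reshape(H // p, W // p, p, p, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

img = np.arange(4 * 4 * 3, dtype=float).reshape(4, 4, 3)
tokens = patchify(img, 2)                 # 4 tokens, each of dimension 12
restored = unpatchify(tokens, 4, 4, 3, 2)
```

`unpatchify` is the exact inverse of `patchify`; in the model, the tokens in between would be produced by the transformer rather than copied through.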

Now, the QUESTION is: can we also do the “patchify” and then apply the fine→coarse→fine process to each patch and then reorganize the “predicted” patches to recover the whole image?

Not quite sure which of the two methods is better. I mean
a) setting the dimension of the embedding vector f higher for more representational capability;
b) patchifying the raw image into patches, working on each patch, and then piecing the “predicted” patches together.

One concern about the “patchify” in method b) is that the pieced-together result may show seams when the optimization has not yet converged. Note that breaking a whole image into pieces destroys the spatial relationships between the pieces. Method a) does not need to deal with the piecing-together problem because the embedding covers the whole image.

Best,

Xugang Ye

@keyu-tian
Collaborator

@judywxy1122 Thank you for your kind words! The question is a bit detailed; let me give it some thought and I'll get back to you shortly.
