
Text and Vision tokens different from CLIP #17

Open
MLRadfys opened this issue Mar 7, 2024 · 0 comments
MLRadfys commented Mar 7, 2024

Hi and thanks for all the work done in this repository!

I noticed that the CLS token in the vision transformer, as well as the tokens used in the text transformer, are implemented differently from CLIP in your implementation.

As far as I understand, CLIP prepends a learnable CLS embedding token before the patch embeddings are sent through the transformer. In this repo, it seems like the mean is computed over all patch embeddings instead, meaning there is no CLS token with learnable parameters.
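To make sure I'm describing the difference correctly, here is a rough sketch of the two pooling variants I mean (names are illustrative only, not taken from this repo):

```python
import torch
import torch.nn as nn

class LearnableCLSPooling(nn.Module):
    """CLIP-style: prepend a learnable CLS token and read it out after the transformer."""
    def __init__(self, dim):
        super().__init__()
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

    def forward(self, patch_embeddings, transformer):
        b = patch_embeddings.shape[0]
        cls = self.cls_token.expand(b, -1, -1)               # (b, 1, dim)
        tokens = torch.cat([cls, patch_embeddings], dim=1)   # (b, 1 + n_patches, dim)
        tokens = transformer(tokens)
        return tokens[:, 0]                                  # CLS readout


def mean_pooling(patch_embeddings, transformer):
    """What this repo seems to do: no learnable CLS, just average the patch tokens."""
    tokens = transformer(patch_embeddings)
    return tokens.mean(dim=1)
```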

In addition, CLIP uses a [SOS] and an [EOS] token, which are combined with the token embeddings at the beginning and end, respectively. In your implementation, the text transformer uses a single CLS token.

I am trying to make use of the FILIP part and incorporate it into the OpenAI implementation of CLIP. Unfortunately, I am somewhat unsure about how to handle the text tokens in the fine-grained loss.

When comparing patch token embeddings to text token embeddings, should I ignore both the [SOS] and the [EOS] tokens?
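To make the question concrete, here is roughly what I have in mind for the token-wise (FILIP-style) similarity, with a mask that would drop the [SOS]/[EOS] (and padding) positions. All names here are placeholders, not taken from either codebase:

```python
import torch

def fine_grained_similarity(image_tokens, text_tokens, text_mask):
    """
    image_tokens: (b, n_patches, dim), L2-normalized
    text_tokens:  (b, n_text, dim),    L2-normalized
    text_mask:    (b, n_text) bool, False at [SOS]/[EOS]/padding positions
    """
    # token-wise similarity between every patch and every text token
    sim = torch.einsum('bid,bjd->bij', image_tokens, text_tokens)  # (b, n_patches, n_text)

    # ignore masked text tokens when taking the per-patch maximum
    sim = sim.masked_fill(~text_mask[:, None, :], float('-inf'))

    # image-to-text: max over text tokens, mean over patches
    image_to_text = sim.max(dim=-1).values.mean(dim=-1)            # (b,)

    # text-to-image: max over patches, mean over the unmasked text tokens
    sim_t = sim.max(dim=1).values                                  # (b, n_text)
    sim_t = sim_t.masked_fill(~text_mask, 0.0)
    text_to_image = sim_t.sum(dim=-1) / text_mask.sum(dim=-1).clamp(min=1)

    return image_to_text, text_to_image
```

Is masking the special tokens out like this the intended way to do it?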

Thanks in advance,

kind regards,

M
