Suggest your favorite papers to add! #1

lucidrains opened this issue Dec 1, 2021 · 20 comments

@lucidrains (Owner) commented Dec 1, 2021

will start with:

  1. FILIP https://arxiv.org/abs/2111.07783
  2. CLOOB https://arxiv.org/abs/2110.11316
  3. https://arxiv.org/abs/2110.05208
@Mut1nyJD commented Dec 1, 2021

Florence https://arxiv.org/abs/2111.11432

@afiaka87 commented Dec 1, 2021

Would it be possible to explicitly target the same API that OpenAI created for their CLIP? That way this could be used as a drop-in replacement in e.g. CLIP-guidance notebooks (and anywhere else CLIP is used, which is a lot of places).

I think this would basically amount to using the same function signatures for clip.load(), encode_image, encode_text, etc. Not sure how limiting that could be in practice.
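
For reference, a minimal sketch of what targeting that surface could look like; the XCLIPWrapper class and load() helper below are hypothetical glue code (not the existing x-clip API), written to mirror openai/CLIP's clip.load() returning (model, preprocess) plus encode_image / encode_text:

import torch
from torch import nn
from torchvision import transforms

class XCLIPWrapper(nn.Module):
    # hypothetical adapter exposing the openai/CLIP method names
    def __init__(self, xclip_model: nn.Module):
        super().__init__()
        self.xclip = xclip_model

    @torch.no_grad()
    def encode_image(self, images: torch.Tensor) -> torch.Tensor:
        # assumes the wrapped model has an image-embedding method to delegate to
        return self.xclip.encode_image(images)

    @torch.no_grad()
    def encode_text(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.xclip.encode_text(tokens)

def load(checkpoint_path: str, device: str = "cpu"):
    # mirrors clip.load()'s (model, preprocess) return convention
    model = torch.load(checkpoint_path, map_location=device)
    preprocess = transforms.Compose([
        transforms.Resize(224),
        transforms.CenterCrop(224),
        transforms.ToTensor(),  # the real CLIP preprocess also normalizes with its own mean/std
    ])
    return XCLIPWrapper(model).to(device), preprocess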

@lucidrains (Owner, Author)

sure! but i'm also thinking of extending this to any number of modalities (audio, biosequences, etc)

@rom1504 commented Dec 6, 2021

LiT: Zero-Shot Transfer with Locked-image Text Tuning https://arxiv.org/abs/2111.07991. In particular, I think it would be interesting to be able to somehow transfer the weights of existing models (the CLIP image and text encoders, but also other pretrained encoders) into this implementation and then continue training.
Do you think there could be a good way to do that?

@RenShuhuai-Andy

MURAL: Multimodal, Multitask Retrieval Across Languages: https://arxiv.org/abs/2109.05125

@haofanwang

Combined Scaling for Zero-shot Transfer Learning

https://arxiv.org/abs/2111.10050

@lucidrains (Owner, Author)

> LiT: Zero-Shot Transfer with Locked-image Text Tuning https://arxiv.org/abs/2111.07991. In particular, I think it would be interesting to be able to somehow transfer the weights of existing models (the CLIP image and text encoders, but also other pretrained encoders) into this implementation and then continue training. Do you think there could be a good way to do that?

yup, i think it'll end up something like

clip = CLIP(
    vision_model = vit_transformer,
    text_model = text_transformer,
    ...
)
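
For the LiT-style transfer rom1504 mentioned, a minimal sketch of how a pretrained, frozen image tower could be plugged in; the encoders below are toy stand-ins and the commented-out keyword names just mirror the illustrative sketch above, not a confirmed x-clip signature:

import torch
from torch import nn

# toy stand-ins; in practice these would be pretrained encoders loaded from checkpoints
image_encoder = nn.Linear(768, 512)   # e.g. the projection head of a pretrained ViT
text_encoder = nn.Linear(768, 512)    # e.g. a pretrained text transformer head

# "lock" the image tower, as in LiT: freeze it and train only the text side
for p in image_encoder.parameters():
    p.requires_grad = False

# hypothetical wiring, mirroring the sketch above:
# clip = CLIP(vision_model = image_encoder, text_model = text_encoder, ...)

# the optimizer then only sees the parameters that still require gradients
trainable_params = [p for p in text_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr = 3e-4)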

@antofuller

CLIP-Lite: Information Efficient Visual Representation Learning from Textual Annotations: https://arxiv.org/pdf/2112.07133.pdf

@afiaka87 commented Dec 18, 2021

RegionCLIP: https://arxiv.org/abs/2112.09106v1

They encourage region-level representations by using the released CLIP model both to detect objects and to generate region-level captions for objects in a scene, which then becomes the dataset for fine-tuning on an object detection task. Still reading, but I believe it's a Microsoft paper.

@batrlatom

Hi, I would just like to ask if it is possible to make your models scriptable? It looks like the lambda functions make that problematic for a normal user. The good thing about TorchScript is that the model could then be exported to ONNX, TensorRT, etc.
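
For context on why the lambdas get in the way: torch.jit.script generally cannot compile Python lambdas held by a module, while equivalent small nn.Module wrappers keep the whole model scriptable. A generic sketch (not x-clip's actual internals):

import torch
from torch import nn

class Residual(nn.Module):
    # replaces something like `lambda x: fn(x) + x`, which TorchScript cannot compile
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fn(x) + x

model = nn.Sequential(
    nn.Linear(16, 16),
    Residual(nn.Linear(16, 16)),
)

scripted = torch.jit.script(model)          # succeeds: every submodule is scriptable
print(scripted(torch.randn(2, 16)).shape)   # torch.Size([2, 16])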

@rom1504 commented Dec 24, 2021

https://github.com/facebookresearch/SLIP: they combine the losses of CLIP (vision + language) and SimCLR (vision only) and get better zero-shot accuracy with a 15M-image dataset than CLIP trained on the same dataset.
Hopefully the accuracies would be even better at large scale.
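
A minimal sketch of the combined objective described there; the function names are illustrative, not the facebookresearch/SLIP code. It is just the usual CLIP InfoNCE loss between image and text embeddings plus a SimCLR-style NT-Xent loss between two augmented views of the same images:

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # symmetric InfoNCE over the image-text similarity matrix
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def simclr_loss(view1_emb, view2_emb, temperature=0.1):
    # NT-Xent: each view's positive is the other augmented view of the same image
    n = view1_emb.shape[0]
    z = F.normalize(torch.cat([view1_emb, view2_emb]), dim=-1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))  # mask self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def slip_style_loss(image_emb, text_emb, view1_emb, view2_emb, ssl_scale=1.0):
    return clip_loss(image_emb, text_emb) + ssl_scale * simclr_loss(view1_emb, view2_emb)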

@rom1504 commented Dec 25, 2021

https://github.com/FreddeFrallan/Multilingual-CLIP works pretty well even though they used very few resources.
Basically they took an existing text model and aligned it with the existing CLIP image encoder.

Here's one example showing it works well :

Searching for blue dress in korean

With clip

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion_400m_128G&useMclip=false&query=%ED%8C%8C%EB%9E%80+%EB%93%9C%EB%A0%88%EC%8A%A4

With mclip

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2Fknn.laion.ai&index=laion_400m_128G&useMclip=true&query=%ED%8C%8C%EB%9E%80+%EB%93%9C%EB%A0%88%EC%8A%A4

(Many other examples can be tried in that UI.)

I think we may be able to learn something from their approach

Edit: in practice, I believe we already have what we need in the code here: the ability to plug in a custom text encoder.
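
A minimal sketch of that alignment step (illustrative names, not the Multilingual-CLIP code): keep a pretrained CLIP text encoder frozen as the teacher and train a new multilingual text encoder to reproduce its embeddings on paired captions, e.g. English originals and their translations:

import torch
from torch import nn

def align_step(multilingual_encoder: nn.Module,
               frozen_clip_text_encoder: nn.Module,
               tokens_multilingual: torch.Tensor,
               tokens_english: torch.Tensor,
               optimizer: torch.optim.Optimizer) -> float:
    with torch.no_grad():
        target = frozen_clip_text_encoder(tokens_english)  # teacher embedding in CLIP's text space
    pred = multilingual_encoder(tokens_multilingual)       # student embedding
    loss = nn.functional.mse_loss(pred, target)            # pull the student into the same space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()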

@haofanwang commented Jan 25, 2022

https://arxiv.org/abs/2112.09133

Any plan to implement MaskFeat? @lucidrains

@lucidrains (Owner, Author)

@haofanwang ohh nope, this doesn't look like it is related to contrastive learning

i could add it to https://github.com/lucidrains/vit-pytorch, but i'd have to understand HOGs better

@transformers007

@lucidrains BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, code

@lucidrains (Owner, Author)

> @lucidrains BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation, code

this is a great paper :) but it also already came with code!

@MicPie commented Apr 24, 2022

Hi @lucidrains ,

I hope you are doing fine!
We miss you over at the EAI discord!

This could be very interesting for x-clip:
“FLAVA - A Foundational Language And Vision Alignment Model”, https://arxiv.org/abs/2112.04482

However, the official code seems to be on the way too: facebookresearch/mmf#1219 (comment) & https://github.com/facebookresearch/multimodal

All the best,
Michael

@lucidrains (Owner, Author)

@MicPie hey Michael! miss you too ❤️ thanks for the share, i'll give it a read later tonight after i finish some code

@MicPie commented May 5, 2022

Looks interesting:
"CoCa - Contrastive Captioners are Image-Text Foundation Models"
https://arxiv.org/abs/2205.01917

“Unlike standard decoder transformers, CoCa omits cross-attention in the first half of the decoder layers to encode unimodal text representations, and cascades the rest of the decoder layers, cross-attending to the image encoder for multimodal image-text representations.”
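
A rough sketch of the layer layout that quote describes; this is illustrative, not the CoCa implementation. The first half of the layers uses causal self-attention only (unimodal text), and the second half additionally cross-attends to the image encoder output (multimodal):

import torch
from torch import nn

class CoCaStyleTextDecoder(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 12, heads: int = 8):
        super().__init__()
        half = depth // 2
        # first half: causal self-attention only, no cross-attention
        self.unimodal_layers = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, batch_first=True) for _ in range(half)])
        # second half: also cross-attends to the image tokens
        self.multimodal_layers = nn.ModuleList(
            [nn.TransformerDecoderLayer(dim, heads, batch_first=True) for _ in range(half)])

    def forward(self, text_tokens: torch.Tensor, image_tokens: torch.Tensor):
        seq = text_tokens.shape[1]
        causal = torch.full((seq, seq), float('-inf'), device=text_tokens.device).triu(1)
        x = text_tokens
        for layer in self.unimodal_layers:
            x = layer(x, src_mask=causal)
        unimodal_text = x  # would feed the contrastive (CLIP-style) loss
        for layer in self.multimodal_layers:
            x = layer(x, image_tokens, tgt_mask=causal)
        return unimodal_text, x  # the multimodal output would feed the captioning loss

decoder = CoCaStyleTextDecoder()
unimodal, multimodal = decoder(torch.randn(2, 16, 512), torch.randn(2, 49, 512))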

@jwyang commented May 21, 2022

> Florence https://arxiv.org/abs/2111.11432

Please refer to our UniCL repo for the core algorithm used in Florence: https://github.com/microsoft/UniCL
