Kosmos-2.5 implementation in transformers #30877

Natyren · 2024-05-17T10:28:14Z

Model description

Hello everyone,

The Kosmos-2.5 is a multimodal literate model that can be used for tasks such as OCR and text-rich image comprehension. It includes a ViT encoder, a Resampler, and a shared decoder module. To the best of my knowledge, the architecture of this model is similar to Kosmos-2 but has some differences. Due to these differences, using this model in Transformers requires a standalone implementation.

Open source status

The model implementation is available
The model weights are available

Provide useful links for the implementation

Paper: https://arxiv.org/pdf/2309.11419
Code: https://github.com/microsoft/unilm/tree/master/kosmos-2.5
Authors: @Dod-o @wolfshow

amyeroberts · 2024-05-17T12:19:43Z

cc @ydshieh

Natyren · 2024-05-17T13:25:45Z

I would like to assist with this implementation. If there are any guidelines on how to do it effectively, I would like to join implementation process

amyeroberts · 2024-05-20T10:49:32Z

Hi @Natyren, there's a guide in the documentation here: https://huggingface.co/docs/transformers/add_new_model

ydshieh · 2024-05-29T14:26:10Z

Hi @Natyren, sorry for the late reply. I am thinking to talk to the model author about if they are interested in porting this model into transformers. I will come back to you here for the updates.

Natyren added the New model label May 17, 2024

amyeroberts added the Multimodal label May 17, 2024

ydshieh self-assigned this May 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kosmos-2.5 implementation in transformers #30877

Kosmos-2.5 implementation in transformers #30877

Natyren commented May 17, 2024 •

edited

amyeroberts commented May 17, 2024

Natyren commented May 17, 2024

amyeroberts commented May 20, 2024

ydshieh commented May 29, 2024

Kosmos-2.5 implementation in transformers #30877

Kosmos-2.5 implementation in transformers #30877

Comments

Natyren commented May 17, 2024 • edited

Model description

Open source status

Provide useful links for the implementation

amyeroberts commented May 17, 2024

Natyren commented May 17, 2024

amyeroberts commented May 20, 2024

ydshieh commented May 29, 2024

Natyren commented May 17, 2024 •

edited