
Discrepancy between Hugging Face and fashion-clip #14

Open
thomas-woodruff opened this issue Jun 21, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@thomas-woodruff

Hello there,

I was looking into the difference in performance between the Hugging Face implementation of FashionCLIP and this repo, which wraps around the former.

I noticed there's a discrepancy between the image embeddings produced by the two approaches. Having dug into it, it looks like the cause is that in this repo the images are put into a Hugging Face Dataset here before being passed to the model.

The code below illustrates the discrepancy:

from transformers import CLIPProcessor, CLIPModel
from fashion_clip.fashion_clip import FashionCLIP
import torch
from datasets import Dataset

model_name = "patrickjohncyh/fashion-clip"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# embed images directly with the Hugging Face processor and model
def get_image_embeddings_without_dataset(images):
    inputs = processor(images=images, return_tensors='pt')
    
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    return embeddings.numpy()

# round-trip the images through a Hugging Face Dataset (this mirrors what fashion-clip does internally)
def pass_images_through_data(images):
    dataset = Dataset.from_dict({'image': images})
    images = dataset['image']
    return images

def get_image_embeddings_with_dataset(images):
    images = pass_images_through_data(images)
    return get_image_embeddings_without_dataset(images)

# `images` is assumed to be a list of PIL.Image objects loaded beforehand
hf_ds_embeddings = get_image_embeddings_with_dataset(images)
hf_wo_embeddings = get_image_embeddings_without_dataset(images)

fclip = FashionCLIP('fashion-clip')
fc_embeddings = fclip.encode_images(images, batch_size=32)  # batch_size value is arbitrary

In the above code, the embeddings produced by passing the images through a Dataset, hf_ds_embeddings, are the same as those produced by this repo, fc_embeddings. The embeddings produced without using a Dataset, hf_wo_embeddings, are slightly different.
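
A quick way to quantify the gap (a sketch, assuming the variables from the snippet above are still in scope):

import numpy as np

# the Dataset-based embeddings match the fashion-clip wrapper
print(np.abs(hf_ds_embeddings - fc_embeddings).max())      # expected to be ~0
# the embeddings computed without the Dataset differ slightly
print(np.abs(hf_ds_embeddings - hf_wo_embeddings).max())   # small but non-zero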

I imagine that putting the images into the dataset is implicitly applying some transformation or pre-processing.
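
One way to check whether the round trip through a Dataset changes the pixel data itself; a minimal sketch, assuming `images` is a list of PIL.Image objects:

import numpy as np
from datasets import Dataset

ds = Dataset.from_dict({'image': images})
for original, roundtripped in zip(images, ds['image']):
    # if these differ, the Dataset's Image feature is re-encoding or converting
    # the images on the way through, which would explain the embedding gap
    print(np.array_equal(np.array(original), np.array(roundtripped)),
          original.mode, roundtripped.mode)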

Just wanted to flag this, thanks!

@vinid
Collaborator

vinid commented Jun 28, 2023

I am surprised, because both methods seem to use the same transformation, but I'll take a look! Thanks!!

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

This looks like a similar issue:

import requests
from PIL import Image
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = requests.get('https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg').content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  
print(probs)
# `fclip` is assumed to be an already-initialised FashionCLIP('fashion-clip') wrapper
test_captions = ['drawstring waist', 'paperbag waist', 'waist band']
test_img_path = 'paperbag_waist.jpg'
# display_images([test_img_path])
fclip.zero_shot_classification([test_img_path], test_captions)

The probs generated here and by the Hugging Face hosted inference UI seem to be different: https://huggingface.co/patrickjohncyh/fashion-clip. I believe both should ideally output the same probabilities for the same input image? Are they both using the latest v2 models?

Both of the above methods wrongly classify the image as 'drawstring waist', but it is correctly identified by the HF hosted inference API.

[Screenshot: hf_fashion_clip]

@vinid
Collaborator

vinid commented Jul 4, 2023

Hi @anilsathyan7!

I am not sure how the UI computes the score; in the meantime, I have run your example on both the original HF API and our internal wrapper and the results are more or less the same. Take a look:

# reusing the imports, model and processor from the snippet above;
# `fclip` further down is the FashionCLIP('fashion-clip') wrapper
img_url = "https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg"
image = requests.get(img_url).content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  
print(probs)

>>> [0.1976, 0.0051, 0.7973]

import numpy as np
import torch

test_captions = ['paperbag waist', 'waist band', 'drawstring waist']
test_img_path = 'paperbag_waist.jpg'

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

# note that we need to include logit scaling to get the same output the default Hugging Face model gives us
logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.1976, 0.0051, 0.7972]

These are reasonably similar scores.

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

@vinid OK, that's strange. The hosted API clearly classifies the image as 'paperbag waist' with probability 0.943. That's a large difference, and the Hosted Inference API output is actually the correct one. What could be the reason for this?

@vinid
Collaborator

vinid commented Jul 4, 2023

It's an effect of prompting: by default the pipeline component (which the UI uses) wraps each label in the template "This is a photo of {}." See here.

# same setup as above: fclip wrapper, numpy as np, torch
test_img_path = 'paperbag_waist.jpg'
test_captions = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)


logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.6159, 0.0288, 0.3552]

(You have some typos in your screenshot; you should remove the stray ' characters.)
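
For reference, the UI's pipeline component applies that template to every label before scoring; a minimal sketch of invoking the transformers zero-shot pipeline directly (the hypothesis_template value shown is the pipeline's documented default, spelled out here for clarity):

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="patrickjohncyh/fashion-clip")

# each candidate label is wrapped in the hypothesis template before scoring,
# which is why the widget's probabilities differ from scoring the bare labels
print(classifier("paperbag_waist.jpg",
                 candidate_labels=["paperbag waist", "waist band", "drawstring waist"],
                 hypothesis_template="This is a photo of {}."))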

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

@vinid Thanks a lot!
Even just changing the full stop in the caption completely changes the result.
Prompt engineering! 😅
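
A quick way to see that sensitivity, as a sketch reusing `fclip` and `test_img_path` from above:

with_period = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']
without_period = [caption.rstrip('.') for caption in with_period]

# only the trailing full stop changes, yet the prediction can come out differently
print(fclip.zero_shot_classification([test_img_path], with_period))
print(fclip.zero_shot_classification([test_img_path], without_period))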

vinid added the bug (Something isn't working) label on Jul 11, 2023
@dalphajw

Great find! I was just thinking the same thing and was pleasantly surprised to stumble onto this insightful thread.
In my time using FashionCLIP, I did find that the "photo of" trick works quite well, but I didn't know it was the reason for the discrepancy. Thanks all!
