
Discrepancy between Hugging Face and fashion-clip #14

Open
thomas-woodruff opened this issue Jun 21, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@thomas-woodruff

Hello there,

I was looking into the difference in performance between the Hugging Face implementation of FashionCLIP and this repo, which wraps around the former.

I noticed there's a discrepancy between the image embeddings produced by the two approaches. Having dug into it, it looks like the cause is that in this repo the images are put into a Hugging Face Dataset here before being passed to the model.

The code below illustrates the discrepancy:

from transformers import CLIPProcessor, CLIPModel
from fashion_clip.fashion_clip import FashionCLIP
import torch
from datasets import Dataset

model_name = "patrickjohncyh/fashion-clip"
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

# embed images directly with the Hugging Face processor and model
def get_image_embeddings_without_dataset(images):
    inputs = processor(images=images, return_tensors='pt')
    
    with torch.no_grad():
        embeddings = model.get_image_features(**inputs)

    return embeddings.numpy()

# round-trip the images through a Hugging Face Dataset (this mirrors what fashion-clip does internally)
def pass_images_through_data(images):
    dataset = Dataset.from_dict({'image': images})
    images = dataset['image']
    return images

def get_image_embeddings_with_dataset(images):
    images = pass_images_through_data(images)
    return get_image_embeddings_without_dataset(images)

# `images` is assumed to be a list of PIL.Image objects loaded beforehand
hf_ds_embeddings = get_image_embeddings_with_dataset(images)
hf_wo_embeddings = get_image_embeddings_without_dataset(images)

fclip = FashionCLIP('fashion-clip')
fc_embeddings = fclip.encode_images(images, batch_size=32)  # batch_size value is arbitrary

In the above code, the embeddings produced by passing the images through a Dataset, hf_ds_embeddings, are the same as those produced by this repo, fc_embeddings. The embeddings produced without using a Dataset, hf_wo_embeddings, are slightly different.
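
A quick way to quantify the gap (a sketch, assuming the variables from the snippet above are still in scope):

import numpy as np

# the Dataset-based embeddings match the fashion-clip wrapper
print(np.abs(hf_ds_embeddings - fc_embeddings).max())      # expected to be ~0
# the embeddings computed without the Dataset differ slightly
print(np.abs(hf_ds_embeddings - hf_wo_embeddings).max())   # small but non-zero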

I imagine that putting the images into the dataset is implicitly applying some transformation or pre-processing.
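
One way to check whether the round trip through a Dataset changes the pixel data itself; a minimal sketch, assuming `images` is a list of PIL.Image objects:

import numpy as np
from datasets import Dataset

ds = Dataset.from_dict({'image': images})
for original, roundtripped in zip(images, ds['image']):
    # if these differ, the Dataset's Image feature is re-encoding or converting
    # the images on the way through, which would explain the embedding gap
    print(np.array_equal(np.array(original), np.array(roundtripped)),
          original.mode, roundtripped.mode)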

Just wanted to flag this, thanks!

@vinid
Collaborator

vinid commented Jun 28, 2023

I am surprised, because both methods seem to use the same transformation, but I'll take a look! Thanks!!

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

This looks like a similar issue:

import requests
from PIL import Image
from io import BytesIO
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("patrickjohncyh/fashion-clip")
processor = CLIPProcessor.from_pretrained("patrickjohncyh/fashion-clip")

image = requests.get('https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg').content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  
print(probs)
# `fclip` is assumed to be an already-initialised FashionCLIP('fashion-clip') wrapper
test_captions = ['drawstring waist', 'paperbag waist', 'waist band']
test_img_path = 'paperbag_waist.jpg'
# display_images([test_img_path])
fclip.zero_shot_classification([test_img_path], test_captions)

The probs generated here and by the Hugging Face hosted inference UI seem to be different: https://huggingface.co/patrickjohncyh/fashion-clip. I believe both should ideally output the same probabilities for the same input image? Are they both using the latest v2 models?

Both of the above methods wrongly classify the image as 'drawstring waist', but it is correctly identified by the HF hosted inference API.

[Screenshot: hf_fashion_clip]

@vinid
Collaborator

vinid commented Jul 4, 2023

Hi @anilsathyan7!

I am not sure how the UI computes the score; in the meantime, I have run your example on both the original HF API and our internal wrapper and the results are more or less the same. Take a look:

# reusing the imports, model and processor from the snippet above;
# `fclip` further down is the FashionCLIP('fashion-clip') wrapper
img_url = "https://sc04.alicdn.com/kf/Ha258d067f6ff4af687a73b1b18b07333w/233027149/Ha258d067f6ff4af687a73b1b18b07333w.jpg"
image = requests.get(img_url).content
image = Image.open(BytesIO(image))

inputs = processor(text=['paperbag waist', 'waist band', 'drawstring waist'],
                   images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image  # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1)  
print(probs)

>>> [0.1976, 0.0051, 0.7973]

import numpy as np
import torch

test_captions = ['paperbag waist', 'waist band', 'drawstring waist']
test_img_path = 'paperbag_waist.jpg'

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)

# note that we need to include logit scaling to get the same output the default Hugging Face model gives us
logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.1976, 0.0051, 0.7972]

These are reasonably similar scores.

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

@vinid OK, that's strange. The hosted API clearly classifies the image as 'paperbag waist' with probability 0.943. That's a large difference, and the Hosted Inference API output is actually the correct one. What could be the reason for this?

@vinid
Collaborator

vinid commented Jul 4, 2023

It's an effect of prompting: by default the pipeline component (which the UI uses) wraps each label in the template "This is a photo of {}." See here.

# same setup as above: fclip wrapper, numpy as np, torch
test_img_path = 'paperbag_waist.jpg'
test_captions = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']

images = [test_img_path]
texts = test_captions

# we create image embeddings and text embeddings
image_embeddings = fclip.encode_images(images, batch_size=32)
text_embeddings = fclip.encode_text(texts, batch_size=32)

# we normalize the embeddings to unit norm (so that we can use dot product instead of cosine similarity to do comparisons)
image_embeddings = image_embeddings/np.linalg.norm(image_embeddings, ord=2, axis=-1, keepdims=True)
text_embeddings = text_embeddings/np.linalg.norm(text_embeddings, ord=2, axis=-1, keepdims=True)


logit_scaling = fclip.model.logit_scale.exp().item()
torch.tensor(image_embeddings.dot(text_embeddings.T) * logit_scaling).softmax(dim=1)

>>> [0.6159, 0.0288, 0.3552]

(You have some typos in your screenshot; you should remove the stray ' characters.)
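
For reference, the UI's pipeline component applies that template to every label before scoring; a minimal sketch of invoking the transformers zero-shot pipeline directly (the hypothesis_template value shown is the pipeline's documented default, spelled out here for clarity):

from transformers import pipeline

classifier = pipeline("zero-shot-image-classification",
                      model="patrickjohncyh/fashion-clip")

# each candidate label is wrapped in the hypothesis template before scoring,
# which is why the widget's probabilities differ from scoring the bare labels
print(classifier("paperbag_waist.jpg",
                 candidate_labels=["paperbag waist", "waist band", "drawstring waist"],
                 hypothesis_template="This is a photo of {}."))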

@anilsathyan7

anilsathyan7 commented Jul 4, 2023

@vinid Thanks a lot!
Even just changing the full stop in the caption completely changes the result.
Prompt engineering! 😅
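
A quick way to see that sensitivity, as a sketch reusing `fclip` and `test_img_path` from above:

with_period = ['This is a photo of paperbag waist.', 'This is a photo of waist band.', 'This is a photo of drawstring waist.']
without_period = [caption.rstrip('.') for caption in with_period]

# only the trailing full stop changes, yet the prediction can come out differently
print(fclip.zero_shot_classification([test_img_path], with_period))
print(fclip.zero_shot_classification([test_img_path], without_period))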

vinid added the bug (Something isn't working) label on Jul 11, 2023
@dalphajw

Great find! I was just thinking the same thing and was pleasantly surprised to stumble onto this insightful thread.
In my time using FashionCLIP, I did find that the "photo of" trick works quite well, but I didn't know it was the reason for the discrepancy. Thanks all!
