
Do feature vectors for DINOv2 include small objects? #397

Open
smandava98 opened this issue Mar 23, 2024 · 5 comments

@smandava98

smandava98 commented Mar 23, 2024

Hi,

When I visualize the features via PCA I can see small objects, but I'm not sure whether this means the 1024-d feature vectors from DINOv2's ViT-L must encode spatial information about small objects relative to larger objects in the image.

Also, how should I reason about when to use the patch tokens vs. the final embedded vector the model returns? I'm trying to use it to build a video object detection model that predicts accurate bounding boxes over frames.

Currently I just use that final 1024-d vector, but I'm not sure whether I should use the patch tokens instead, as that would be a lot of data if I'm operating on video.

@nourihilscher

For your case I would recommend using the patch tokens. The final class token characterizes the image as a whole, which lets you compare the overall content of two images with each other. The patch tokens characterize the content of each 14x14 image patch. Remember that you can downscale the original images to reduce the overall number of patch tokens (the dimensions have to be multiples of 14, of course). Downscaling reduces image quality, which is bad for segmentation models, but since you are only interested in bounding boxes, it should be fine.
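To make the multiple-of-14 constraint concrete, here is a small sketch; the helper names and the example frame size are mine, not part of DINOv2's API:

```python
# DINOv2's ViT models use 14x14 pixel patches, so input height and width
# must be multiples of 14; each patch becomes one token.

def round_to_patch(size: int, patch: int = 14) -> int:
    """Round a pixel dimension down to the nearest multiple of the patch size."""
    return max(patch, (size // patch) * patch)

def num_patch_tokens(height: int, width: int, patch: int = 14) -> int:
    """Number of patch tokens a height x width input produces."""
    return (height // patch) * (width // patch)

# e.g. downscaling a hypothetical 1280x728 frame by half, then snapping
# each dimension to the patch grid:
h, w = round_to_patch(640), round_to_patch(364)
print(h, w, num_patch_tokens(h, w))  # 630 364 1170
```

Halving the resolution roughly quarters the token count, which is usually the dominant cost when running the backbone per frame.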

I am actually curious how you visualized the final class token using PCA. How did you retrieve an image back from just the class token?

@smandava98
Author

Oh I used the patch tokens for PCA, not the class token.

import torch
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

with torch.no_grad():
    features_dict = dinov2_vitl14.forward_features(imgs_tensor_orig.cuda())
    features = features_dict['x_norm_patchtokens']

batch_size = features.shape[0]
num_tokens = features.shape[1]  # patch tokens per image
features2 = features.reshape(batch_size * num_tokens, 1024).cpu()  # ViT-G is 1536, ViT-L is 1024

pca = PCA(n_components=1)
pca_features = pca.fit_transform(features2)

# Visualize the first PCA component (indexing the PCA projection, not the raw features)
for i in range(batch_size):
    plt.subplot(1, batch_size, i + 1)
    plt.imshow(pca_features[i * num_tokens: (i + 1) * num_tokens, 0].reshape(91, 52))
plt.show()

Is there any benefit for prepending the CLS token to the patch tokens before passing into my model? Or would just the patch tokens suffice?

@nourihilscher

nourihilscher commented Mar 25, 2024

I think the patch tokens should suffice, but I haven't tried including the class token. If you prepend it to your patch embeddings by concatenation, I would assume the eigenvectors PCA projects onto would not change much, as the new additional rows have very low variance.
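For reference, prepending the class token is just a concatenation along the token axis. A minimal sketch with random stand-ins for the DINOv2 outputs (the token count here is arbitrary; the real class token comes back as a (B, D) tensor and needs an extra axis first):

```python
import numpy as np

# Random stand-ins for ViT-L/14 outputs (1024-d tokens):
cls_token = np.random.randn(1, 1024)           # stand-in for x_norm_clstoken, shape (B, D)
patch_tokens = np.random.randn(1, 4732, 1024)  # stand-in for x_norm_patchtokens, shape (B, N, D)

# Give the class token a length-1 token axis, then concatenate in front:
combined = np.concatenate([cls_token[:, None, :], patch_tokens], axis=1)
print(combined.shape)  # (1, 4733, 1024)
```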

(By the way, in addition to downscaling the image, you probably also don't need to process every frame of your video. Every k-th frame may be enough if you smoothly interpolate the positions of your bounding boxes between processed frames.)

@LiAo365

LiAo365 commented May 20, 2024

I have the same question. Judging from the provided segmentation notebook, if the 1024-d image representation can be used for segmentation, it should not be a problem to use it for object detection. However, other works like Grounding-DINO and Video Grounding-DINO do this by obtaining multi-scale features from the image backbone for each frame, so I am also curious whether the 1024-d feature representation is suitable for video object detection.

@LiAo365

LiAo365 commented May 20, 2024

@smandava98
May I ask how well it worked when you used just the 1024-dimensional features?
Thanks!
