Owlv2 model keeps crashing #30874

Open
preethiseshadri518 opened this issue May 17, 2024 · 2 comments
Labels: Examples, Vision

preethiseshadri518 commented May 17, 2024

I am trying to run OWLv2 (google/owlv2-base-patch16-ensemble) to perform object detection.

I am following the example code for inference, using a Colab notebook with a T4 GPU and transformers version 4.40.2. When I try to run inference, the cell keeps running and eventually crashes with the message: "Your session crashed after using all available RAM." This is surprising because the model is not that large (relatively speaking), and inference on a single image with OWL-ViT (google/owlvit-base-patch32) takes < 0.001 seconds, so I am not sure where the difference is coming from. Here is the code I am running:

import requests
from PIL import Image
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

with torch.no_grad(): # tried with and without this line
    inputs = processor(text=texts, images=image, return_tensors="pt")
    outputs = model(**inputs)

target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

Has anyone run into a similar issue and resolved it? I imagine there is some issue with actually utilizing the GPU, but the same problem does not occur with OWL-ViT using nearly identical code.
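A quick way to confirm where the computation is actually running (a minimal check using standard PyTorch attributes; model and inputs refer to the objects in the snippet above):

# Sanity check: where do the model weights and input tensors live?
# If these print "cpu", inference is running on the CPU; OWLv2 resizes
# inputs to a much larger resolution than OWL-ViT, so CPU inference is
# far slower and more memory-hungry.
print(next(model.parameters()).device)           # e.g. cpu or cuda:0
print({k: v.device for k, v in inputs.items()})  # device of each input tensor

Thanks!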

amyeroberts (Collaborator) commented:

cc @qubvel if you have time :)

amyeroberts added the Examples and Vision labels on May 17, 2024
qubvel (Member) commented May 17, 2024

Hi @preethiseshadri518, thanks for the issue!

I found that your reproduction code is not using the GPU. I have updated it as follows:

import requests
from PIL import Image
import torch
from transformers import Owlv2Processor, Owlv2ForObjectDetection

device = "cuda"

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
texts = [["a photo of a cat", "a photo of a dog"]]

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble", device_map=device)
#                                                                                      ^^^^^^^^^^^^^

inputs = processor(text=texts, images=image, return_tensors="pt").to(device)
#                                                                ^^^^^^^^^^^

with torch.no_grad(): # tried with and without this line
    outputs = model(**inputs)

target_sizes = torch.Tensor([image.size[::-1]])
# Convert outputs (bounding boxes and class logits) to Pascal VOC Format (xmin, ymin, xmax, ymax)
results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.1)
i = 0  # Retrieve predictions for the first image for the corresponding text queries
text = texts[i]
boxes, scores, labels = results[i]["boxes"], results[i]["scores"], results[i]["labels"]

print(boxes, scores, labels)

It works fine both locally and in Colab. I used the following setup:

!pip install -U transformers==4.40.2 accelerate
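Note that passing device_map to from_pretrained relies on the accelerate package, which is why it is installed above. A quick environment check before running (standard torch/transformers calls):

import torch
import transformers

# Confirm the library version and that a GPU is visible to PyTorch.
print(transformers.__version__)       # expect 4.40.2
print(torch.cuda.is_available())      # should be True on a T4 runtime
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4"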

Here are the results; inference takes only ~3-4 GB of GPU RAM:
[Screenshot: detected boxes, scores, and labels printed in Colab, 2024-05-17]

Are you running exactly this script, or is there anything else that could be causing the problem?
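If GPU memory is still a constraint, loading the weights in half precision usually helps. A minimal sketch under that assumption (torch_dtype is a standard from_pretrained argument; half-precision outputs may differ slightly from full precision):

# Lower-memory variant: load the weights in float16 (roughly halves
# the GPU memory needed for the weights).
model = Owlv2ForObjectDetection.from_pretrained(
    "google/owlv2-base-patch16-ensemble",
    device_map="cuda",
    torch_dtype=torch.float16,
)
inputs = processor(text=texts, images=image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].half()  # match the fp16 weights

with torch.no_grad():
    outputs = model(**inputs)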
