read_video error for slightly large videos when extracting S3D features. #90

Open
divineSix opened this issue Jan 6, 2023 · 6 comments

divineSix commented Jan 6, 2023

I was trying to extract S3D features from a video (~51 MB, ~11 min) and was getting an error at the very start of the extraction process, with the console message `Killed`.

This happens because extract-S3D.py uses read_video from torchvision.io.video to process the video file. I tried executing only that statement separately and hit the same issue. However, I was able to process a smaller video file (<1 MB, ~5 s), and feature extraction then proceeded without a hitch; the same goes for the samples provided in the repo. The issue is not present in the I3D feature extraction, probably because there you use the VideoCapture methods from OpenCV?

I'm trying to see if some other video reader works for this, but I am unsure if read_video applies any transforms before outputting the RGB torch array mentioned in the code. Can you suggest any workaround if this doesn't work?
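
For what it's worth, a quick way to inspect what read_video actually returns (a minimal sketch based on my understanding of the torchvision API, not on anything in this repo; the file name is a placeholder):

    from torchvision.io.video import read_video

    # returns decoded frames, audio samples, and a metadata dict
    rgb, audio, info = read_video('sample.mp4', pts_unit='sec')
    print(rgb.shape, rgb.dtype)  # (num_frames, H, W, 3), torch.uint8
    print(info)                  # e.g. {'video_fps': ..., 'audio_fps': ...}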

The torchvision version in my environment is 0.12.0 and omegaconf is 2.1.1, as described.

EDIT: I've tried extracting the features for the video I had issues with in the S3D Colab notebook, but the kernel crashes there as well.

divineSix commented Jan 6, 2023

I've tried using read_video in a brand-new environment with the latest torchvision and av modules installed, and I'm facing the same issue. There seems to be an open issue in the torchvision repo regarding this as well, although I'm not sure of the details.

My video has ~13k frames, and I'm wondering if the problem is that the code loads all 13k frames into CPU/GPU memory at once. I'm new to this field entirely, so please do let me know if I'm missing something.
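
As a rough back-of-the-envelope check (the resolution is an assumption on my part, since I haven't stated the video dimensions), holding all decoded frames in memory at once would already be tens of gigabytes:

    # rough RAM estimate for keeping every decoded frame in memory at once (uint8 RGB)
    frames, height, width, channels = 13_000, 720, 1280, 3  # ~13k frames, assumed 720p
    total_bytes = frames * height * width * channels
    print(f'{total_bytes / 1e9:.1f} GB')  # ~35.9 GB, before any model inference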

v-iashin (Owner) commented Jan 7, 2023

I am quite sure the issue is caused by running out of RAM. You can confirm it by monitoring RAM usage as you run the script on your video.

The reason it works with OpenCV is the way it loads the video: in contrast to torchvision, which tries to read the whole video into RAM, OpenCV reads frames one by one; features are extracted from chunks of frames, and each chunk is discarded once its features have been computed.

My suggestion is to split your long video into small pieces with ffmpeg.
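
For example, a minimal sketch of splitting a video into fixed-length segments by calling ffmpeg from Python (the 60-second segment length and file names are placeholders; adjust to your setup):

    import subprocess

    def split_video(video_path: str, out_pattern: str = 'chunk_%03d.mp4', segment_seconds: int = 60) -> None:
        # stream-copy the input into consecutive segments of roughly segment_seconds each
        subprocess.run([
            'ffmpeg', '-i', video_path,
            '-c', 'copy', '-map', '0',
            '-f', 'segment', '-segment_time', str(segment_seconds),
            '-reset_timestamps', '1',
            out_pattern,
        ], check=True)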

I do admit that such a difference between readers is confusing and limits applications. However, it ensures that the feature extraction process matches the one that was used during training.

mrkstt commented Sep 11, 2023

> I was trying to extract S3D features on a video (~51MB, 11 mins), and was getting an error at the very start of the extraction process, with a console message Killed. [...]

Can anyone share a success story of running this, especially the hardware configuration (probably RAM and GPU memory size)?
I also faced the same problem, with a GTX 1050 Ti. @v-iashin @divineSix @borijang

GunjanDhanuka commented:

You can use the OpenCV video reader instead of the torchvision video reader; that seemed to fix the issue in my case.

    # imports needed for this snippet (normally at the top of the file)
    import cv2
    import numpy as np
    import torch

    # rgb_vid, audio, info = read_video(video_path, pts_unit='sec')

    print("Video reading started")
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    rgb_stack = []

    while cap.isOpened():
        frame_exists, rgb = cap.read()

        if frame_exists:
            # OpenCV decodes frames as BGR; convert to RGB to match read_video
            rgb = cv2.cvtColor(rgb, cv2.COLOR_BGR2RGB)
            rgb_stack.append(rgb)
        else:
            # end of the video: release the capture and stop reading
            cap.release()
            break

    # stack all frames into a single uint8 tensor of shape (num_frames, H, W, 3)
    rgb1 = torch.tensor(np.array(rgb_stack))

GunjanDhanuka commented:

I am using it to extract features from the XD-Violence dataset. I compared the numpy arrays (using np.array_equal) after getting the features from both cv2 and read_video, and the result was True.

v-iashin (Owner) commented Feb 15, 2024

> I compared the numpy arrays (using np.array_equal) after getting the features from both cv2 and read_video and the result was True.

Ok, that's great to know.

However, I think the suggested code won't work if you have thousands of frames.

The code above needs to be updated to process frames in chunks and release each chunk once its features have been extracted, to free up memory.

It should be similar to how it is done for I3D:

    batch_feats_dict = self.run_on_a_stack(rgb_stack, stack_counter, padder)
    for stream in self.streams:
        feats_dict[stream].extend(batch_feats_dict[stream].tolist())
    # leaving the elements if step_size < stack_size so they will not be loaded again
    # if step_size == stack_size one element is left because the flow between the last element
    # in the prev list and the first element in the current list
    rgb_stack = rgb_stack[self.step_size:]

If read_video and cv2 output comparable features, one could use cv2 frame-by-frame reading as it is done for i3d.
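
A rough sketch of what that could look like (the names here, e.g. run_s3d_on_stack, are illustrative rather than the repo's actual API; the stack and step sizes are placeholders):

    import cv2
    import numpy as np
    import torch

    def extract_in_chunks(video_path, run_s3d_on_stack, stack_size=64, step_size=64):
        # read frames one by one with OpenCV, run the model whenever a chunk is full,
        # then drop the processed frames so RAM usage stays bounded
        cap = cv2.VideoCapture(video_path)
        rgb_stack, feats = [], []
        while cap.isOpened():
            frame_exists, frame = cap.read()
            if not frame_exists:
                break
            rgb_stack.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if len(rgb_stack) >= stack_size:
                chunk = torch.from_numpy(np.array(rgb_stack[:stack_size]))
                feats.append(run_s3d_on_stack(chunk))
                # keep the trailing frames (if step_size < stack_size) so they are not re-read
                rgb_stack = rgb_stack[step_size:]
        cap.release()
        return feats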
