
Faster video inference script #650

Open · eliphatfs wants to merge 3 commits into master

Conversation

@eliphatfs (Author)

Changes (a combined sketch follows this list):

  1. Moved the final scaling and uint8 quantization to the GPU, reducing CPU and main-memory bandwidth consumption (lines 225-227); 2.5× speed-up.
  2. Instructed FFmpeg to emit RGB frames instead of BGR, so there is no need to swap channels (lines 70 and 148).
  3. Batched inference (controlled by the --batch parameter, default 4), which pushes CUDA GPU utilization to 100%.
  4. Instructed torch to make tensors contiguous after the BCHW -> BHWC transform on the GPU (line 227), so there is no need to copy the buffer before writing to FFmpeg (line 167). Reduced output IO time by 10×.
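
A hedged sketch of how the four changes fit together; `model`, the file names, and the frame geometry below are placeholders, and the actual script in this PR organizes things differently:

```python
import subprocess
import numpy as np
import torch

W, H, SCALE, BATCH = 1920, 1080, 2, 4
FRAME_BYTES = W * H * 3

def read_exact(pipe, n):
    # A pipe read may return fewer bytes than requested, so loop.
    buf = b''
    while len(buf) < n:
        chunk = pipe.read(n - len(buf))
        if not chunk:
            break
        buf += chunk
    return buf

# (2) Ask FFmpeg for rgb24 frames directly, so no BGR<->RGB channel swap.
reader = subprocess.Popen(
    ['ffmpeg', '-i', 'input.mp4', '-f', 'rawvideo', '-pix_fmt', 'rgb24', '-'],
    stdout=subprocess.PIPE)
writer = subprocess.Popen(
    ['ffmpeg', '-y', '-f', 'rawvideo', '-pix_fmt', 'rgb24',
     '-s', f'{W * SCALE}x{H * SCALE}', '-r', '30', '-i', '-', 'output.mp4'],
    stdin=subprocess.PIPE)

while True:
    # (3) Read a batch of frames and run them through the model together.
    raw = read_exact(reader.stdout, FRAME_BYTES * BATCH)
    usable = len(raw) - len(raw) % FRAME_BYTES
    if usable == 0:
        break
    frames = np.frombuffer(raw[:usable], dtype=np.uint8).reshape(-1, H, W, 3)
    x = torch.from_numpy(frames.copy()).cuda()
    x = x.permute(0, 3, 1, 2).half() / 255.0  # BHWC uint8 -> BCHW fp16

    with torch.no_grad():
        y = model(x)  # placeholder: the Real-ESRGAN generator

    # (1) Scale and quantize to uint8 on the GPU, then (4) make the
    # BCHW -> BHWC result contiguous so the bytes can go straight to
    # FFmpeg without another copy.
    out = (y.clamp_(0, 1) * 255.0).round().to(torch.uint8)
    out = out.permute(0, 2, 3, 1).contiguous().cpu().numpy()
    writer.stdin.write(out.tobytes())

writer.stdin.close()
reader.wait()
writer.wait()
```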

The metrics above were measured on a 1920x1080 30 fps anime video. On an AMD R9-5900HX CPU (8 cores, 16 threads) and a 3080 LP (16 GB), in FP16, the processing rate goes from 0.8 fps to 4.6 fps with the optimizations (a 5.75× speed-up), using about 7.6 GB of VRAM. At batch size 2 you still get 4.4 fps (5.5×), which requires only about 4.4 GB of VRAM.

The script is not yet extensively tested (I'm not sure how to go about this; advice welcome), and does not support extracting frames first, face enhancement, or alpha/grayscale images. Frame extraction and face enhancement go through very different workflows, so the optimizations may not be applicable there. Alpha and grayscale should not be an issue for almost all videos people want to process.

See #619, #634, #531.

@DaDaDaDaDaYeah commented Jul 2, 2023

Tested. This really works! Thanks!!

Test results (480p, upscale parameter 2):
From 5-7 fps (original code) to an average of 30 fps, with the same output quality.
GPU 3D usage went from 30% (mostly CPU-bound) to 95-100%.

@tthg119 commented Jul 3, 2023

Hmm, I have no idea why my result stays the same. My run was ESRNet_4xplus, nproc=1, [480x270] -> [1920x1080], no face enhance, no extract-frames-first, on 4× A40 (48 GB VRAM each) with 52 CPU cores and 920 GB RAM. Both the old and the new inference script give ~3 fps.
[Screenshot from 2023-07-03 09-56-55]

Yes, I noticed that the batch approach boosts GPU utilization significantly (around 23 GB and 100% on each GPU, compared to just 4 GB and ~60%). I didn't measure CPU in detail, but htop shows it's about the same. I also tried different models and configs with different batch sizes, but the difference is only ~0.5 fps. Would love some thoughts if possible.

@eliphatfs (Author)

> Hmm, I have no idea why my result stays the same. […] Both the old and the new inference script give ~3 fps. […] Would love some thoughts if possible.

Could you attach the video for some analysis here?

@FNsi commented Jul 13, 2023

25% faster for me!

@FNsi commented Jul 13, 2023

May I ask a question?
Do you know why, without --fp32, the output is white noise? (This also happens on the main branch; AMD ROCm.)

@eliphatfs (Author)

I am running the animevideov3 model without FP32 and the outputs are correct.
Could you please provide more details about your setup?
I don't have ROCm available, and there may be flaws in some of its FP16 APIs, since ROCm is relatively new and not as mature as CUDA. As a suggestion for debugging this yourself: record the output of each layer in the network on the same input in both FP16 and FP32 modes and compare them (a sketch follows below). If all of them are very different, there is probably a problem with ROCm on your hardware; if they only start to diverge after a specific layer, you are likely running into precision issues, and you can't do much about that without changing the model.
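
A minimal sketch of that comparison using forward hooks; `build_model()` is a placeholder for however you construct the network and load its weights, not an API from this repo:

```python
import torch

def capture_layer_outputs(model, x):
    """Forward once, recording every leaf module's output as fp32 on CPU."""
    outputs, hooks = {}, []

    def make_hook(name):
        def hook(module, inputs, out):
            if torch.is_tensor(out):
                outputs[name] = out.detach().float().cpu()
        return hook

    for name, module in model.named_modules():
        if not list(module.children()):  # leaf modules only
            hooks.append(module.register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(x)
    for h in hooks:
        h.remove()
    return outputs

x = torch.rand(1, 3, 64, 64, device='cuda')
m32 = build_model().cuda().float().eval()  # placeholder constructor
m16 = build_model().cuda().half().eval()

out32 = capture_layer_outputs(m32, x.float())
out16 = capture_layer_outputs(m16, x.half())

# Large errors everywhere point at the ROCm FP16 kernels; errors that only
# blow up after one specific layer point at a precision problem.
for name, ref in out32.items():
    if name in out16:
        print(name, (ref - out16[name]).abs().max().item())
```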

@FNsi commented Jul 13, 2023

> I am running the animevideov3 model without FP32 and the outputs are correct.

Sorry, I tried both with and without fp32 and there's no difference: the output is entirely white either way.

> Could you please provide more details about your setup?

Just running python inference_video_fast.py, with or without --fp32, using the general x4v3 model (the tiny denoise one; the master branch works fine with fp32).

@eliphatfs (Author)

This command is working fine on my machine:

```
python inference_realesrgan_video_fast.py --model_name=realesr-general-x4v3 -i "videos\2022-12-24 17-53-30.mp4" -s 2
```

Did I understand your input correctly?

@FNsi commented Jul 13, 2023

> Did I understand your input correctly?

I think you are right. I ran it with -dn 0; I will try again without it.

@eliphatfs (Author)

It also works here with -dn 0.

@FNsi commented Jul 13, 2023

> It also works here with -dn 0.

So I guess I need to debug into it...

Wait, I did use the no-nb_frames video patch, and my input is a webm file. (The original script still works.)

Okay, tested with demo.mp4: it turns out fp16 has some detail and fp32 is just color blocks...

@wacky6 commented Jul 20, 2023

FYI, I observe that torch.compile + channels_last provides a 2× speedup (no tiling, no face enhance, fp16) on an NVIDIA A4000:

```python
self.model = self.model.to(memory_format=torch.channels_last)
self.model = torch.compile(self.model)
```

Might be worth a shot?

I'm not sure how well torch.compile fits with the other features (face enhance, tiling), nor what the end-to-end speedup would be in those cases (tiling and face enhance add CPU overhead).

@eliphatfs (Author)

> FYI, I observe that torch.compile + channels_last provides a 2× speedup (no tiling, no face enhance, fp16) on an NVIDIA A4000. […] Might be worth a shot?

Thanks for your comments! Which model are you using? On my side, using channels_last seems to cut performance in half.
torch.compile is generally helpful for performance, as it generates optimized code at the kernel level.

@wacky6 commented Jul 21, 2023

> Thanks for your comments! Which model are you using? On my side, using channels_last seems to cut performance in half.

The "official" Real-ESRGAN x4. I suspect the channels_last / channels_first gain varies by device? Without channels_last, I get about a 1.5× speedup on the A4000.

@eliphatfs (Author)

pytorch/pytorch#92542
I guess RRDB-based networks and VGG-based networks have different preferences for channel memory formats.
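
For anyone who wants to check their own hardware, a rough micro-benchmark sketch along these lines (`build_model()` and the input shape are placeholders):

```python
import time
import torch

def bench(model, x, n=20):
    # Average latency over n forward passes, synchronizing around the loop.
    torch.cuda.synchronize()
    t0 = time.time()
    with torch.no_grad():
        for _ in range(n):
            model(x)
    torch.cuda.synchronize()
    return (time.time() - t0) / n

x = torch.rand(1, 3, 270, 480, device='cuda').half()
model = build_model().cuda().half().eval()  # placeholder constructor

print('contiguous:   ', bench(model, x))
print('channels_last:', bench(model.to(memory_format=torch.channels_last),
                              x.to(memory_format=torch.channels_last)))
```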

@epistemex

You could also add an option to change FFmpeg's default libx264 encoder to h264_nvenc, which would give an additional performance boost. It requires an FFmpeg build compiled with CUDA support, hence offering it as an option.
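
Something like the following could work; the `--encoder` flag and the surrounding wiring are hypothetical, not part of this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--encoder', default='libx264',
                    help='ffmpeg video encoder; h264_nvenc needs an '
                         'ffmpeg build with CUDA support')
args = parser.parse_args()

# The output side of the ffmpeg pipeline then uses the chosen encoder.
ffmpeg_out = ['ffmpeg', '-y', '-f', 'rawvideo', '-pix_fmt', 'rgb24',
              '-s', '3840x2160', '-r', '30', '-i', '-',
              '-c:v', args.encoder, 'output.mp4']
```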

@aliencaocao

How can I use this on images instead of videos?
