
libhb: Improve buffer pool sizes #5232

Open · wants to merge 1 commit into master from patch-1
Conversation

@stickz (Contributor) commented Jul 2, 2023

  • Increased the size of the first pool to 1 << 12 and raised its element count to 512.
  • Increased the element count of the 1 << 13 and 1 << 14 pools from 32 to 64.
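The sizing the two bullets describe can be sketched as a lookup from buffer size to element count. This is an illustrative sketch, not libhb's actual fifo.c table; the helper name is hypothetical.

```c
#include <stddef.h>

/* Hypothetical sketch of the sizing described above; the real table
 * lives in libhb's fifo.c and may differ. Element counts follow the
 * two bullets: 512 elements for the first (1 << 12) pool, 64 for the
 * 1 << 13 and 1 << 14 pools, and the previous 32 everywhere else. */
static int pool_element_count(size_t buffer_size)
{
    if (buffer_size <= ((size_t)1 << 12)) return 512;
    if (buffer_size <= ((size_t)1 << 14)) return 64;
    return 32;
}
```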

@bradleysepos (Contributor)

Nice. If it works as well as it is clean and concise, seems like a win. I'm looking forward to testing this.

@bradleysepos bradleysepos self-assigned this Jul 2, 2023
@galad87 (Contributor) commented Jul 3, 2023

Are you using the QSV decoder? If not, I wonder how it would reduce GPU usage.

@sr55 (Contributor) commented Jul 3, 2023

Similar to the last patch, this caused a performance regression for me.

1080p Big Buck Bunny.

Before: [20:18:40] work: average encoding speed for job is 44.418655 fps

After: [20:22:03] work: average encoding speed for job is 23.167715 fps

The CPU usage afterwards has higher peaks, but a lower average and much lower base values. It looks like it's "stalling" cores.

@sr55 (Contributor) commented Jul 3, 2023

It seems it can be faster in some cases, but significantly slower in others. Mode 4 FPS is all over the place: it randomly peaks at 2x faster, then hits lows of 2-3x slower, but overall ends up faster. Mode 6 is consistently 2x slower.

There is definitely some kind of stalling going on here.

You may have adjusted too far the other way.

@stickz stickz force-pushed the patch-1 branch 2 times, most recently from 72a7bb8 to 63fde73 Compare July 4, 2023 01:15
@jensdraht1999

I wonder if this has something to do with your respective CPU caches, which are very large on the 12900K (14 MB L2 + 30 MB L3). Would bigger-cache CPUs gain more from this if the pools are bigger?

@galad87 (Contributor) commented Jul 20, 2023

#5257 might change the way the pool is used.

@stickz (Contributor, Author) commented Jul 20, 2023

> #5257 might change the way the pool is used.

Thank you. I will start over with testing the increase in pool buffer sizes, with that pull request incorporated.

@jensdraht1999

@stickz any updates?

@stickz (Contributor, Author) commented Oct 8, 2023

> @stickz any updates?

All fixed. Just had to change 1 number for #5257.

@stickz (Contributor, Author) commented Oct 9, 2023

I made a mistake; it's fixed now, if you want to run the workflow again @sr55.

@sr55 (Contributor) commented Oct 10, 2023

Some really basic initial testing has before/after within margin of error on my 5900X, so whatever issue was showing up before seems to have disappeared. Maybe a slight bias of + a few fps in some cases.

Suspect the impact will be very CPU/memory dependent in cases where it was an issue.

@jensdraht1999

> Some really basic initial testing has before/after within margin of error on my 5900X, so whatever issue was showing up before seems to have disappeared. Maybe a slight bias of + a few fps in some cases.
>
> Suspect the impact will be very CPU/memory dependent in cases where it was an issue.

So mode 6 is now at least as fast as it was (within margin of error)? If yes, we might merge this.

Regardless, can we close this: #2848?

@stickz (Contributor, Author) commented Oct 12, 2023

> Some really basic initial testing has before/after within margin of error on my 5900X, so whatever issue was showing up before seems to have disappeared. Maybe a slight bias of + a few fps in some cases.
>
> Suspect the impact will be very CPU/memory dependent in cases where it was an issue.

Any change with QSV decoding enabled on an Intel CPU? I can do more testing if required to confirm the benefit.

@galad87 (Contributor) commented Oct 13, 2023

Some statistics on how much the fifo is used would be nice, if it's used at all. Enabling HB_BUFFER_DEBUG on fifo.c:37 will provide some info in the activity logs.

@sr55 (Contributor) commented Oct 15, 2023

Haven't had a chance to test further, but it's still on my todo list. I'll also check QSV when I get a moment.

@stickz (Contributor, Author) commented Oct 15, 2023

I increased the size of the first pool to 1 << 12 and raised its element count to 512. The result was an 8% performance increase. The testing condition was SVT-AV1 Mode 4 with QSV decoding. Increasing the other pools from 32 to 64 elements had no impact.

@stickz (Contributor, Author) commented Oct 15, 2023

> Some statistics on how much the fifo is used would be nice, if used at all. Enabling HB_BUFFER_DEBUG on fifo.c:37 will provide some info in the activity logs.

Enabling HB_BUFFER_DEBUG had no impact; nothing was printed in the activity log. I had to rely on testing.

> Not had a chance to test further but is on my todo list still. I'll also check QSV when I get a moment.

@sr55 I would run the workflow again before testing. I had to restructure since it only cares about the first pool.

@galad87 (Contributor) commented Oct 16, 2023

The log level needs to be set to extended (3, I think) for it to print the usage logs.

@stickz (Contributor, Author) commented Oct 22, 2023

> The log level needs to be set to extended (3, I think) to make it print the usage logs.

Alright, I got usage logs. I increased the pools from 1 << 13 (8192) through 1 << 18 (262144) from 32 to 128 elements. It's faster, at a memory cost of an additional ~50 MB. Demand for buffers up to 1 << 12 is exponential; that pool will use as many elements as it's given, which is why speed increased by 8%.

[14:36:20] Freed 512 buffers of size 4096
[14:36:20] Freed 93 buffers of size 8192
[14:36:20] Freed 87 buffers of size 16384
[14:36:20] Freed 59 buffers of size 32768
[14:36:20] Freed 48 buffers of size 65536
[14:36:20] Freed 21 buffers of size 131072
[14:36:20] Freed 31 buffers of size 262144
[14:36:20] Freed 10 buffers of size 524288
[14:36:20] Freed 2 buffers of size 1048576

The first buffer pool fits into the L3 CPU cache. I can't justify going higher than 512 elements, even though it's faster in some situations; it could cause cache misses. Not everyone has a 30 MB L3 CPU cache.
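The ~50 MB figure quoted above can be checked with quick arithmetic: raising the 1 << 13 through 1 << 18 pools from 32 to 128 elements retains 96 extra buffers per pool. The helper below is illustrative, not HandBrake code; the pool sizes are taken from the log output.

```c
#include <stddef.h>

/* Rough check of the ~50 MB figure: 96 extra retained buffers in each
 * of the 1 << 13 .. 1 << 18 pools. Illustrative, not HandBrake code. */
static size_t extra_retained_bytes(void)
{
    size_t total = 0;
    for (int shift = 13; shift <= 18; shift++)
        total += (size_t)(128 - 32) * ((size_t)1 << shift);
    return total;
}
```

96 × (8192 + 16384 + 32768 + 65536 + 131072 + 262144) = 49,545,216 bytes, roughly 47 MiB, which is consistent with the quoted "additional 50mb".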

@galad87 (Contributor) commented Oct 22, 2023

I don't see how that would influence the L3 cache; the only difference is that the buffers are re-allocated in main memory each time, which shouldn't affect the L3 cache.
Are you still testing with QSV decode enabled?

@sr55 (Contributor) commented Oct 22, 2023

Big Buck Bunny, 1080p, no filters, no audio.

(image: benchmark results)

Not really sure why QSV is an outlier here, but it seems to have a fair bit of swing on this laptop. It may be related to the hyperencode, so I don't necessarily think this is anything to be concerned about.

@stickz (Contributor, Author) commented Oct 22, 2023

> Are you still testing with QSV decode enabled?

Yes, I'm testing with QSV decode enabled. There is a big gain in doing so.

> I don't see how that would influence the L3 cache; the only difference is that the buffers are re-allocated in main memory each time, which shouldn't affect the L3 cache.

Okay, this makes sense. I increased the first pool from 512 to 1024 elements. Let's see if this is faster on @sr55's laptop.

> Big Buck Bunny, 1080p, no filters, no audio.
>
> (image: benchmark results)
>
> Not really sure why QSV is an outlier here, but it seems to have a fair bit of swing on this laptop. It may be related to the hyperencode, so I don't necessarily think this is anything to be concerned about.

Do you want to try this again with 1024 elements in the first pool? This should help on your laptop where the fps is lower.

@sr55 (Contributor) commented Oct 22, 2023

Yeah, I can see it having an impact without QSV decode:

(image: benchmark results)

Something is going on with the A370M. It seems to be bottlenecking in some way, so it's not getting any benefit as a result.

Will check with the larger buffer size in a bit.

@sr55 (Contributor) commented Oct 22, 2023

1024 elements doesn't appear to have made any discernible difference.

@stickz (Contributor, Author) commented Oct 22, 2023

Yeah, your laptop has a 45-watt base power; no wonder it's swinging. It can go up to 115 watts based on the "temperature". I think we should test this on a desktop so we don't get these swings. The results are much more consistent.

The 5900X results show a consistent performance gain in every situation. Can someone test this on a desktop with an Intel CPU, with QSV decoding both enabled and disabled? I'm seeing 8% performance gains with QSV decoding on SVT-AV1 Mode 4.

The higher element count is beneficial for QSV decoding, and it apparently makes no discernible difference otherwise.

@galad87 (Contributor) commented Oct 23, 2023

The buffer pool is not used as much as before. Now we keep the buffers that libavcodec or libavfilter provides. Software decoding will use only the smallest pool, to get some hb_buffers to wrap the AVFrames, unless you have filters that use the buffer pool.
I think it's possible to provide a custom buffer pool to libavcodec; I wonder if that would improve memory usage or not.
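For readers unfamiliar with the mechanism being tuned here, a buffer pool along these lines can be sketched as a simple free list: returned buffers are retained for reuse instead of being freed, and a request falls back to the allocator only when the pool is empty. This is an illustrative sketch under those assumptions, not libhb's actual fifo.c implementation; all names are hypothetical.

```c
#include <stdlib.h>

/* Illustrative free-list pool, not libhb's actual implementation.
 * A pool retains up to max_elements idle buffers of one fixed size;
 * pool_get() reuses one if available, else falls back to malloc. */
typedef struct pool {
    size_t buffer_size;   /* size of every buffer in this pool */
    int    max_elements;  /* cap on retained (idle) buffers */
    int    count;         /* idle buffers currently held */
    void **free_list;     /* stack of idle buffers */
} pool;

static int pool_init(pool *p, size_t buffer_size, int max_elements)
{
    p->buffer_size  = buffer_size;
    p->max_elements = max_elements;
    p->count        = 0;
    p->free_list    = malloc(sizeof(void *) * (size_t)max_elements);
    return p->free_list ? 0 : -1;
}

static void *pool_get(pool *p)
{
    if (p->count > 0)
        return p->free_list[--p->count];   /* reuse: no allocation */
    return malloc(p->buffer_size);         /* pool empty: hit the heap */
}

static void pool_put(pool *p, void *buf)
{
    if (p->count < p->max_elements)
        p->free_list[p->count++] = buf;    /* retain for reuse */
    else
        free(buf);                         /* pool full: really free */
}
```

In this model, raising the element count (as the PR does) increases how many buffers survive a put/get cycle without touching the allocator, at the cost of the retained memory.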

@stickz (Contributor, Author) commented Oct 23, 2023

> The buffer pool is not used as much as before. Now we keep the buffers that libavcodec or libavfilter provides. Software decoding will use only the smallest pool, to get some hb_buffers to wrap the AVFrames, unless you have filters that use the buffer pool. I think it's possible to provide a custom buffer pool to libavcodec; I wonder if that would improve memory usage or not.

Okay, give this a test run then, @sr55. It should produce the same results on your 5900X. Memory usage only increased from 1.31 MB to 3.15 MB for the first few pools. Perhaps we could improve memory usage by less than 2 MB if we rewrite some code, but is it really worth it? We have a solution here that improves performance by at least 2-3%. I'm seeing higher...

FYI, the reason two pools have 64 elements is based on testing I performed on 4K content; they needed more elements.

@jensdraht1999

> I think it's possible to provide a custom buffer pool to libavcodec; I wonder if that would improve memory usage or not.

Maybe something like this: https://stackoverflow.com/questions/28339980/ffmpeg-while-decoding-video-is-possible-to-generate-result-to-users-provided?rq=1 ?
