Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize NEON encoding for high dynamic range blocks #663

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

Developer-Ecosystem-Engineering
Copy link

We can improve the encoding performance on Apple silicon configurations with the following enhancements

  • Replace the switch with a for loop
  • Invert qlp_factors & remove else conditions as branches not taken are returned out

13% - 17% improvement depending on compiler, configuration and .wav

Clang 16 or GCC 13 necessary

Platform MBP18,2 Mac14,12 AMD Ryzen 9 7900X x 24
Compiler GCC 13 ---- Clang 16 GCC 13 ---- Clang 16 GNU 13.2.1
Original 6.554 ---- 6.176 5.725 ---- 5.377 5.018
Optimized 5.642 ---- 5.131 4.803 ---- 4.672 5
Overall 14% ---- 17% 16% ---- 13% 0

We can improve the encoding performance on Apple silicon configurations with the following enhancements

- Replace the switch with a for loop
- Invert qlp_factors & remove else conditions as branches not taken are returned out

13% - 17% improvement depending on compiler, configuration and .wav

Clang 16 or GCC 13 necessary
@ktmf01
Copy link
Collaborator

ktmf01 commented Feb 26, 2024

What kind of audio did you use to get these numbers?

Perhaps somewhat rude, but I am currently not inclined to accept this PR. This change is not covered by fuzzing, provides a modest improvement for a very niche use. So, I feel like the maintenance burden of this addition is not proportional to the gains.

@Developer-Ecosystem-Engineering
Copy link
Author

Hi @ktmf01

The optimization is designed to improve high bit rate audio. The test WAVs are detailed below. It's all stereo, 24bps, music files across many genres.

This table was against macOS 14.4 + 16 inch MBP M2.

Duration Sample rate Base time (s) Optimized time (s) Improvement (x times)
Sample 1 08:47 44.1 1.843 1.685 1.09
Sample 2 22:35 48 4.856 4.376 1.11
Sample 3 15:36 88.2 8.28 6.537 1.27

This change is not covered by fuzzing, provides a modest improvement for a very niche use.

We'd be happy to review the PR further, including if you'd like additional fuzzing tests. A performance improvement for high res audio seems inline with FLAC use cases, i.e. #669

@ktmf01
Copy link
Collaborator

ktmf01 commented Mar 12, 2024

Please provide some more information on the tests, because I cannot explain the results you're getting. The problem I have with this, is that the function you are optimizing should only be rarely used for 24-bit material, if at all. So, what kind of preset are you using, and what times are you measuring? Is it possible to provide gprof reports with and without this optimization on your hardware?

Okay, nevermind, this function is indeed used for 24-bit material. But why have you made this "NEON-exclusive"? It seems to me this code would work just fine for AMD64 as well and provide a performance benefit.

Considering fuzzing, the problem is not in the code, but in the hardware. Currently fuzzing is not run on ARMv8/v9 hardware, meaning this code is not fuzzed, because it is exclusive for that hardware. If it were not exclusive, fuzzing would not be a problem

@Developer-Ecosystem-Engineering
Copy link
Author

Appreciate the dialog, thank you for taking the time to review and consider these changes!

We made it exclusive because the benefit was much more pronounced on that architecture and flat on others. You are correct that it could be enabled for other architectures and that would also resolve the fuzzing issue.

We'd be happy to modify the PR to not constrain it to only NEON. If there are specific WAVs and presets the project would like to see it tested against let us know! We will update with the results using those inputs and the expanded architecture.

@ktmf01
Copy link
Collaborator

ktmf01 commented May 16, 2024

Okay, nevermind, this function is indeed used for 24-bit material.

While this was not the intent of this PR, thank you for (indirectly) pointing me to this. I've merged two PRs, #700 and #702, which should make flac use the function you optimized, FLAC__lpc_compute_residual_from_qlp_coefficients_limit_residual, way less often.

Could you perhaps rebase this PR against the current state of the repository, and re-evaluate this PR?

@Developer-Ecosystem-Engineering
Copy link
Author

Of course, standby.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants