Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overzealous inline opitimizations? #1952

Open
oliv3r opened this issue May 9, 2024 · 5 comments
Open

Overzealous inline opitimizations? #1952

oliv3r opened this issue May 9, 2024 · 5 comments

Comments

@oliv3r
Copy link

oliv3r commented May 9, 2024

Versions

  • FTL: 5.25.2

Compiling FTL with a recent GCC and the 'MinSizeRel' CMake target (e.g. -Os) triggers a lot of

FTL-5.25.2/src/database/gravity-db.c:944:20: error: inlining failed in call to 'gravityDB_finalize_client_statements': call is unlikely and code size would grow [-Werror=inline]
  944 | static inline void gravityDB_finalize_client_statements(clientsData *client)
      |                    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/workdir/src/FTL-5.25.2/src/database/gravity-db.c:1252:9: note: called from here
 1252 |         gravityDB_finalize_client_statements(client);
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

like errors.

Here, GCC is basically telling us 'look, you wanted to inline this, but it's less efficient but you are asking to do -Os.

Should we force inline functions at all, or should gcc figure things out by itself with -Os and -O3. Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

@DL6ER
Copy link
Member

DL6ER commented May 9, 2024

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.


*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

@oliv3r
Copy link
Author

oliv3r commented May 10, 2024

Generally the idea is 'do not inline, let the compiler figure it out, it knows best.

No, because "best" is not a well-defined term here. You see that -O3 and -Os define "best" in two very different ways. Pi-hole itself is optimized for speed because that's what matters with any real application these days (space is cheap as it has never been). The only reason where I see justification for -Os is the MCU world where you may really still be restricted to a few KB of program memory.

I beg to differ, but your point does stand. My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is. Also, -Os has an more important impact, which is caching. Caches is still limited in a lot of CPU's because space for cache is expensive. Also being more effective in cache usage does improve speed and performance. Benchmarks would need to prove this of course.

So there certainly is still usage for Os in certain scenario's. However as I said, your point does stand and I do agree with it for the most part. ;)

gravityDB_finalize_client_statements(*client) involves quite some code that'd be unnecessarily be duplicated all over the place without a definition in a common place. Forcing inlining here actually gives a small speed enhancement as this function is called rather often and the mere fact that we inline saves us the costs of one context switch. Even when this isn't much, it is still measurable at runtime. The compiler, however, doesn't know if this function is called very often (without a profiling) and typically decides against inlining a function that has more than like three lines of code (the exact number has changes a few times in the history of gcc) which is not what we want.

I'm more then happy to accept that you know what you are doing :) no need to prove anything :p

At this point, FTL isn't meant to be compiled with -Os and I'd only consider changing proven efficient code if you can describe a real scenario where buying a barely - if at all - measureable reduction in application size at the costs of a measurable slowdown is worthwhile. In the end we are talking about a reduction at most a few KB (in an application in the multi-MB range this is < 0.1%) whereas the expected slowdown due to removing many optimizations which are known to often increase instruction size* will very likely be in the measurable low integer percent range.

And that's perfectly fine! So the answer to my question is 'no, we know what we are doing'. And that's great :) I should have potentially given more context, and that is I was trying to use the CMake target MinSizeRel because for the some parts, (openwrt, alpine, etc) it's the commonly picked target. This lead me down a path of compilation errors, where compilation was no longer possible.

That means, that

set(CMAKE_C_FLAGS_MINSIZEREL "-Os -DNDEBUG")
needs to be either removed, as it's not even valid, or (a bit better imo) is to either disable -Winline for that target (could we put -Wnonline there? or increase the inline warning limit (to which I had no success with so far).

TL;DR: Compiling for size comes at a measurable speed penalty. So far, this isn't a really supported configuration and I would not speak about "overzealous" in this context just because we give the compiler hints it cannot know itself.

Well then you conclude say we should actually remove MinSizeRel!

*) not only our manual inlining instructions are removed, but also function/jump/labels/loops alignment, prefetching of loop arrays, and reordering of block algorithms, which can greatly reduce branch mis-prediction in the exectuable

👍

@DL6ER
Copy link
Member

DL6ER commented May 10, 2024

My accesspoint that has 8 mb of flash and 128 mb of ram running openwrt doesn't have 'space to burn' no matter how cheap it is.

Ah, yes. We do not "support" any embedded devices out of the box so that's why I haven't thought about them. Sure, there are so many different devices out there that not everyone will have a USB port or whatsover with which you could extend storage for the hundreds GBs with only couple of dollars. This context indeed sheds another light on your endeavor - one which I can easily follow.

Well then you conclude say we should actually remove MinSizeRel!

I have no objections. A quick look on git blame on this line reveals it was an external contribution and (assumption ahead!) probably originated from some default template just as the other two (DEBUG and RELEASE).

The Pi-hole team itself solely uses the RELWITHDEBINFO target both with our internal tooling as well as with the recommended compiling steps packed together into build.sh. I'd be ready to say that MinSizeRel was probably tried back when this was added but never since again.

@kocburak
Copy link

Hi, (newbiew here)
Why aren't we using -Ofast instead of -O3 ?

@oliv3r
Copy link
Author

oliv3r commented May 10, 2024

@DL6ER so for me, I'm just building with Release and I'm done; but I suppose you guys have to make a choice here. Or at least have test targets that build them ;)

So I'll leave this issue open and you can close it with whatever solution you feel fits best.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants