
Separate hot counters #6

Open
fsfod wants to merge 8 commits into base: v2.1

Conversation


@fsfod fsfod commented Dec 6, 2018

This PR tries to make performance more consistent by switching the hot counters used by functions and loops from being shared in a small hash table to having a separate counter for each loop or function. This makes which functions or loops are considered hot enough to start a trace depend consistently on how often they are called, instead of being somewhat random based on collisions in the hash table that change each time the process is run. For small benchmarks this effect is rarely seen, but in larger Lua code bases we've seen performance oscillate from one process execution to the next by up to 20%, and the oscillation stopped once we made the hot counters separate.

Normally the 16-bit hot counters for functions and loops live in a small 64-slot hash table keyed by the address of the bytecode doing the hot counting, which is either the first bytecode of a function or one of the loop bytecodes BC_LOOP, BC_FORL, BC_ITERL. This PR moves the hot count for functions to a field at the end of the function prototype (GCproto) struct, and the hot count for loops to a new dummy bytecode (BC_LOOPHC) that is emitted after the loop bytecodes listed above.
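
As a rough sketch of the layout this describes (the typedef and field names are assumptions for illustration, not copied from the patch):

```c
#include <stdint.h>

/* Sketch only: where the counters end up under this PR, with assumed names. */
typedef uint16_t HotCount;

typedef struct GCproto {
  /* ... existing GCproto fields ... */
  HotCount hotcount;  /* per-function counter at the end of the struct, so it
                         sits directly before the function's bytecode in memory */
} GCproto;

/* Loops: a dummy BC_LOOPHC instruction is emitted right after BC_LOOP,
   BC_FORL or BC_ITERL; its 16-bit D operand holds that loop's counter. */
```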

Function hot counters have been moved from the shared hash table into the
function prototype object, which is only shared between the closures of a
function. The function hot counter field was added to the end of the GCproto
struct so that it sits right next to the function header bytecode in memory,
which is what decrements the counter.

Since the function hot counters are now spread out across the function proto
objects, this increases the number of dirty cache lines that eventually need
to be flushed back to memory, compared to only the two cache lines of the
shared hotcount hash table; but these writes stop once the function is JIT-compiled.

Updated the interpreter assembly for the Lua function header bytecodes to use
the new hot counter location.
Added a separate JIT param for the function hot trigger count, currently set to
the same effective value as under the shared hotcount table system: 112 calls.
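
Continuing the sketch above, a C-level approximation of what the function header check now amounts to (the actual change is in the interpreter assembly, and the helper and parameter names here are made up for the sketch):

```c
/* Hypothetical helper, not the patch's real name. */
static void start_function_trace(GCproto *pt);   /* hand the hot function to the JIT */

/* C-level equivalent of the hot check done by the Lua function header
   bytecodes; the counter now lives in the prototype itself. */
static void function_header_hotcheck(GCproto *pt, HotCount hotfunc_trigger)
{
  if (--pt->hotcount == 0) {          /* counter sits right before the bytecode */
    pt->hotcount = hotfunc_trigger;   /* reset to the trigger count (~112 calls) */
    start_function_trace(pt);
  }
}
```
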
…unters

The loop hot count is stored in the 16-bit D part of the new opcode, which is emitted
directly after BC_LOOP, BC_FORL or BC_ITERL, depending on what kind of loop is being
generated. These three opcodes do the hot counting for loops, so the hot count should
sit right next to them in memory (see the sketch below).

Update the embedded bytecode in buildvm_libbc.h for the bytecode changes resulting from adding BC_LOOPHC.
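
For illustration only, how a loop bytecode could reach and decrement the counter stored in the trailing BC_LOOPHC instruction. The D operand occupying the upper 16 bits follows the usual LuaJIT instruction encoding; everything else here is assumed rather than taken from the patch:

```c
#include <stdint.h>

/* Sketch: pc points at the BC_LOOP/BC_FORL/BC_ITERL instruction; the
   BC_LOOPHC emitted right after it carries the hot count in its D operand,
   so the counter shares a cache line with the loop bytecode itself. */
static int loop_hotcheck(uint32_t *pc)
{
  uint32_t *hc = pc + 1;                      /* trailing BC_LOOPHC slot      */
  uint16_t count = (uint16_t)(*hc >> 16);     /* D operand = 16-bit hot count */
  *hc = (*hc & 0xffffu) | ((uint32_t)(count - 1) << 16);  /* decrement by one */
  return count == 1;                          /* hot once the count hits zero */
}
```
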
…ake them JIT params

Make the loop hot counters decrement by one instead of two, for consistency, now that
the counter backoff scaling values are no longer shared with functions.
Use some non-default max trace attempt counts for loops and functions in the
unit tests to exercise the penaltymaxfunc and penaltymaxloop JIT parameters.
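
A heavily simplified sketch of what separate backoff caps look like; penaltymaxfunc and penaltymaxloop are the parameter names from this commit, while the doubling and blacklist step are only an approximation of LuaJIT's penalty scheme, not the patch itself:

```c
#include <stdint.h>

/* After each aborted trace attempt the hot count is reset with an increasing
   back-off; once the number of attempts for a function or loop exceeds its
   cap (penaltymaxfunc or penaltymaxloop), it is blacklisted.  Approximation. */
static uint32_t next_hotcount_after_abort(uint32_t attempts, uint32_t prev_penalty,
                                          uint32_t penaltymax, int *blacklist)
{
  *blacklist = (attempts >= penaltymax);  /* too many failed attempts: give up */
  return prev_penalty << 1;               /* otherwise wait roughly twice as long */
}
```
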
…other arch builds

Add BC_LOOPHC under a define in lj_bc.h and lj_record.c.
Don't embed the hot counter bytecode if not built with separate hot counters.
Scale penalty values by 2 when separate counters are disabled.
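
A sketch of the build-time switch this describes; LUAJIT_NO_SEPARATE_COUNTERS is the flag the author mentions below, the other macro names (and the x86/x64 detection) are placeholders:

```c
/* Placeholder macro names; only LUAJIT_NO_SEPARATE_COUNTERS comes from the PR. */
#if !defined(LUAJIT_NO_SEPARATE_COUNTERS) && (defined(__i386__) || defined(__x86_64__))
#define LJ_SEPARATE_HOTCOUNTERS 1  /* per-prototype counters, BC_LOOPHC emitted  */
#define HOTCOUNT_PENALTY_SCALE  1  /* loop counters decrement by one             */
#else
#define LJ_SEPARATE_HOTCOUNTERS 0  /* fall back to the shared hashed table       */
#define HOTCOUNT_PENALTY_SCALE  2  /* penalties scaled by 2 (loops decrement by two) */
#endif
```
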
@javierguerragiraldez
Member

Nice! At first glance it seems to be controlled by a compile flag, right?

@fsfod
Author

fsfod commented Dec 6, 2018

Well, it's on by default for x86/x64 but it can be disabled with LUAJIT_NO_SEPARATE_COUNTERS. The hot counter tests should still pass with it disabled.

src/lj_record.c (Outdated)

      lj_trace_flush(J, lnk);  /* Flush trace that only returns. */
      /* Set a small, pseudo-random hotcount for a quick retry of JFUNC*. */
-     hotcount_set(J2GG(J), J->pc+1, LJ_PRNG_BITS(J, 4));
+     hotcount_set_pt(J2GG(J), pt, J->pc, LJ_PRNG_BITS(J, 4));


indentation is weird here.

Author


Well, it changed from a tab to spaces, but the indent is the same if you set the tab size to 8 like LuaJIT uses. If there were an .editorconfig in the repo, GitHub would display this better.

@@ -5,3 +5,4 @@ gc64_slot_revival.lua
 phi
 snap.lua
 stitch.lua
+hotcounters.lua


missing newline at end of file

Author


fixed

@lukego

lukego commented Dec 13, 2018

This seems like an excellent change to me. Being able to exclude hot-counter table collisions as the root cause during "WTF?" performance troubleshooting would be a significant productivity win. I don't think the hash table has ever directly caused a performance problem that I have investigated, but even so it has been a productivity drag to have to consider the possibility and try to rule it out.

Great work!

@fsfod
Author

fsfod commented Dec 13, 2018

We only saw the big performance swings in a few of our benchmarks. If LuaJIT has trouble tracing the code, though, the effect seems to be more pronounced. The random collisions do sometimes cause different traces that make things faster, but it's not really something you can control.

@lukego

lukego commented Dec 13, 2018

I would be enthusiastic about this change even if it doesn't impact any benchmarks. This should make users more productive because it means one less possibility to consider/exclude when looking for the root cause of surprising performance problems.

@lukego

lukego commented Dec 13, 2018

The corollary is that, since this feature is only implemented and enabled for x86 platforms, it might mean one more possibility to consider when looking for the root cause of different JIT behaviour on different platforms.

@Ravenslofty

Ravenslofty commented Feb 1, 2019

As an outsider, is anything going to be done with this pull request (@lukego is this in RaptorJIT?), or is it just going to languish here?

@lukego

lukego commented Feb 3, 2019

@ZirconiumX RaptorJIT will adopt this change. The easiest way to do that is probably to fold it into the ongoing port of the VM from assembler to C though (raptorjit/raptorjit#199). Could also merge it more quickly, but my testing at raptorjit/raptorjit#56 makes me think this change is not super urgent (but definitely important.)

@javierguerragiraldez
Member

Mea culpa. I hope to carve some time this week to apply this and also some fixes from upstream.

@alexshpilkin

Huh... I seem to recall Mike Pall had an explanation why he didn't originally do it this way (in short: cache locality; the hash table is designed to fit into a single cache line). Has anyone measured the effect of this change?

@lukego

lukego commented Feb 10, 2019

> Has anyone measured the effect of this change?

This is a good question: it makes sense to know the cost of a change.

But I also think you need to decide up front what you are optimizing for. Is it more important to make the JIT more predictable or to make the interpreter faster?

The RaptorJIT perspective is that making compilation predictable to application developers is the number one priority. If you understand the JIT then your code will always compile and you won't use the interpreter anyway.

This is why we are enthusiastic about purging all probabilistic algorithms from the JIT heuristics. We see it as a major time suck, when troubleshooting performance problems, to have to consider that maybe there is no root cause in the source code and the problem is just dumb bad luck. So we'd want this change even if it did add a cache miss here and there during warmup, because that's not our pain point.

@alexshpilkin

alexshpilkin commented Feb 10, 2019

@lukego wrote:

> > Has anyone measured the effect of this change?
>
> This is a good question: it makes sense to know the cost of a change.
>
> But I also think you need to decide up front what you are optimizing for. Is it more important to make the JIT more predictable or to make the interpreter faster?
>
> The RaptorJIT perspective is that making compilation predictable to application developers is the number one priority. If you understand the JIT then your code will always compile and you won't use the interpreter anyway.
>
> This is why we are enthusiastic about purging all probabilistic algorithms from the JIT heuristics. We see it as a major time suck, when troubleshooting performance problems, to have to consider that maybe there is no root cause in the source code and the problem is just dumb bad luck. So we'd want this change even if it did add a cache miss here and there during warmup, because that's not our pain point.

Well... I guess. I initially wanted to write that, with JIT compilation, the behaviour is in any case predictable only for the same input (data, I/O timings, etc.), and the hash table doesn’t in itself introduce any genuine randomness. But I guess the argument still works if you replace “predictable” by “locally predictable”, as in: the JIT behaviour of a loop should not depend on literally everything else in the program.

(Might still make sense to bunch the counters together somehow, like in a global or per-file table, but I have no actual idea.)

However, my original comment was less implicit disapproval and more genuine curiosity, as the original reasoning—“because cache locality”—is one of those things that sound eminently sensible, but could always turn out to be utterly negligible in practice. So I think it’d be actually very interesting to know how it turns out (and I suck at benchmarking).

P.S. I don’t know how, but for some reason I couldn’t originally find the description of the hash table despite it being literally two browser tabs over. Here it is:

> Hashed profile counters: Bytecode instructions to trigger the start of a hot trace use low-overhead hashed profiling counters. The profile is imprecise because collisions are ignored. The hash table is kept very small to reduce D-cache impact (only two hot cache lines). Since [natural-loop first region selection] weeds out most false positives, this doesn't deteriorate hot trace detection.
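
For reference, the upstream scheme that passage describes boils down to something like the following (paraphrased from lj_dispatch.h; check the source for the exact macros):

```c
#include <stdint.h>

/* Upstream LuaJIT: one small shared table of 16-bit counters (64 * 2 bytes
   = 128 bytes, i.e. the "two hot cache lines" mentioned in the quote). */
typedef uint16_t HotCount;
#define HOTCOUNT_SIZE 64

/* The slot is derived from a few bits of the counting bytecode's address,
   so unrelated functions and loops can hash to the same counter. */
#define hotcount_slot(pc)  ((((uintptr_t)(pc)) >> 2) & (HOTCOUNT_SIZE - 1))
```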

@javierguerragiraldez
Member

I guess cache locality is a factor against a bigger hash table, but couldn't this have even better cache use? If I recall correctly, this puts the counters in the bytecode, which is definitely in cache while it's being interpreted, and doesn't try to put entries relevant to other, far-distant parts of the code all together.

In the end, benchmarks trump any amount of guessing.

@fsfod
Author

fsfod commented Feb 10, 2019

Yeah, I put the counter value for functions right next to the function header entry bytecode in memory, and for loops I put the counter value after the loop hot-counting instructions FORL, ITERL and LOOP.
There will still be some extra cost when the CPU has to write the cache line holding the counter back to memory, a write that was not there before.

@lukego

lukego commented Feb 10, 2019

> but could always turn out to be utterly negligible in practice

Just indulging in a bit of napkin math...

Suppose your program has 1000 LOOP/FUNC bytecodes with hotcounts, and each one is executed on average 100 times before being JITed or blacklisted, and each hotcount update costs you 10 cycles. Then the total cost during warmup would be one million CPU cycles. On a 2GHz CPU that would be 500 microseconds of delay over the lifetime of the program.
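
Restating that estimate as code, just to make the units explicit (the inputs are the assumptions from the paragraph above, nothing measured):

```c
/* Napkin math: 1000 counters * 100 executions * 10 cycles = 1e6 cycles;
   at 2 GHz that is 5e-4 s = 500 microseconds of total warmup cost. */
static double warmup_cost_seconds(void)
{
  double counters   = 1000.0;  /* LOOP/FUNC bytecodes with hot counts     */
  double executions = 100.0;   /* average executions before JIT/blacklist */
  double cycles     = 10.0;    /* assumed cost of one hotcount update     */
  double cpu_hz     = 2e9;     /* 2 GHz                                   */
  return counters * executions * cycles / cpu_hz;
}
```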
