
Separate hot counters #6

Open
fsfod wants to merge 8 commits into base: v2.1

Conversation


@fsfod fsfod commented Dec 6, 2018

This PR tries to make performance more consistent by switching the hot counters used by functions and loops from being shared in a small hash table to having a separate counter for each loop or function. This makes which functions or loops are considered hot enough to start a trace depend consistently on how often they are called, instead of being somewhat random based on collisions in the hash table that change each time the process is run. For small benchmarks this effect is rarely seen, but in larger Lua code bases we've seen performance oscillate from one process execution to the next by up to 20%, and the oscillation stopped once we made the hot counters separate.

Normally the 16-bit hot counters for functions and loops live in a small 64-slot hash table keyed by the address of the bytecode doing the hot counting, which is either the first bytecode of a function or one of the loop bytecodes BC_LOOP, BC_FORL, BC_ITERL. This PR moves the hot count for functions to a field at the end of the function prototype (GCproto) struct, and the hot count for loops to a new dummy bytecode (BC_LOOPHC) that is emitted after the loop bytecodes listed above.
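
As a rough sketch of the layout this describes (the typedef and field names are assumptions for illustration, not copied from the patch):

```c
#include <stdint.h>

/* Sketch only: where the counters end up under this PR, with assumed names. */
typedef uint16_t HotCount;

typedef struct GCproto {
  /* ... existing GCproto fields ... */
  HotCount hotcount;  /* per-function counter at the end of the struct, so it
                         sits directly before the function's bytecode in memory */
} GCproto;

/* Loops: a dummy BC_LOOPHC instruction is emitted right after BC_LOOP,
   BC_FORL or BC_ITERL; its 16-bit D operand holds that loop's counter. */
```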

Function hot counters have been moved from the shared hash table into the
function prototype object, which is only shared between the closures of a
function. The function hot counter field was added to the end of the GCproto
struct so that it sits right next to the function header bytecode in memory,
which is what decrements the counter.

Since the function hot counters are now spread out across the function proto
objects, this increases the number of dirty cache lines that eventually need
to be flushed back to memory, compared to only the two cache lines of the
shared hotcount hash table; but these writes stop once the function is JIT-compiled.

Updated the interpreter assembly for the Lua function header bytecodes to use
the new hot counter location.
Added a separate JIT param for the function hot trigger count, currently set to
the same effective value as under the shared hotcount table system: 112 calls.
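
Continuing the sketch above, a C-level approximation of what the function header check now amounts to (the actual change is in the interpreter assembly, and the helper and parameter names here are made up for the sketch):

```c
/* Hypothetical helper, not the patch's real name. */
static void start_function_trace(GCproto *pt);   /* hand the hot function to the JIT */

/* C-level equivalent of the hot check done by the Lua function header
   bytecodes; the counter now lives in the prototype itself. */
static void function_header_hotcheck(GCproto *pt, HotCount hotfunc_trigger)
{
  if (--pt->hotcount == 0) {          /* counter sits right before the bytecode */
    pt->hotcount = hotfunc_trigger;   /* reset to the trigger count (~112 calls) */
    start_function_trace(pt);
  }
}
```
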
…unters

The loop hot count is stored in the 16-bit D part of the new opcode, which is emitted
directly after BC_LOOP, BC_FORL or BC_ITERL, depending on what kind of loop is being
generated. These three opcodes do the hot counting for loops, so the hot count should
sit right next to them in memory (see the sketch below).

Update the embedded bytecode in buildvm_libbc.h for the bytecode changes resulting from adding BC_LOOPHC.
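
For illustration only, how a loop bytecode could reach and decrement the counter stored in the trailing BC_LOOPHC instruction. The D operand occupying the upper 16 bits follows the usual LuaJIT instruction encoding; everything else here is assumed rather than taken from the patch:

```c
#include <stdint.h>

/* Sketch: pc points at the BC_LOOP/BC_FORL/BC_ITERL instruction; the
   BC_LOOPHC emitted right after it carries the hot count in its D operand,
   so the counter shares a cache line with the loop bytecode itself. */
static int loop_hotcheck(uint32_t *pc)
{
  uint32_t *hc = pc + 1;                      /* trailing BC_LOOPHC slot      */
  uint16_t count = (uint16_t)(*hc >> 16);     /* D operand = 16-bit hot count */
  *hc = (*hc & 0xffffu) | ((uint32_t)(count - 1) << 16);  /* decrement by one */
  return count == 1;                          /* hot once the count hits zero */
}
```
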
…ake them JIT params

Make the loop hot counters decrement by one instead of two, for consistency, now that
the counter backoff scaling values are no longer shared with functions.
Use some non-default max trace attempt counts for loops and functions in the
unit tests to exercise the penaltymaxfunc and penaltymaxloop JIT parameters.
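
A heavily simplified sketch of what separate backoff caps look like; penaltymaxfunc and penaltymaxloop are the parameter names from this commit, while the doubling and blacklist step are only an approximation of LuaJIT's penalty scheme, not the patch itself:

```c
#include <stdint.h>

/* After each aborted trace attempt the hot count is reset with an increasing
   back-off; once the number of attempts for a function or loop exceeds its
   cap (penaltymaxfunc or penaltymaxloop), it is blacklisted.  Approximation. */
static uint32_t next_hotcount_after_abort(uint32_t attempts, uint32_t prev_penalty,
                                          uint32_t penaltymax, int *blacklist)
{
  *blacklist = (attempts >= penaltymax);  /* too many failed attempts: give up */
  return prev_penalty << 1;               /* otherwise wait roughly twice as long */
}
```
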
…other arch builds

Add BC_LOOPHC under a define in lj_bc.h and lj_record.c.
Don't embed the hot counter bytecode if not built with separate hot counters.
Scale penalty values by 2 when separate counters are disabled.
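
A sketch of the build-time switch this describes; LUAJIT_NO_SEPARATE_COUNTERS is the flag the author mentions below, the other macro names (and the x86/x64 detection) are placeholders:

```c
/* Placeholder macro names; only LUAJIT_NO_SEPARATE_COUNTERS comes from the PR. */
#if !defined(LUAJIT_NO_SEPARATE_COUNTERS) && (defined(__i386__) || defined(__x86_64__))
#define LJ_SEPARATE_HOTCOUNTERS 1  /* per-prototype counters, BC_LOOPHC emitted  */
#define HOTCOUNT_PENALTY_SCALE  1  /* loop counters decrement by one             */
#else
#define LJ_SEPARATE_HOTCOUNTERS 0  /* fall back to the shared hashed table       */
#define HOTCOUNT_PENALTY_SCALE  2  /* penalties scaled by 2 (loops decrement by two) */
#endif
```
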
@javierguerragiraldez
Member

Nice! At first glance it seems to be controlled by a compile flag, right?

@fsfod
Author

fsfod commented Dec 6, 2018

Well, it's on by default for x86/x64 but it can be disabled with LUAJIT_NO_SEPARATE_COUNTERS. The hot counter tests should still pass with it disabled.

src/lj_record.c (Outdated)

      lj_trace_flush(J, lnk);  /* Flush trace that only returns. */
      /* Set a small, pseudo-random hotcount for a quick retry of JFUNC*. */
-     hotcount_set(J2GG(J), J->pc+1, LJ_PRNG_BITS(J, 4));
+     hotcount_set_pt(J2GG(J), pt, J->pc, LJ_PRNG_BITS(J, 4));


indentation is weird here.

Author


Well, it changed from a tab to spaces, but the indent is the same if you set the tab size to 8 like LuaJIT uses. If there were an .editorconfig in the repo, GitHub would display this better.

@@ -5,3 +5,4 @@ gc64_slot_revival.lua
 phi
 snap.lua
 stitch.lua
+hotcounters.lua


missing newline at end of file

Author


fixed

@lukego

lukego commented Dec 13, 2018

This seems like an excellent change to me. Being able to exclude hot-counter table collisions as the root cause during "WTF?" performance troubleshooting would be a significant productivity win. I don't think the hash table has ever directly caused a performance problem that I have investigated, but even so it has been a productivity drag to have to consider the possibility and try to rule it out.

Great work!

@fsfod
Author

fsfod commented Dec 13, 2018

We only saw the big performance swings in a few of our benchmarks. If LuaJIT has trouble tracing the code, though, the effect seems to be more pronounced. The random collisions do sometimes cause different traces that make things faster, but it's not really something you can control.

@lukego

lukego commented Dec 13, 2018

I would be enthusiastic about this change even if it doesn't impact any benchmarks. This should make users more productive because it means one less possibility to consider/exclude when looking for the root cause of surprising performance problems.

@lukego

lukego commented Dec 13, 2018

The corollary is that, since this feature is only implemented and enabled for x86 platforms, it might mean one more possibility to consider when looking for the root cause of different JIT behaviour on different platforms.

@Ravenslofty

Ravenslofty commented Feb 1, 2019

As an outsider, is anything going to be done with this pull request (@lukego is this in RaptorJIT?), or is it just going to languish here?

@lukego

lukego commented Feb 3, 2019

@ZirconiumX RaptorJIT will adopt this change. The easiest way to do that is probably to fold it into the ongoing port of the VM from assembler to C though (raptorjit/raptorjit#199). Could also merge it more quickly, but my testing at raptorjit/raptorjit#56 makes me think this change is not super urgent (but definitely important.)

@javierguerragiraldez
Member

Mea culpa. I hope to carve some time this week to apply this and also some fixes from upstream.

@alexshpilkin

Huh... I seem to recall Mike Pall had an explanation why he didn't originally do it this way (in short: cache locality; the hash table is designed to fit into a single cache line). Has anyone measured the effect of this change?

@lukego

lukego commented Feb 10, 2019

> Has anyone measured the effect of this change?

This is a good question: it makes sense to know the cost of a change.

But I also think you need to decide up front what you are optimizing for. Is it more important to make the JIT more predictable or to make the interpreter faster?

The RaptorJIT perspective is that making compilation predictable to application developers is the number one priority. If you understand the JIT then your code will always compile and you won't use the interpreter anyway.

This is why we are enthusiastic about purging all probabilistic algorithms from the JIT heuristics. We see it as a major time suck, when troubleshooting performance problems, to have to consider that maybe there is no root cause in the source code and the problem is just dumb bad luck. So we'd want this change even if it did add a cache miss here and there during warmup, because that's not our pain point.

@alexshpilkin

alexshpilkin commented Feb 10, 2019

@lukego wrote:

> > Has anyone measured the effect of this change?
>
> This is a good question: it makes sense to know the cost of a change.
>
> But I also think you need to decide up front what you are optimizing for. Is it more important to make the JIT more predictable or to make the interpreter faster?
>
> The RaptorJIT perspective is that making compilation predictable to application developers is the number one priority. If you understand the JIT then your code will always compile and you won't use the interpreter anyway.
>
> This is why we are enthusiastic about purging all probabilistic algorithms from the JIT heuristics. We see it as a major time suck, when troubleshooting performance problems, to have to consider that maybe there is no root cause in the source code and the problem is just dumb bad luck. So we'd want this change even if it did add a cache miss here and there during warmup, because that's not our pain point.

Well... I guess. I initially wanted to write that, with JIT compilation, the behaviour is in any case predictable only for the same input (data, I/O timings, etc.), and the hash table doesn’t in itself introduce any genuine randomness. But I guess the argument still works if you replace “predictable” by “locally predictable”, as in: the JIT behaviour of a loop should not depend on literally everything else in the program.

(Might still make sense to bunch the counters together somehow, like in a global or per-file table, but I have no actual idea.)

However, my original comment was less implicit disapproval and more genuine curiosity, as the original reasoning—“because cache locality”—is one of those things that sound eminently sensible, but could always turn out to be utterly negligible in practice. So I think it’d be actually very interesting to know how it turns out (and I suck at benchmarking).

P.S. I don’t know how, but for some reason I couldn’t originally find the description of the hash table despite it being literally two browser tabs over. Here it is:

> Hashed profile counters: Bytecode instructions to trigger the start of a hot trace use low-overhead hashed profiling counters. The profile is imprecise because collisions are ignored. The hash table is kept very small to reduce D-cache impact (only two hot cache lines). Since [natural-loop first region selection] weeds out most false positives, this doesn't deteriorate hot trace detection.
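
For reference, the upstream scheme that passage describes boils down to something like the following (paraphrased from lj_dispatch.h; check the source for the exact macros):

```c
#include <stdint.h>

/* Upstream LuaJIT: one small shared table of 16-bit counters (64 * 2 bytes
   = 128 bytes, i.e. the "two hot cache lines" mentioned in the quote). */
typedef uint16_t HotCount;
#define HOTCOUNT_SIZE 64

/* The slot is derived from a few bits of the counting bytecode's address,
   so unrelated functions and loops can hash to the same counter. */
#define hotcount_slot(pc)  ((((uintptr_t)(pc)) >> 2) & (HOTCOUNT_SIZE - 1))
```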

@javierguerragiraldez
Member

I guess cache locality is a factor against a bigger hash table, but couldn't this have even better cache use? If I recall correctly, this puts the counters in the bytecode, which is definitely in cache while it's being interpreted, and doesn't try to put entries relevant to other, far-distant parts of the code all together.

In the end, benchmarks trump any amount of guessing.

@fsfod
Author

fsfod commented Feb 10, 2019

Yeah, I put the counter value for functions right next to the function header entry bytecode in memory, and for loops I put the counter value after the loop hot-counting instructions FORL, ITERL and LOOP.
There will still be some extra cost when the CPU has to write the cache line holding the counter back to memory, a write that was not there before.

@lukego

lukego commented Feb 10, 2019

> but could always turn out to be utterly negligible in practice

Just indulging in a bit of napkin math...

Suppose your program has 1000 LOOP/FUNC bytecodes with hotcounts, and each one is executed on average 100 times before being JITed or blacklisted, and each hotcount update costs you 10 cycles. Then the total cost during warmup would be one million CPU cycles. On a 2GHz CPU that would be 500 microseconds of delay over the lifetime of the program.
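
Restating that estimate as code, just to make the units explicit (the inputs are the assumptions from the paragraph above, nothing measured):

```c
/* Napkin math: 1000 counters * 100 executions * 10 cycles = 1e6 cycles;
   at 2 GHz that is 5e-4 s = 500 microseconds of total warmup cost. */
static double warmup_cost_seconds(void)
{
  double counters   = 1000.0;  /* LOOP/FUNC bytecodes with hot counts     */
  double executions = 100.0;   /* average executions before JIT/blacklist */
  double cycles     = 10.0;    /* assumed cost of one hotcount update     */
  double cpu_hz     = 2e9;     /* 2 GHz                                   */
  return counters * executions * cycles / cpu_hz;
}
```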
