Increase available allocation range in mcode_alloc() #285
Comments
Change mcode_alloc() so that instead of all execution threads using the same allocation range of '[target - range/2; target + range/2]' (where 'range' is the jump range defined by the architecture), make some threads use the '[target - range; target]' range, and other ones use '[target; target + range]' based on a PRNG. The patch essentially doubles the available allocation pool for mcode_alloc() and helps reduce mmap() scalability issues for multi-threaded applications.
This would cause problems when the half of the range that was picked is occupied. This can happen when shared libraries are tightly packed into the virtual address space. I'm more inclined to eventually fix the underlying problem: remove the necessity for limited-range jumps and calls to support code in the LuaJIT core. Replace them with address load+jump/call and/or trampolines when we can't get a convenient mcode allocation. Some architectures need both approaches. This is quite architecture-dependent, though, and would need lots of testing.
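To make the trampoline idea concrete — this is only a sketch of the general technique, not LuaJIT's actual code, and the helper name is hypothetical — an ARM64 emitter might pick between a direct `B` (reach: ±128MB) and an address-load-plus-`BR` sequence when the target is out of range:

```c
#include <stdint.h>

/* Sketch only: emit either a short relative branch or a register-indirect
   "trampoline" sequence, depending on distance. Assumes 48-bit virtual
   addresses; a 4th MOVK would cover the top bits otherwise. */
enum { ARM64_B_RANGE = 1 << 27 };  /* B/BL: signed 26-bit immediate * 4 = +-128MB */

static void emit_branch(uint32_t **mc, uintptr_t from, uintptr_t target)
{
  intptr_t delta = (intptr_t)(target - from);
  if (delta >= -(intptr_t)ARM64_B_RANGE && delta < (intptr_t)ARM64_B_RANGE) {
    *(*mc)++ = 0x14000000u | (((uint32_t)delta >> 2) & 0x03ffffffu);     /* B target */
  } else {
    /* Out of range: materialize the 64-bit address in x17, branch through it. */
    *(*mc)++ = 0xd2800011u | (uint32_t)((target & 0xffff) << 5);          /* MOVZ x17, #lo16 */
    *(*mc)++ = 0xf2a00011u | ((uint32_t)((target >> 16) & 0xffff) << 5);  /* MOVK x17, #..., LSL #16 */
    *(*mc)++ = 0xf2c00011u | ((uint32_t)((target >> 32) & 0xffff) << 5);  /* MOVK x17, #..., LSL #32 */
    *(*mc)++ = 0xd61f0220u;                                               /* BR x17 */
  }
}
```

The cost asymmetry (1 instruction vs. 4) is why the allocator tries so hard to place mcode within direct-branch range in the first place.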
This is an addendum to the fix for LuaJIT#285.
I'd hope for a version without relative jumps, or at least a macro to control this; mcode_alloc() often fails on some Android devices.
I'd like to point out a case on Android ARM64. I'm writing Lua code that transforms thousands of vertices, and these transformations must be performed every game loop (16.67ms on a 60Hz monitor). This means all code called every single frame must be optimized. One particularly heavy math function is called almost a thousand times per update call. Almost every codepath dealing with those transformations is JIT-compilable, but this also means traces are generated at a high rate. Sadly, on Android, this is all a nightmare: almost every piece of code that LuaJIT tries to compile fails with a mcode allocation error. My case is one where the allocation range for mcode_alloc() matters. The tests above were done using LuaJIT-2.1.0-beta3 384d6d5.
It should not regenerate different traces on each call. I had this bug where a function called inside a vararg function was being jit-compiled a shit ton of times, only because it didn't "understand" it was the same function again and again.
Here was the exact bug, in this commit (which is a 100% wanted shit code commit for the sake of the example): https://github.com/ExtReMLapin/md5.lua/commit/5964ccdfb7192b6437883e0080fb6e09b5f6742f

The old working version (right before this commit) produces 6 traces (0.288 secs to execute), while the version with varargs produces 112 traces (0.95 secs to execute).

main.lua:

```lua
local md5 = require("md5")

local time = os.clock()
local i = 0
while (i < 50000) do
  assert(md5.sum("mdrcaca"))
  i = i + 1
end
print(os.clock() - time)
```
@ExtReMLapin Please stop spamming completely unrelated bug reports with whatever is on your mind currently. You're not helping.
Anyone is welcome to submit a patch as described in my comment above: #285 (comment) |
(I think this is the right place to post this, but I'm happy to move it to a different issue if desired.)

My software on macOS+arm64 is hitting something similar to what @MikuAuahDark was describing on Android, which means that until this is fixed or I find some sort of workaround, the arm64 JIT compiler is unfortunately worse than useless for me (if I leave it on, it performs worse than with the JIT turned off).

If I run this simple test script with the LuaJIT standalone executable on an Apple Silicon Mac using the latest commit b4b2dce, it will reproduce the problem most of the time it's run. My full codebase has more C++ code going on, so it can be repro'd 100% of the time there instead of just most of the time. I also made a second test using GLFW instead of SDL2 and had the same issue.

```lua
local ffi = require("ffi")
ffi.cdef[[
int SDL_InitSubSystem(uint32_t flags);
void SDL_QuitSubSystem(uint32_t flags);
typedef struct SDL_Window SDL_Window;
SDL_Window * SDL_CreateWindow(const char *title, int x, int y, int w, int h, uint32_t flags);
void SDL_DestroyWindow(SDL_Window * window);
]]

-- SDL2 needs to be installed where dlopen can find it. Or you can download the
-- macOS framework from libsdl.org, and point this to the binary inside it.
local sdl = ffi.load("SDL2")

local initflags = 0xFFFFFFFF
sdl.SDL_InitSubSystem(initflags)
local window = sdl.SDL_CreateWindow("", 100, 100, 800, 600, 0)

local n = 0
for i=1, 10000 do
  n = n + i
end
print(n)

sdl.SDL_DestroyWindow(window)
sdl.SDL_QuitSubSystem(initflags)
```

My repro steps are:
With
If anyone has fix or workaround suggestions, I'm happy to test them, although I don't think I have the skills or time necessary to make my own changes to LuaJIT's source code.
Is this something someone with little experience in the LuaJIT code base (or similar code bases) could research and learn how to do in a reasonable time (idk, a few months)? If so, where would said person even start learning what's necessary to fix this?
I don't think this is a trivial issue. I think this requires lots of changes to LuaJIT if it ever happens. Quoting Mike Pall above:
Hi, I'm gonna chime in with some recent observations on this, to help others diagnose. I had trouble reproducing this in a testbed environment.
Of course, this is on LuaJIT v2.1 HEAD.

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<

I created this branch https://github.com/folays/LuaJIT/tree/bench_mcode_alloc from the latest v2.1.
So you would have to compile my branch, because you would need the new global function for what follows. Every `% 100` loop iteration, it reports the mcode-region availability. Here are the results:
Each of the 1000 children does:
(Well, let's keep in mind that the problem is not whether those pages are populated or not; the problem is only hoarding the virtual memory regions which could have satisfied mcode_alloc().)

As you can see in the results above, there wasn't really any problem.

Hint: I arrived at 32704 because I did not want my calculations to be erroneous, so instead I opted for a loop over 1 million quick-and-dirty statistical PRNG generations of the possible target addresses, so that I statistically found 99.99% of the possible target addresses in the same way that mcode_alloc() does.

Hint 2: To check the freeness of the regions I made a low-effort probe: I just mmap(requested_address) (without MAP_FIXED) and munmap afterwards, and I figured that if mmap() did not yield the requested_address, it was probably already allocated. I would not put such a thing in prod, but for the diagnosis here the assumption seems okay. Maybe if I were actively trying to malloc() some specific sizes I could have driven the free rate to 0% quicker, but when not actively trying to cause problems, all seems good.

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<

Sadly I'm not able to share the whole code, but I can at least let you know that if the LuaJIT<->Go bridge were doing something outrageous I would have known it, because I'm the author. Agreed, I'm doing a lot more things to initialise a workable environment, initialise some helpers exposed to Lua, etc. But still, I was able to observe the following:
Here above, I'm:
Please note that even with the ffi.C.malloc() commented out, the 100 instances won't "totally dry run"; my Go code is still preparing some stuff and whatnot.

The runs WITHOUT the ffi.C.malloc() look like these:
The "virtual memory size" is the TOTAL macOS's As you can see, things are not dramatic there. After 100 runs I still have plenty of mcode regions available. Then If I uncomment the `L.RunStringFatal("ffi.C.malloc(1024 * 1024)") things go south :
As you can see, this time the availability drops. This is a testbed; if I'm doing something useful in my larger codebase, I go down to 0% (all 327xx regions unavailable) in something like 32 runs. Go allocations especially are eating the nearby regions.

I still have a mystery left: I don't know why Go+LuaJIT has a more prominent problem than LuaJIT standalone. I did not try C+LuaJIT. Probably in some cases LuaJIT code is not loaded at the same address. Maybe Go calls bin/ld with some flags which pack all the libs even closer together, I dunno.

Of course I know that I should close & reuse my LuaJIT instance; I was prototyping some stuff so I didn't do it yet. But I wanted to show some numbers to help people figure out what's going on when they trigger this problem.
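For reference, a minimal sketch of the kind of low-effort mmap probe described above (assumes POSIX mmap/munmap; a hypothetical helper, not the code from the benchmark branch):

```c
#include <stdint.h>
#include <sys/mman.h>

/* Ask mmap for a hint address WITHOUT MAP_FIXED; if the kernel hands back a
   different address, the region was most likely already occupied. Good enough
   for diagnosis, not for production use. */
static int region_seems_free(uintptr_t hint, size_t sz)
{
  void *p = mmap((void *)hint, sz, PROT_NONE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED) return 0;
  int was_free = ((uintptr_t)p == hint);
  munmap(p, sz);  /* undo the probe either way */
  return was_free;
}
```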
Then finally, hovering around this problem, I think I have a thought. Besides modifying LuaJIT to not be restricted by relative jumps anymore, which is probably something only Mike Pall and maybe a few living beings from another stellar plane can do, there may be a portable solution which would probably fix the problem of tightness in virtual memory caused by mallocs (C malloc, Go malloc, whatever allocates memory, even random kernel mmap()s). There could be:

The init stuff I'm thinking about is:

And then a "hook / callback" which LuaJIT would call to take ownership of those reserved region addresses. The idea is to reserve virtual address space as soon as possible (2GB on x86) around lj_vm_exit_handler(), to prevent other mallocs from taking it, and to give LuaJIT a way to take ownership of those pages.
Further explorations below (and may I point out that this is on a macOS Intel, not an M1). I'm going to run the aforementioned benchmark in both a C version and a Go version.

Both have their "runs" configured to eat, for each run:
(I think I observed that populating the allocations changed behaviour; possibly the malloc implementation (or the kernel) figures out whether you are using your allocations or not, which could be a deciding factor in the malloc implementation.)

This is the C code (minus standard #includes):
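(The original snippet did not survive the page extraction here. A stand-in consistent with the surrounding description — a loop repeatedly allocating three fixed-size buffers and populating their pages — might look like the following; the sizes and counts are guesses, not the original values:)

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in benchmark: hoard virtual memory with a few fixed-size buffers per
   "run" and touch the pages so the allocator actually backs them. */
int main(void)
{
  const size_t sizes[3] = { 1 << 20, 4 << 20, 16 << 20 };  /* guessed sizes */
  for (int run = 0; run < 1000; run++) {
    for (int i = 0; i < 3; i++) {
      char *p = malloc(sizes[i]);
      if (!p) { perror("malloc"); return 1; }
      memset(p, 0xAA, sizes[i]);  /* populate the pages */
      /* intentionally leaked, to keep eating address space */
    }
    if (run % 100 == 0) printf("run %d\n", run);
  }
  return 0;
}
```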
This is the Go code (minus the same includes as the C version):
Both are compiled with their platform-default toolchains. The C version yields:
Not very dramatic above. The Go version yields:
=== CONCLUSIONS

On the frequency of encountering "failed to allocate mcode memory" on macOS:
The Go version is not doing any real work, only repeatedly allocating 3 fixed-size buffers. In a real workload with many more buffer sizes, possibly aggravated by how the Go runtime handles its pool of memory, LuaJIT quickly triggers "failed to allocate mcode memory" due to the tightness and exhaustion of the virtual address space in the "relative jump" range. Without even doing ANY WORK AT ALL:
Note: of course, Go has NOT done "any work at all" by the time it has entered main().

Note: my measure of "availability" somehow starts near 50%; possibly I have some signedness bug in my "check range availability" algorithm, which would not change the fact that 0% availability is quickly encountered.
So my point is that, at least in the Go case on macOS (Intel), it's already too late by the time Go's main() runs: the jump-reachable range is already crowded once the Go runtime has entered. I'm out of solutions. Maybe figuring out:

Kind regards,
Finally, I can alleviate the problem on my platform (macOS 11.6; Intel processor; linked via cgo). You can compile LuaJIT so that it gets loaded at a chosen base address. The 0x488800000000 was chosen manually, but there are no specifics here: it's near the top of the 48 bits (~47.5 bits), with enough room below the 48-bit limit to not constrain the mcode region.

C run:
Go run:
As you can see, the C version (compiled with the "default" bin/gcc on macOS, so Xcode's gcc+ld) does not pick up the requested address, whereas the Go version honours it (linked with the default cgo tools/behaviour), which fixes the problem.
Here is a patch I have made to alleviate this problem internally. The changes are described in the commit message. It's X64 only, but should be trivially portable to ARM64 as well. It's not really a full-blown solution, but it kinda sorta works.
On macOS arm64 (Apple Silicon) this issue seems to be happening for me as well. I assume this issue won't be fixed any time soon, sadly. In a piece of software I am trying to run, I have no choice but to fall back to Lua.
@ShadowRoi I saw that @mraleph posted a patch for this issue. Have you tried it yet?
Same issue here.
@zhengying I have no idea how to apply this patch, and from what I see it is made only for x86, not for arm64.

@MarioSieg Just the fact that we have reached the M3 series and this issue is from 2017 means it's not looking good for getting fixed at all...
@ShadowRoi Yeah, I'll also disable the JIT on aarch64 for now, but it's a pity to have the JIT disabled because of a problem like this. I know that on x86-64 the direct jump range is a signed 32-bit displacement, so +-2GB, and on aarch64 it's much smaller because the offset is encoded into the 32-bit fixed-width instructions. I wonder how other JIT compilers handle this problem - is trampoline stub emission the only solution? @mraleph
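(For concreteness, the reach of the direct branches follows straight from the instruction encodings; a tiny worked example, standard ISA facts only:)

```c
#include <stdio.h>

/* Direct-branch reach, from the encodings:
 *   x86-64: JMP/CALL rel32 -> signed 32-bit byte displacement  = +-2GB
 *   arm64:  B/BL imm26     -> signed 26-bit word displacement  = +-2^27 bytes = +-128MB
 */
int main(void)
{
  long long x86_range   = 1LL << 31;        /* rel32: 2^31 bytes each way */
  long long arm64_range = (1LL << 25) * 4;  /* imm26: 2^25 words * 4 bytes */
  printf("x86-64: +-%lld MB\n", x86_range >> 20);   /* +-2048 MB */
  printf("arm64 : +-%lld MB\n", arm64_range >> 20); /* +-128 MB */
  return 0;
}
```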
@MarioSieg, on my platform (macOS x86_64), what fixed the problem was to compile LuaJIT with a chosen load address. The important matter was that, upon loading, there is enough free address space around it. On x86_64 this matters because LuaJIT uses... I don't remember... relative jumps or near jumps or something like that. I don't know your platform nor which compiler you use, but I suggest that instead of modifying the library, maybe you could hint either the linker or the dynamic loader as to which address it should load a particular library at? I chose my address (0x488800000000) semi-arbitrarily: far away from everything else, and easily recognizable.
This remark seems to confuse several things. Old versions (e.g. 1.x, 2.0) of LuaJIT had a restriction on data memory needing to be in the low 2GB of virtual address space, but this restriction does not apply to the 64-bit ports in the v2.1 branch. Virtual address space is per-process, so it shouldn't have any particular relation to boot time. Finally, this issue is about executable memory rather than data memory; the v2.1 x86_64 restriction on executable memory is +/-2GB from the lj_vm_exit_handler anchor.

@mraleph's comment above addresses this for x86_64, allowing an arbitrary point in virtual address space to be used as the center of the +/-2GB range. @folays's linker-address workaround attacks the same constraint.

For arm64 in particular, I've just written https://github.com/corsix/LuaJIT/commits/arm64-jump-range/, which should remove the restriction entirely, though it'll still try to cluster things together in +/-127MB regions when possible. It takes a more complex approach than @mraleph's, on the grounds that a single arbitrary +/-127MB range would still be quite small (compared to +/-2GB).
@corsix Whenever I try the make command to build it, it fails on macOS (Apple Silicon) for whatever reason.
Per the documented build instructions:
Personally I use |
Thanks for the information and interesting approach. |
Thanks for the patch Peter, |
This is an improvement request related to #282, #283, #284 and mmap() scalability issues.

mcode_alloc() uses half the range defined by `LJ_TARGET_RANGE` to make all allocated blocks mutually reachable by relative jumps. However, for multi-threaded applications with separate JIT states per thread, that requirement is unnecessary. The only required constraint is to make static assembler code reachable from all allocated blocks, which is ensured in mcode_alloc() by calculating `target` based on the `vm_exit_handler()` address.

I propose to change mcode_alloc() so that instead of all execution threads using the same allocation range of `[target - range/2; target + range/2]` (where `range` is the jump range defined by the architecture), some threads use the `[target - range; target]` range, and other ones use `[target; target + range]`.

I have an experimental patch assigning either one of the two possible ranges to JIT states based on a PRNG. The patch essentially doubles the available allocation pool for mcode_alloc() and helps reduce mmap() scalability issues for multi-threaded applications.
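To make the proposal concrete, here is a minimal sketch of the range selection (an illustration of the idea, not the actual experimental patch; type and helper names are hypothetical):

```c
#include <stdint.h>

/* Sketch of the proposal: each JIT state picks, once, either the range below
   the anchor or the range above it, instead of everyone sharing
   [target - range/2, target + range/2]. */
typedef struct jit_State_sketch {
  uintptr_t mcarea_lo, mcarea_hi;  /* allocation window for this JIT state */
} jit_State_sketch;

static void mcode_pick_range(jit_State_sketch *J, uintptr_t target,
                             uintptr_t range, uint32_t prng_bit)
{
  if (prng_bit & 1) {            /* half of the threads allocate below... */
    J->mcarea_lo = target - range;
    J->mcarea_hi = target;
  } else {                       /* ...and the other half above the anchor. */
    J->mcarea_lo = target;
    J->mcarea_hi = target + range;
  }
  /* Either way, every block in [mcarea_lo, mcarea_hi] stays within `range`
     of `target` (the static-code anchor), so exits/calls remain reachable,
     while the combined pool across threads is twice as large. */
}
```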