Increase available allocation range in mcode_alloc() #285

Open
akopytov opened this issue Feb 25, 2017 · 27 comments


akopytov commented Feb 25, 2017

This is an improvement request related to #282, #283, #284 and mmap() scalability issues.

mcode_alloc() uses half the range defined by LJ_TARGET_RANGE to make all allocated blocks mutually reachable by relative jumps. However, for multi-threaded applications with separate JIT states per thread, that requirement is unnecessary. The only required constraint is to make static assembler code reachable from all allocated blocks, which is ensured in mcode_alloc() by calculating target based on the vm_exit_handler() address.

I propose changing mcode_alloc() so that, instead of all execution threads using the same allocation range [target - range/2; target + range/2] (where range is the jump range defined by the architecture), some threads use the [target - range; target] range and the others use [target; target + range].

I have an experimental patch assigning either one of the two possible ranges to JIT states based on a PRNG. The patch essentially doubles the available allocation pool for mcode_alloc() and helps reduce mmap() scalability issues for multi-threaded applications.
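
For illustration (this is editorial, not the actual patch), a rough sketch of the proposal in C; the helper name and the per-state PRNG bit are placeholders, and clamping against the ends of the address space is omitted:

#include <stdint.h>

/* Hypothetical sketch: pick one half of the reachable range per JIT state,
** instead of centering every state on the same
** [target - range/2, target + range/2) window. */
static void mcode_pick_range(uintptr_t target, uintptr_t range,
                             uint32_t prng_bit, uintptr_t *lo, uintptr_t *hi)
{
  if (prng_bit & 1) {  /* Some JIT states allocate below the static code... */
    *lo = target - range;
    *hi = target;
  } else {             /* ...and the others allocate above it. */
    *lo = target;
    *hi = target + range;
  }
  /* mcode_alloc() would then probe mmap() hints within [*lo, *hi), so every
  ** block stays within 'range' of lj_vm_exit_handler(), but blocks belonging
  ** to different JIT states no longer have to reach each other. */
}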

akopytov added a commit to akopytov/LuaJIT that referenced this issue Feb 25, 2017
Change mcode_alloc() so that instead of all execution threads using the
same allocation range of '[target - range/2; target + range / 2]' (where
'range' is the jump range defined by the architecture), make some
threads use the '[target - range; target]' range, and other ones use
'[target; target + range]' based on a PRNG. The patch essentially
doubles the available allocation pool for mcode_alloc() and helps reduce
mmap() scalability issues for multi-threaded applications.

MikePall commented Mar 8, 2017

This would cause problems when the half of the range that was picked is already occupied. This can happen when shared libraries are tightly packed into the virtual address space.

I'm more inclined to eventually fix the underlying problem: remove the necessity for limited-range jumps and calls to support code in the LuaJIT core. Replace with address load+jump/call and/or trampolines, when we can't get a convenient mcode allocation. Some architectures need both approaches. This is quite architecture-dependent, though, and would need lots of testing.
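
For concreteness, a hedged sketch of the "address load + indirect call" variant on ARM64; this is purely illustrative, assumes 48-bit virtual addresses, and the scratch-register choice and helper name are not LuaJIT's:

#include <stdint.h>

/* Emit MOVZ/MOVK/MOVK + BLR so the call reaches 'addr' regardless of how far
** away the mcode block was allocated.  A trampoline-based variant would
** instead emit a short BL to a nearby stub containing this same sequence. */
static uint32_t *emit_far_call_arm64(uint32_t *p, uint64_t addr, uint32_t scratch)
{
  *p++ = 0xD2800000u | ((uint32_t)(addr & 0xffff) << 5) | scratch;         /* MOVZ xS, #addr[15:0] */
  *p++ = 0xF2A00000u | ((uint32_t)((addr >> 16) & 0xffff) << 5) | scratch; /* MOVK xS, #addr[31:16], LSL #16 */
  *p++ = 0xF2C00000u | ((uint32_t)((addr >> 32) & 0xffff) << 5) | scratch; /* MOVK xS, #addr[47:32], LSL #32 */
  *p++ = 0xD63F0000u | (scratch << 5);                                     /* BLR xS */
  return p;
}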

akopytov added a commit to akopytov/LuaJIT that referenced this issue Apr 2, 2017

topameng commented May 6, 2017

I'd like a version without relative jumps, or a macro to control this; mcode_alloc() often fails on some Android devices.


MikuAuahDark commented Jul 31, 2020

I'd like to point out that on Android ARM64, mcode_alloc() fails 95% of the time. This means LuaJIT with the JIT compiler turned on is slower than LuaJIT with the JIT compiler off on such platforms.

I'm writing Lua code that transforms thousands of vertices, and these transformations must be performed every game loop (16.67 ms on a 60 Hz monitor). So all code that is called every single frame must be optimized. One particularly heavy math function is called almost a thousand times per update call. Almost every code path dealing with those transformations is JIT-compilable. But this also means traces are generated at a high rate (-jdump writes 16 MB and lists almost 1000 traces on desktop!), which is somewhat fine for desktops.

Sadly, on Android, this is all a nightmare. Almost every piece of code that LuaJIT tries to compile fails because mcode_alloc() cannot allocate memory, reducing overall performance to half that of the interpreter alone. Tweaking the JIT options gives some improvement (maxmcode=524288 and sizemcode=512 seem to work for me, but I'm no expert at tuning the parameters) and a huge boost to performance, but only for a short time. As LuaJIT tries to allocate more memory for code, it fails. The worst part is that it's hard for LuaJIT to allocate memory near lj_vm_exit_handler(): when mcode_alloc() fails it causes jit.flush() to be called, and it's hard for LuaJIT to start JIT compilation again.
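
For what it's worth, a minimal sketch of how an embedding application could apply the same options (the values are just the ones reported above, not recommendations; jit.opt ships with LuaJIT):

#include <lua.h>
#include <lauxlib.h>
#include <lualib.h>

int main(void)
{
  lua_State *L = luaL_newstate();
  luaL_openlibs(L);
  /* Same effect as starting the standalone binary with:
  **   luajit -Omaxmcode=524288 -Osizemcode=512 game.lua   (game.lua is a stand-in)
  ** maxmcode/sizemcode are in KB, so this allows up to 512 MB of mcode in
  ** 512 KB areas -- tune to taste. */
  luaL_dostring(L, "require('jit.opt').start('maxmcode=524288', 'sizemcode=512')");
  /* ... load and run the application's Lua code here ... */
  lua_close(L);
  return 0;
}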

I'd like to say that my case is one where the allocation range for mcode_alloc() should be increased or, if possible, where these restrictions should be lifted altogether. However, I can understand that this is not easy to do. If there's some workaround that I can try, please let me know.

The tests above were done using LuaJIT-2.1.0-beta3 384d6d5.

@ExtReMLapin

It should not regenerate different traces on each call.

I had this bug where a function called inside a vararg function was being JIT-compiled a shit ton of times, only because the compiler didn't "understand" it was the same function again and again.


ExtReMLapin commented Jul 31, 2020

Here was the exact bug, in this commit (which is intentionally shit code, for the sake of the example):

https://github.com/ExtReMLapin/md5.lua/commit/5964ccdfb7192b6437883e0080fb6e09b5f6742f

The old working version (right before this commit) produces 6 traces and takes 0.288 s to execute.

The version with the vararg produces 112 traces and takes 0.95 s to execute.

main.lua :

local md5 = require("md5")


local time = os.clock()
local i = 0
while (i < 50000) do
	assert(md5.sum("mdrcaca")) 
	i = i + 1
end

print(os.clock()-time)


MikePall commented Aug 1, 2020

@ExtReMLapin Please stop spamming completely unrelated bug reports with whatever is on your mind currently. You're not helping.


MikePall commented Aug 1, 2020

Anyone is welcome to submit a patch as described in my comment above: #285 (comment)


slime73 commented Oct 31, 2021

(I think this is the right place to post this, but I'm happy to move it to a different issue if desired.)

My software on macOS+arm64 is hitting something similar to what @MikuAuahDark was describing on Android – which means until this is fixed or I find some sort of workaround, the arm64 JIT compiler is unfortunately worse than useless for me (since if I leave it on, it performs worse than with JIT turned off).

If I run this simple test script with the LuaJIT standalone executable on an Apple Silicon Mac using the latest commit b4b2dce, it will reproduce the problem most of the time it's run. My full codebase has more C++ code going on, so it can be repro'd 100% of the time there instead of just most of the time. I also made a second test using GLFW instead of SDL2 and had the same issue.

local ffi = require("ffi")

ffi.cdef[[
int SDL_InitSubSystem(uint32_t flags);
void SDL_QuitSubSystem(uint32_t flags);

typedef struct SDL_Window SDL_Window;
SDL_Window * SDL_CreateWindow(const char *title, int x, int y, int w, int h, uint32_t flags);
void SDL_DestroyWindow(SDL_Window * window);
]]

-- SDL2 needs to be installed where dlopen can find it. Or you can download the
-- macOS framework from libsdl.org, and point this to the binary inside it.
local sdl = ffi.load("SDL2")

local initflags = 0xFFFFFFFF
sdl.SDL_InitSubSystem(initflags)

local window = sdl.SDL_CreateWindow("", 100, 100, 800, 600, 0)

local n = 0
for i=1, 10000 do
	n = n + i
end
print(n)

sdl.SDL_DestroyWindow(window)
sdl.SDL_QuitSubSystem(initflags)

My repro steps are:

  • Have a build of the latest LuaJIT code, and have SDL2 installed somewhere.
  • Put the above code into a file.
  • Run luajit -jdump main.lua
  • Observe hundreds of trace aborts.

With -jdump active, the above script usually results in hundreds of trace aborts and no successful traces. The following jdump output is repeated hundreds of times:

---- TRACE 1 start main.lua:22
0029  ADDVV    4   4   8
0030  FORL     5 => 0029
---- TRACE 1 abort main.lua:23 -- failed to allocate mcode memory

---- TRACE flush

If anyone has fix or workaround suggestions I'm happy to test them, although I don't think I have the skills or time necessary to try to make my own changes to LuaJIT's source code.


Vixeliz commented Jun 4, 2022

Is this something that someone with little experience with the LuaJIT code base (or similar code bases) could research and learn how to do in a reasonable time (say, a few months)? If so, where would such a person even start learning what's necessary to fix this?

@MikuAuahDark

I don't think this is a trivial issue. I think it would require a lot of changes to LuaJIT if it ever happens.

Quoting Mike Pall above:

I'm more inclined to eventually fix the underlying problem: remove the necessity for limited-range jumps and calls to support code in the LuaJIT core. Replace with address load+jump/call and/or trampolines, when we can't get a convenient mcode allocation. Some architectures need both approaches. This is quite architecture-dependent, though, and would need lots of testing.


folays commented Mar 1, 2023

Hi, I'm going to chime in with some recent observations on this, to help others diagnose.
I encountered "failed to allocate mcode memory" in my traces.

I had trouble reproducing it in a testbed environment.

  • Problematic setup: macOS 11.6.2, LuaJIT linked in via cgo (Go language)
  • Testbed setup: the LuaJIT binary

Of course on LuaJIT v2.1 HEAD.

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<
Here is the testbed setup, which yields non-problematic behaviour:

I created this branch https://github.com/folays/LuaJIT/tree/bench_mcode_alloc from latest v2.1.
My commit folays@78d746b

  • exposes a global Lua function G.lj_vm_exit_handler() which returns the uintptr_t of where that symbol is found.
  • adds bench_mcode_alloc.lua, a LuaJIT script which sub-launches 1000 C-level (LuaJIT library) lua_States to "eat" some memory.

So you would have to compile my branch, because you need the new global function for what follows.

Every 100 loop iterations, bench_mcode_alloc.lua checks every possible mcode_alloc() candidate region, to diagnose what fraction of them is still free. You launch the script with luajit -e __main__=true bench_mcode_alloc.lua

Here are the results:

TARGET lj_vm_exit_handler 0x10ac24ebb
TARGET 0x10ac20000
RANGE 0x3fe00000
TOTAL RANGES : 32704
[  100] RANGES AVAILABLE: yes 15990 (48.893%) no 16714 ; vmsize 1536.000000 MB
[  200] RANGES AVAILABLE: yes 15689 (47.973%) no 17015 ; vmsize 2457.600000 MB
[  300] RANGES AVAILABLE: yes 15387 (47.049%) no 17317 ; vmsize 2867.200000 MB
[  400] RANGES AVAILABLE: yes 15087 (46.132%) no 17617 ; vmsize 3276.800000 MB
[  500] RANGES AVAILABLE: yes 14787 (45.215%) no 17917 ; vmsize 3686.400000 MB
[  600] RANGES AVAILABLE: yes 14487 (44.297%) no 18217 ; vmsize 4198.400000 MB
[  700] RANGES AVAILABLE: yes 14187 (43.380%) no 18517 ; vmsize 4812.800000 MB
[  800] RANGES AVAILABLE: yes 13885 (42.457%) no 18819 ; vmsize 5222.400000 MB
[  900] RANGES AVAILABLE: yes 13585 (41.539%) no 19119 ; vmsize 5529.600000 MB
[ 1000] RANGES AVAILABLE: yes 13285 (40.622%) no 19419 ; vmsize 6144.000000 MB
END

Each of the 1000 children does the following:

  • ffi.C.malloc() allocates AND populates the pages of 3 "blocks": 8000 kB, 1 kB, 3 MB
  • ffi.C.malloc() allocates (but does not populate) one block of 1 MB
  • I don't ffi.C.free() anything, nor "close" the LuaJIT states.

(keep in mind that whether those pages are populated is not the issue; the problem is only the hoarding of virtual memory regions which could have satisfied mcode_alloc())

As you can see in the results above, there wasn't really any problem.
After 1000 runs, 40% of the 32704 possible "areas" were still available for mcode_alloc().

Hint: I arrived at 32704 because I did not want to get the arithmetic wrong, so instead I looped over 1 million quick-and-dirty PRNG generations of the possible target address, statistically covering 99.99% of the possible target addresses in the same way mcode_alloc() does.

Hint 2: To "check" whether a region is free I take a low-effort approach: I just mmap(requested_address) (without MAP_FIXED) and then munmap() afterwards; if mmap() did not return the requested address, it was probably already allocated. I wouldn't put such a thing in production, but for the diagnosis here the assumption seems okay.
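
In C terms, that probe amounts to roughly the following (simplified; the real check lives in the branch linked above):

#include <stddef.h>
#include <sys/mman.h>

/* A hinted, non-MAP_FIXED mmap() only lands on the hint if that range is
** free, so comparing the result with the hint gives a cheap (if not
** airtight) availability test. */
static int range_looks_free(void *hint, size_t sz)
{
  void *p = mmap(hint, sz, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  if (p == MAP_FAILED)
    return 0;
  int ok = (p == hint);
  munmap(p, sz);
  return ok;
}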

Maybe if I actively tried to malloc() some specific sizes I could have driven the free rate to 0% more quickly, but when not actively trying to cause problems, everything seems fine.

-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<
Here is the problematic setup:
LuaJIT linked into Go (cgo) via #cgo pkg-config: luajit

Sadly I'm not able to share the whole code, but I can at least say that if the LuaJIT<->Go bridge were doing something outrageous, I would know it, because I'm its author.

Granted, I'm doing a lot more to initialise a workable environment, register some helpers exposed to Lua, etc.

Still, I was able to observe the following:

func main() {
	Lmonitor := luajit.NewState()
	Lmonitor.ReasonableDefaults()
	Lmonitor.RunCodeFatal("luajit_mcode.diagnose(({...})[1])", 0)

	for j := 1; j <= 100; j++ {
		L := luajit.NewState()
		L.ReasonableDefaults()

		L.RunString(`ffi.C.malloc(1024 * 1024)`) // 1 MB

		if j%10 == 0 {
			Lmonitor.RunCodeFatal("luajit_mcode.diagnose(({...})[1])", j)
		}
	}
}

In the code above I:

  • initialise one Lmonitor LuaJIT instance (which runs code similar to the bench_mcode_alloc.lua discussed above)
  • initialise 100 LuaJIT instances, inside which I may or may not run some ffi.C.malloc().

Please note that even with the ffi.C.malloc() commented out, the 100 instances won't be a "totally dry run": my Go code still prepares some things, I even have a for i=1,100 do end to force at least one mcode region in each instance, some Lua functions are run, some little things are done, but NO BIG ALLOCATION. Only small init stuff, defining functions, etc. I'm not doing things like allocating multi-kB arrays in my "base" initialisation code.

The runs WITHOUT the ffi.C.malloc() look like this:

[    0] MCODE RANGES: [available  5935 (18.15%)] [not available 26769] ; virtual memory size  1433.600 MB
[   10] MCODE RANGES: [available  5627 (17.21%)] [not available 27077] ; virtual memory size  1433.600 MB
[   20] MCODE RANGES: [available  5607 (17.14%)] [not available 27097] ; virtual memory size  1433.600 MB
[   30] MCODE RANGES: [available  5587 (17.08%)] [not available 27117] ; virtual memory size  1433.600 MB
[   40] MCODE RANGES: [available  5567 (17.02%)] [not available 27137] ; virtual memory size  1433.600 MB
[   50] MCODE RANGES: [available  5547 (16.96%)] [not available 27157] ; virtual memory size  1433.600 MB
[   60] MCODE RANGES: [available  5511 (16.85%)] [not available 27193] ; virtual memory size  1433.600 MB
[   70] MCODE RANGES: [available  5491 (16.79%)] [not available 27213] ; virtual memory size  1433.600 MB
[   80] MCODE RANGES: [available  5471 (16.73%)] [not available 27233] ; virtual memory size  1433.600 MB
[   90] MCODE RANGES: [available  5451 (16.67%)] [not available 27253] ; virtual memory size  1433.600 MB
[  100] MCODE RANGES: [available  5431 (16.61%)] [not available 27273] ; virtual memory size  1433.600 MB

The "virtual memory size" is the TOTAL macOS's bin/vmmap of the whole vmem.

As you can see, things are not dramatic there. After 100 runs I still have plenty of mcode regions available.

Then, if I uncomment the L.RunStringFatal("ffi.C.malloc(1024 * 1024)") line, things go south:

[    0] MCODE RANGES: [available  5903 (18.05%)] [not available 26801] ; virtual memory size  1433.600 MB
[   10] MCODE RANGES: [available  1659 ( 5.07%)] [not available 31045] ; virtual memory size  1638.400 MB
[   20] MCODE RANGES: [available  1639 ( 5.01%)] [not available 31065] ; virtual memory size  1638.400 MB
[   30] MCODE RANGES: [available  1619 ( 4.95%)] [not available 31085] ; virtual memory size  1638.400 MB
[   40] MCODE RANGES: [available  1463 ( 4.47%)] [not available 31241] ; virtual memory size  1843.200 MB
[   50] MCODE RANGES: [available  1443 ( 4.41%)] [not available 31261] ; virtual memory size  1843.200 MB
[   60] MCODE RANGES: [available  1423 ( 4.35%)] [not available 31281] ; virtual memory size  1945.600 MB
[   70] MCODE RANGES: [available  1403 ( 4.29%)] [not available 31301] ; virtual memory size  1945.600 MB
[   80] MCODE RANGES: [available  1383 ( 4.23%)] [not available 31321] ; virtual memory size  1945.600 MB
[   90] MCODE RANGES: [available  1363 ( 4.17%)] [not available 31341] ; virtual memory size  1945.600 MB
[  100] MCODE RANGES: [available  1343 ( 4.11%)] [not available 31361] ; virtual memory size  1945.600 MB

As you can see, this time the ffi.C.malloc() eats away the space near lj_vm_exit_handler() very quickly.

This is a testbed; when I'm doing something useful in my larger codebase, I go down to 0% available (all 327xx regions unavailable) in something like 32 runs.

Go's own allocations in particular also eat the space near lj_vm_exit_handler().

One mystery remains: I don't know why Go+LuaJIT has a more prominent problem than standalone LuaJIT. I did not try C+LuaJIT. Probably in some cases the LuaJIT code is not loaded at the same address. Maybe Go calls bin/ld with flags that pack all the libraries even closer together, I don't know.

Of course I know that I should close & reuse my LuaJIT instances. I was prototyping, so I haven't done that yet.

But I wanted to show some numbers to help people figure out what's going on when they trigger this problem.


folays commented Mar 1, 2023

Finally, after hovering around this problem for a while, I have a thought.

Besides modifying LuaJIT so that it is no longer restricted by relative jumps, which is probably something only Mike Pall and maybe a few living beings from another stellar plane can do, there may be a portable solution which would address the tightness in virtual memory caused by malloc()s (C malloc, Go malloc, anything that allocates memory, even random kernel mmap()s).

There could be:

  • some automatic init() at LuaJIT library-load time (if that is portable behaviour)
  • or an obligation on the user to do some init work

The init work I have in mind (sketched below):

  • automatically give (or ask the user to obtain) the address of lj_vm_exit_handler(), and have the user reserve as much virtual memory as possible in the "relative jump size" area around it

And then a hook/callback which LuaJIT would call to take ownership of those reserved regions.

The idea is to reserve the virtual memory as early as possible (2 GB on x86) around lj_vm_exit_handler(), to prevent other mallocs() from taking it, and to give LuaJIT a way to take ownership of those pages.
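
A hedged sketch of what that reservation could look like (the reservation itself is plain mmap(); the hand-over hook named at the end does not exist in LuaJIT today and is purely illustrative):

#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

extern void lj_vm_exit_handler(void);  /* the VM symbol mcode_alloc() centers on */

static void *g_reserved;
static size_t g_reserved_size;

/* Call as early as possible, before other allocations fragment the range.
** 'range' would be about 2 GB on x86-64 (per the comment above), far less on arm64. */
static void reserve_mcode_window(size_t range)
{
  uintptr_t target = (uintptr_t)lj_vm_exit_handler;
  /* PROT_NONE keeps the address space ours without committing memory; no
  ** MAP_FIXED, so an unlucky hint merely degrades to the status quo. */
  void *p = mmap((void *)(target - range / 2), range, PROT_NONE,
                 MAP_PRIVATE | MAP_ANON, -1, 0);
  if (p != MAP_FAILED) {
    g_reserved = p;
    g_reserved_size = range;
    /* A hypothetical hook would then hand pages back to LuaJIT on demand,
    ** e.g. luaJIT_setmcodereserve(g_reserved, g_reserved_size);  -- not a real API */
  }
}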


folays commented Mar 1, 2023

Further explorations below (and let me note that this is on an Intel Mac, not an M1).

I saw (with otool -L /usr/local/bin/luajit) that the LuaJIT binary does not link against libluajit but embeds the code, so that may skew observations.

I'm going to run the aforementioned bench_mcode_alloc.lua in 2 very simple setups:

  • with minimal C
  • with minimal Go+cgo

Both have their "runs" configured so that each run eats:

  • 800 kB, allocated and populated
  • 1 kB, allocated but not populated
  • 1 MB, allocated but not populated

(I think I observed that populating the allocations changes the behaviour; possibly the malloc implementation (or the kernel) notices whether you are actually using your allocations, which could be a deciding factor inside the malloc implementation.)

This is the C code (minus the standard #includes):

int main() {
    lua_State *L;

	L = luaL_newstate();
	luaL_openlibs(L);

	lua_pushstring(L, "__main__");
	lua_pushboolean(L, 1);
	lua_settable(L, LUA_GLOBALSINDEX);

	luaL_loadfile(L, "../bench_mcode_alloc.lua");
	lua_pcall(L, 0, 0, 0);
}

This is the Go code (with the same includes omitted as in the C version):

func main() {
	L := C.luaL_newstate()
	C.luaL_openlibs(L)

	C.lua_pushstring(L, C.CString("__main__"))
	C.lua_pushboolean(L, 1)
	C.lua_settable(L, C.LUA_GLOBALSINDEX)

	C.luaL_loadfile(L, C.CString("../bench_mcode_alloc.lua"))
	C.lua_pcall(L, 0, 0, 0)
}

Both are compiled with pkg-config --cflags/--libs luajit ;

The C version yields :

$ otool -L test-c
test-c:
	/usr/local/lib/libluajit-5.1.2.dylib (compatibility version 2.1.0, current version 2.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)

TARGET lj_vm_exit_handler 0x104e7a29b
TARGET 0x104e70000
RANGE 0x3fe00000
TOTAL RANGES : 32704
[   20] RANGES AVAILABLE: yes 16229 (49.624%) no 16475 ; vmsize 1024.000000 MB
[   40] RANGES AVAILABLE: yes 16169 (49.440%) no 16535 ; vmsize 1228.800000 MB
[   60] RANGES AVAILABLE: yes 16109 (49.257%) no 16595 ; vmsize 1228.800000 MB
[   80] RANGES AVAILABLE: yes 16049 (49.074%) no 16655 ; vmsize 1228.800000 MB
[  100] RANGES AVAILABLE: yes 15989 (48.890%) no 16715 ; vmsize 1331.200000 MB
[  120] RANGES AVAILABLE: yes 15926 (48.697%) no 16778 ; vmsize 1331.200000 MB
[  140] RANGES AVAILABLE: yes 15866 (48.514%) no 16838 ; vmsize 1433.600000 MB
[  160] RANGES AVAILABLE: yes 15806 (48.330%) no 16898 ; vmsize 1638.400000 MB
[  180] RANGES AVAILABLE: yes 15746 (48.147%) no 16958 ; vmsize 1638.400000 MB
[...]
[  400] RANGES AVAILABLE: yes 15086 (46.129%) no 17618 ; vmsize 2252.800000 MB
[...]
[  620] RANGES AVAILABLE: yes 14424 (44.105%) no 18280 ; vmsize 2662.400000 MB
[...]
[  780] RANGES AVAILABLE: yes 13942 (42.631%) no 18762 ; vmsize 2969.600000 MB

Not very dramatic above.

The Go version yields :

$ otool  -L test-go
test-go:
	/usr/local/lib/libluajit-5.1.2.dylib (compatibility version 2.1.0, current version 2.1.0)
	/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1311.0.0)

TARGET lj_vm_exit_handler 0x10011929b
TARGET 0x100110000
RANGE 0x3fe00000
TOTAL RANGES : 32704
[   20] RANGES AVAILABLE: yes 1016 (3.107%) no 31688 ; vmsize 2048.000000 MB
[   40] RANGES AVAILABLE: yes 698 (2.134%) no 32006 ; vmsize 2150.400000 MB
[   60] RANGES AVAILABLE: yes 366 (1.119%) no 32338 ; vmsize 2252.800000 MB
[   80] RANGES AVAILABLE: yes 306 (0.936%) no 32398 ; vmsize 2355.200000 MB
[  100] RANGES AVAILABLE: yes 246 (0.752%) no 32458 ; vmsize 2457.600000 MB
[  120] RANGES AVAILABLE: yes 186 (0.569%) no 32518 ; vmsize 2457.600000 MB
[  140] RANGES AVAILABLE: yes 126 (0.385%) no 32578 ; vmsize 2457.600000 MB
[  160] RANGES AVAILABLE: yes 50 (0.153%) no 32654 ; vmsize 2457.600000 MB
[  180] RANGES AVAILABLE: yes 1 (0.003%) no 32703 ; vmsize 2457.600000 MB << failed to allocate mcode memory
[  200] RANGES AVAILABLE: yes 1 (0.003%) no 32703 ; vmsize 2560.000000 MB
[  220] RANGES AVAILABLE: yes 1 (0.003%) no 32703 ; vmsize 2662.400000 MB
[  240] RANGES AVAILABLE: yes 1 (0.003%) no 32703 ; vmsize 2662.400000 MB
[  260] RANGES AVAILABLE: yes 1 (0.003%) no 32703 ; vmsize 2662.400000 MB

=== CONCLUSIONS

On the frequency of encountering "failed to allocate mcode memory" on macOS:

  • When using C, things seem okay in the minimal 10-line test
  • When using Go + cgo, it quickly becomes dramatic

The Go version is not doing any real work, and only repeatedly allocates 3 fixed-size buffers.

In a real workload with many more buffer sizes, possibly aggravated by how the Go runtime manages its own pools of memory, LuaJIT quickly triggers "failed to allocate mcode memory" due to the tightness and exhaustion of the virtual memory in the "relative jump" range.

Without even doing ANY WORK AT ALL:

  • for C, at startup, main() is entered with ~50% of the 327xx regions of 32 kB (sizemcode) available
  • for Go, at startup, main() is entered with only ~3% of those 327xx regions of 32 kB available

Note: of course Go has NOT really done "no work at all" by the time it enters main(); the runtime has probably already started to allocate and initialise some bookkeeping memory areas.

Note: my measure of "availability" starts near only 50%; possibly I have a signedness bug in my range-availability check, but that would not change the fact that 0% availability is reached quickly.


folays commented Mar 1, 2023

So my point is that, at least in the Go case on macOS (Intel), it is already too late in a Go init() func to run some "reserving" procedure (pre-allocating virtual memory in the "relative jump area", 2 GB on x86_64) with the aim of gifting it to LuaJIT when it needs it.

Because on macOS-Intel-Go it is already too late once the Go runtime enters main(): due to how the platform packs everything tightly together in virtual memory, LuaJIT is left with only 3% of the 2 GB pool near lj_vm_exit_handler (and that is without doing ANY real workload; once you start malloc()'ing, it goes down very, very quickly).

I'm out of solutions.

Maybe it's worth figuring out:

  • (a) whether there is a way to tell the platform (at link time, or run time) to mmap() the libluajit library (I mean the .so/.dylib) somewhere else, so it isn't packed close to other memory areas, or maybe to specify a given load address.
  • (b) otherwise the LuaJIT library would need... I dunno... to have a virtual size of 2 GB... Maybe one of the symbols could be a huge run of NOPs, or a huge .data section... The kernel mmap()s it on load anyway, so the size would not matter? Then LuaJIT would know it can re-use its own library mappings full of useless NOPs for mcode?

Kind Regards,


folays commented Mar 2, 2023

Finally, I can alleviate the problem on my platform (macOS 11.6; Intel processor; linked via cgo).

You can compile libluajit.so with LDFLAGS="-Wl,-image_base,0x488800000000" (and then make install)

The 0x488800000000 was chosen manually, but there is nothing special about it. It's near the top of the 48-bit address space (~47.5 bits), with enough room left below the 48-bit limit not to constrain the mcode region.

C run :

TARGET lj_vm_exit_handler 0x10c1b629b
TARGET 0x10c1b0000
RANGE 0x3fe00000
TOTAL RANGES : 32704
[   20] RANGES AVAILABLE: yes 16229 (49.624%) no 16475 ; vmsize 1126.400000 MB
[...] // linear usage
[ 1000] RANGES AVAILABLE: yes 13281 (40.610%) no 19423 ; vmsize 3379.200000 MB

Go run :

TARGET lj_vm_exit_handler 0x48880000529b
TARGET 0x488800000000
RANGE 0x3fe00000
TOTAL RANGES : 32704
[   20] RANGES AVAILABLE: yes 32673 (99.905%) no 31 ; vmsize 1945.600000 MB
[...] // linear usage
[ 1000] RANGES AVAILABLE: yes 31693 (96.909%) no 1011 ; vmsize 4096.000000 MB

As you can see, the C version (compiled with the "default" bin/gcc on macOS, i.e. Xcode's gcc+ld) does not pick up the requested address.

Whereas the Go version honours it (linked with the default cgo tools / behaviour), which fixes the problem.


mraleph commented Mar 2, 2023

Here is a patch I have made to alleviate this problem internally. The changes are described in the commit message. It's X64 only, but should be trivially portable to ARM64 as well. It's not really a full blown solution, but it kinda sorta works.

@ShadowRoi

On macOS arm64 (Apple Silicon) this issue seems to be happening for me as well. I assume it won't be fixed any time soon, sadly. In a piece of software I am trying to run, I have no choice but to fall back to plain Lua.

@zhengying

On macOS arm64 (Apple Silicon) this issue seems to be happening for me as well. I assume it won't be fixed any time soon, sadly. In a piece of software I am trying to run, I have no choice but to fall back to plain Lua.

@ShadowRoi I saw that @mraleph posted a patch for this issue. Have you tried it yet?

@MarioSieg

Same issue here.
LuaJIT fails to allocate machine code areas for each trace on an M3 Pro (AArch64).
I'll try again if my machine just booted and the total consumed memory is below 2G.
There really should be a workaround/fix because the issue will grow even more in the future...

@ShadowRoi

@zhengying I have no idea how to apply this patch, and from what I can see it is made only for x86, not for arm64.

@MarioSieg Just the fact that we have reached the M3 series while this issue dates from 2017 means it is not looking good for getting fixed at all.
Even if a workaround is made, it won't be ideal unless it is merged into main. The software will have to be written so that, on macOS with an ARM architecture, it disables the JIT entirely.

@MarioSieg

@ShadowRoi Yeah, I'll also disable the JIT on aarch64 for now, but it's a pity to have the JIT disabled because of a problem like this. I know that on x86-64 the direct jump range is a signed 32-bit offset, so +/-2 GB, while on aarch64 it's much smaller because the offset is encoded into the fixed 32-bit instructions. I wonder how other JIT compilers handle this problem - is trampoline stub emission the only solution? @mraleph
For our game engine I might do a workaround by allocating the jump-range amount of virtual memory very early in the program and hacking LuaJIT's mcode allocator to return slices of that vmem, which are inside the jump interval.
Any other ideas? @MikePall


folays commented Feb 29, 2024

@MarioSieg, on my platform (macOS x86_64), what fixed the problem was to compile libluajit.so with LDFLAGS="-Wl,-image_base,0x488800000000"; the flags probably vary depending on the platform AND the linker / dynamic linker.

The important point was that, when libluajit.so is loaded as a dynamic library dependency of a binary, the dynamic loader is hinted to load (~mmap) the library's "code" at an isolated address, far from other virtual memory allocations.

On x86_64 this matters because LuaJIT uses... I don't remember... relative jumps or near jumps or something like that.

I don't know your platform or which compiler you use, but maybe instead of modifying the library, you could hint either the linker or the dynamic loader as to which address it should load a particular library at?

I chose my address (0x488800000000) semi-arbitrarily: far away from everything else, and easily recognizable.


corsix commented Feb 29, 2024

I'll try again if my machine just booted and the total consumed memory is below 2G.

This remark seems to confuse several things. Old versions (e.g. 1.x, 2.0) of LuaJIT had a restriction on data memory needing to be in the low 2GB of virtual address space, but this restriction does not apply to the 64-bit ports in the v2.1 branch. Virtual address space is per-process, so shouldn't have any particular relation to boot time. Finally, this issue is about executable memory rather than data memory; the v2.1 x86_64 restriction on executable memory is +/-2GB from the lj_vm_exit_handler symbol, while the v2.1 arm64 restriction on executable memory is +/-127MB from the lj_vm_exit_handler symbol.


@mraleph's comment above addresses this for x86_64, allowing an arbitrary point in virtual address space to be used as the center of the +/-2GB range. @folays's -image_base comment above is also a neat way of putting the lj_vm_exit_handler symbol at an arbitrary point in the virtual address space.

For arm64 in particular, I've just written https://github.com/corsix/LuaJIT/commits/arm64-jump-range/, which should remove the restriction entirely, though it'll still try to cluster things together in +/-127MB regions when possible. It takes a more complex approach than @mraleph's, on the grounds that a single arbitrary +/-127MB range would still be quite small (compared to +/-2GB).

@ShadowRoi

@corsix Whenever I try the make command to build it, it fails on macOS (Apple Silicon) for whatever reason.
[screenshot of the build error]


corsix commented Mar 1, 2024

Per the documented build instructions:

Note for macOS: you must set the MACOSX_DEPLOYMENT_TARGET environment variable to a value supported by your toolchain:

MACOSX_DEPLOYMENT_TARGET=XX.YY make

Personally I use MACOSX_DEPLOYMENT_TARGET=10.13, but you'll want to use something based on a combination of the minimum macOS requirement imposed on your users and/or the XCode toolchain versions you have installed.

@MarioSieg

@MarioSieg, on my platform (macOS x86_64), what fixed the problem was to compile libluajit.so with LDFLAGS="-Wl,-image_base,0x488800000000"; the flags probably vary depending on the platform AND the linker / dynamic linker.

The important point was that, when libluajit.so is loaded as a dynamic library dependency of a binary, the dynamic loader is hinted to load (~mmap) the library's "code" at an isolated address, far from other virtual memory allocations.

On x86_64 this matters because LuaJIT uses... I don't remember... relative jumps or near jumps or something like that.

I don't know your platform or which compiler you use, but maybe instead of modifying the library, you could hint either the linker or the dynamic loader as to which address it should load a particular library at?

I chose my address (0x488800000000) semi-arbitrarily: far away from everything else, and easily recognizable.

Thanks for the information and the interesting approach.
My 3 supported platforms are Windows/Linux x86-64 and macOS AArch64; the problem only arose on AArch64 (as the reachable relative jump range is much smaller). I'll try your approach too!

@MarioSieg

I'll try again if my machine just booted and the total consumed memory is below 2G.

This remark seems to confuse several things. Old versions (e.g. 1.x, 2.0) of LuaJIT had a restriction on data memory needing to be in the low 2GB of virtual address space, but this restriction does not apply to the 64-bit ports in the v2.1 branch. Virtual address space is per-process, so shouldn't have any particular relation to boot time. Finally, this issue is about executable memory rather than data memory; the v2.1 x86_64 restriction on executable memory is +/-2GB from the lj_vm_exit_handler symbol, while the v2.1 arm64 restriction on executable memory is +/-127MB from the lj_vm_exit_handler symbol.

@mraleph's comment above addresses this for x86_64, allowing an arbitrary point in virtual address space to be used as the center of the +/-2GB range. @folays's -image_base comment above is also a neat way of putting the lj_vm_exit_handler symbol at an arbitrary point in the virtual address space.

For arm64 in particular, I've just written https://github.com/corsix/LuaJIT/commits/arm64-jump-range/, which should remove the restriction entirely, though it'll still try to cluster things together in +/-127MB regions when possible. It takes a more complex approach than @mraleph's, on the grounds that a single arbitrary +/-127MB range would still be quite small (compared to +/-2GB).

Thanks for the patch, Peter.
I applied it to our branch last night and now I can finally see assembly jumps when enabling JIT dumps, so the mcode allocations don't fail anymore. I'll also try the dynamic-linking address-hint approach, but your patch already helps a lot.
It would be a pity to disable the nice JIT because of this :)
