New GC #1058

Open · wants to merge 2 commits into base: v2.1
Conversation


@achaulk commented Aug 29, 2023

Fixes #38

This is a type-segregated, dense-bitmap, arena-based allocator.

Ephemeron tables are supported. __gc tables are also supported for any table allocated by table.newgc(). There are non-trivial performance impacts from allowing any table to have __gc set, and even more so if it can be set at any time, but it is easy enough to provide a special arena. The problem comes from the sweep simply not touching non-traversed objects at all: the header sweep marks them free, and if __gc could be set at any time they would all have to be inspected, just in case. setmetatable() could do the check, but it would have to do it even in compiled code.
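
As a rough illustration of that cost argument (a hypothetical sketch, not code from the patch; all names here are made up), a bitmap sweep can free unmarked objects without ever dereferencing them, whereas supporting __gc on arbitrary tables would force the sweep to visit every dead object:

```c
#include <stdint.h>

/* Hypothetical sketch: sweeping an arena by bitmap never touches dead
 * objects, which is what makes it cheap. Layout and sizes are illustrative. */
typedef struct Arena {
  uint64_t block[64]; /* allocation bits, one per cell */
  uint64_t mark[64];  /* mark bits set by traversal */
} Arena;

static void sweep_arena(Arena *a)
{
  for (int i = 0; i < 64; i++) {
    /* Unmarked cells become free; the objects themselves are never read. */
    a->block[i] &= a->mark[i];
    a->mark[i] = 0;
  }
}

/* If any table could gain __gc at any time, the sweep would instead have to
 * inspect each unmarked object for a finalizer before freeing it, pulling
 * every dead object into cache. A dedicated __gc arena (tables created via
 * table.newgc()) confines that check to one place. */
```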

__gc behaviour generally matches 5.4: a finalizer will be called once, and only if the object is then permanently resurrected by being marked (even if it is temporarily resurrected multiple times to run other finalizers) can it be called again. However, finalizers are called in unspecified order.

Objects generally retain the same flags as before, but use a black-black-gray scheme instead of white-white-black. This does make barriers more expensive and makes it difficult to separate white/new from white/dead, but it allows all objects to change from black to white for free, and it also avoids pulling arena headers into cache during mutator execution.
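
One way to read the "black to white for free" property (this is a guess at the mechanism, not taken from the patch): with two black encodings, the collector can flip which value counts as "marked this cycle", so survivors are whitened without per-object work:

```c
#include <stdint.h>

/* Hypothetical polarity-flip sketch. 'cur_black' selects which of two black
 * flag values means "marked this cycle"; flipping it at cycle end makes every
 * previously black object logically white without touching it. */
#define FLAG_BLACK0 0x01
#define FLAG_BLACK1 0x02
#define FLAG_GRAY   0x04

typedef struct GCState { uint8_t cur_black; } GCState;

static int is_marked(const GCState *g, uint8_t flags)
{
  return (flags & g->cur_black) != 0;
}

static void finish_cycle(GCState *g)
{
  /* All objects carrying the old black flag are now logically white. */
  g->cur_black = (g->cur_black == FLAG_BLACK0) ? FLAG_BLACK1 : FLAG_BLACK0;
}
```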

Data colocation is no longer guaranteed, but is done when possible. It could still be guaranteed, at an allocation performance penalty, since the allocator would have to scan for a suitable slot. Non-colocated data is compacted when arena utilization is low and it is safe to do so. Small udata colocation is guaranteed, to support internal usage that assumes an attached payload.

Only some objects are converted. Strings aren't, because of spec requirements for stable pointers handed to C, and because the design of the string table makes the arena somewhat pointless. Threads and prototypes are unlikely to exist in quantities large enough to justify an arena. cdata probably requires an implementation that preserves their stable-pointer and always-colocated properties. Traces require immovable IR, but could do something like udata does to provide that.

Still needs non-x64 support. 32-bit platforms will have problems because of the layered bitmap design: a 32/32 layout naturally works out to a 16 kB arena size (1024 cells at 16-byte granularity), which is probably a little small.

Assumes an intrinsic tzcount() or equivalent. Also assumes a 256-bit SIMD intrinsic is available and that the makefile is changed accordingly, though that could be selected at runtime with a fallback scalar implementation.
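
For context, a minimal sketch (not from the patch) of the kind of intrinsic usage this implies: iterating set bits of a bitmap word with tzcnt, and testing 256 bits of bitmap at a time with AVX. Built with something like -mavx2 -mbmi; names and layout are illustrative.

```c
#include <stdint.h>
#include <immintrin.h>

/* Iterate over the set bits of one 64-bit bitmap word using tzcnt (BMI1). */
static void for_each_set_bit(uint64_t word, void (*fn)(unsigned bit))
{
  while (word) {
    unsigned bit = (unsigned)_tzcnt_u64(word);
    fn(bit);
    word &= word - 1;  /* clear the lowest set bit */
  }
}

/* Check whether any of 256 bitmap bits are set, four 64-bit words at once. */
static int any_bits_set256(const uint64_t bm[4])
{
  __m256i v = _mm256_loadu_si256((const __m256i *)bm);
  return !_mm256_testz_si256(v, v);  /* nonzero if any bit is set */
}
```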

@achaulk marked this pull request as ready for review September 6, 2023 05:46
@MarioSieg

What's the status here? @MikePall

@achaulk
Author

achaulk commented Dec 30, 2023

> What's the status here? @MikePall

Probably on indefinite hold

This can't land in 2.1. Aside from breaking non-x64 platforms as-is, it's a bad idea for a production branch to make a huge change like this. I'm keeping this branch rebased on top of 2.1 so long as there aren't major conflicts, but that's largely for my own use right now.

It could land in 3.0, if it existed, and people were going to do the other platform work. I'm not familiar enough with non-x86 assembly to do that confidently, and I don't have hardware to test anyway.

This is also about two orders of magnitude larger than Mike seems to want patches to be. It implements what is, as far as I can tell, a novel algorithm with no papers or examples to compare it to, the patch is quite complex overall, and I am effectively a rando who has never submitted a patch before, much less a huge one. So if it ever does get to the point of being reviewed, it will take a while.

And that assumes minimal disagreement about the various design changes I've made: data relocation, data non-colocation, SIMD/intrinsics usage, ephemeron tables, how to do __gc, the change of __gc to running in unspecified order, what to do about the algorithm being worse on 32-bit, etc.

@Frityet

Frityet commented Jan 17, 2024

How usable is this standalone? I have a performance-critical project that will only run on x86_64; is this usable?

@Brugarolas

It's usable. I have a custom LuaJIT runtime (with this GC, a CMake config, Mimalloc, some hand-crafted assembly memset & memcpy functions for Mimalloc, an event loop with Libuv, and 3-4 micro-optimizations) that I put together just because I was bored, and so far no problems.

But:

  1. I have not tested it in real battle, only with benchmarks.
  2. Don't expect a massive performance gain; in my tests it showed about a 5-6% improvement (with Mimalloc, etc.). Maybe in real applications the improvements are bigger, but not in benchmarks.
  3. I don't know if it's this new GC's fault or any of the other new stuff I've added, but you'll definitely want to disable some compiler optimization flags like loop unrolling, the loop vectorizer, or in general anything loop-related if you don't want random SIGSEGV crashes. I didn't find which flag was causing the error, but it was definitely something loop-related (LLVM; I did not test with GCC).
  4. You'll want to compile with the AVX, AVX2, BMI and (I'm not sure about the last one) BMI2 instruction sets enabled. They are mandatory. Though I guess you are already compiling with -march=native or something, so this shouldn't be a problem.

Now a couple of unsolicited tips:

  • Seriously, try Mimalloc and Folly's memset & memcpy; I suspect that most of my performance improvements come from there. Avoiding the PLT when calling glibc alone makes it worth a try.
  • In this PR's comments and on the author's page there are more optimizations you may want to check out.

@Frityet

Frityet commented Jan 18, 2024

Thank you! Is the source of your custom runtime available?

@achaulk
Author

achaulk commented Jan 24, 2024

A few points:

AVX, AVX2 and BMI are hard requirements for now. BMI2 is not, but aside from maybe some VIA chips there are no processors that have AVX2 and not BMI2, so why not.

In theory, bsf can be dropped in for tzcount in almost every place it's used, and SSE4.1 + popcount is likely sufficient for a 128-bit version, but that isn't implemented yet.
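
For what it's worth, a hedged sketch of the drop-in idea (not part of this patch): for non-zero inputs, bsf and tzcnt return the same bit index, so a compiler builtin can stand in where BMI1 is unavailable (GCC/Clang shown; names are illustrative):

```c
#include <stdint.h>
#if defined(__BMI__)
#include <immintrin.h>
#endif

/* Count trailing zeros of a non-zero 64-bit word. With BMI1 this is tzcnt;
 * without it, the builtin lowers to bsf, which gives the same answer as long
 * as x != 0, so callers must guarantee a non-zero argument. */
static inline unsigned ctz64(uint64_t x)
{
#if defined(__BMI__)
  return (unsigned)_tzcnt_u64(x);
#else
  return (unsigned)__builtin_ctzll(x);
#endif
}
```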

The maximum possible speedup is going to depend on how much time is already spent inside the allocator / GC (ignoring allocated-object address cache effects). The more garbage created and the higher the object count, the bigger the benefit, because allocation, traversal and deallocation are substantially cheaper now. On the other hand, very small object counts can potentially hurt: there's no difference between sweeping 1 object and 1000 objects, and even a linked-list traversal can be fast if it all fits in cache.

I do have other patches on my fork, but they are very unstable in every sense of the word.

This branch is stable-ish in that I do run tests and have fixed all the issues I've run into, but there could still be some I haven't seen. I generally test MSVC and GCC/Linux with default flags.

Anyway, the string patch is getting close; cdata is next.

@achaulk
Author

achaulk commented Mar 11, 2024

The string patch is mostly solid. Performance is great for large numbers of strings, mediocre for small numbers or when ephemeral strings vastly outnumber persistent strings; the fixed cost of sweep + free for two arenas probably outweighs the per-item savings for most benchmarks. At large counts, lookup/insert is broadly similar, allocation is faster, sweep is faster if most strings live, and removing a string is somewhat slower.

Arbitrary fixing of strings is still supported for all strings. Strings retain existing semantics: they are immovable, with the payload located at str+1. All strings are now guaranteed to be 16-byte aligned and sized.

Strings of 15 bytes or less are allocated from a bitmap like other types.
Strings of 16 bytes or more are allocated from essentially the arena design of the original proposal.
Strings at or above LJ_HUGE_STR_THRESHOLD are allocated one per huge arena.
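
A hedged sketch of the size-class split described above (LJ_HUGE_STR_THRESHOLD is named in the comment; everything else here is illustrative, not the patch's code):

```c
#include <stddef.h>

enum StrClass { STR_TINY, STR_MEDIUM, STR_HUGE };

/* 'huge_threshold' stands in for LJ_HUGE_STR_THRESHOLD. */
static enum StrClass str_class(size_t len, size_t huge_threshold)
{
  if (len <= 15) return STR_TINY;              /* bitmap arena, like other types */
  if (len < huge_threshold) return STR_MEDIUM; /* original-proposal arena design */
  return STR_HUGE;                             /* one string per huge arena */
}
```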

The stringtable has been rewritten. The new stringtable is a hash table of blocks that chain directly, instead of a single chain through string headers. Chained blocks are allocated from arenas. Each string contains a unique ID for its block, and the stringtable contains (hash, ptr) for each string. Each block occupies 3 cache lines (one of keys, two of data) and uses SIMD lookups - it's not really much slower to test 64 bytes than it is to test 4.
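
A rough sketch of what such a SIMD block probe could look like (the block layout and field names are guesses based on the "one keys, two data" description, not the patch's actual structures): one cache line of 16 packed 32-bit hashes compared with two AVX2 compares, yielding a mask of candidate slots.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical block: 64 bytes of hashes + 128 bytes of pointers = 3 cache lines. */
typedef struct StrBlock {
  uint32_t hash[16];  /* 1 cache line of keys */
  void    *str[16];   /* 2 cache lines of data (x64 pointers) */
} StrBlock;

/* Return a 16-bit mask of slots whose stored hash equals 'h'. */
static unsigned block_match(const StrBlock *b, uint32_t h)
{
  __m256i want = _mm256_set1_epi32((int)h);
  __m256i lo = _mm256_loadu_si256((const __m256i *)&b->hash[0]);
  __m256i hi = _mm256_loadu_si256((const __m256i *)&b->hash[8]);
  unsigned mlo = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(lo, want)));
  unsigned mhi = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(hi, want)));
  return mlo | (mhi << 8);
}

/* A caller would walk the mask with tzcnt and compare the candidate strings'
 * contents before declaring a hit. */
```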

Sweeping no longer iterates the table; instead it only removes entries for freed strings, based on arena bits. Tiny strings are generally collected lazily, on the theory that we delay touching object memory until we are going to write to it anyway. The downside of this design is that if many strings are freed relative to the total count, the hashtable work that was implicit and free in the list traversal becomes explicit here.

There is a theoretical maximum string count of about 2.7 billion - it was impractical to have anywhere near this many strings in the previous design, given the expected cache-miss-per-string traversal time. This could be raised by using the unused byte, but it's not worth the complexity.

There are a few optimizations relating to compacting the stringtable entries that aren't yet implemented.

Development
Successfully merging this pull request may close these issues: New Garbage Collector
5 participants