New GC #1058

Open · wants to merge 2 commits into base: v2.1
Conversation


@achaulk commented Aug 29, 2023

Fixes #38

This is a type-segregated, dense-bitmap, arena-based allocator.

Ephemeron tables are supported. __gc tables are also supported for any table allocated by table.newgc(). There are non-trivial performance impacts from allowing any table to have __gc set, and even more so if it can be set at any time, but it is easy enough to provide a special arena. The problem comes from the sweep simply not touching non-traversed objects at all: the header sweep marks them free, and if __gc could be set at any time they would all have to be inspected, just in case. setmetatable() could do the check, but it would have to do it even in compiled code.
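
As a rough illustration of that cost argument (a hypothetical sketch, not code from the patch; all names here are made up), a bitmap sweep can free unmarked objects without ever dereferencing them, whereas supporting __gc on arbitrary tables would force the sweep to visit every dead object:

```c
#include <stdint.h>

/* Hypothetical sketch: sweeping an arena by bitmap never touches dead
 * objects, which is what makes it cheap. Layout and sizes are illustrative. */
typedef struct Arena {
  uint64_t block[64]; /* allocation bits, one per cell */
  uint64_t mark[64];  /* mark bits set by traversal */
} Arena;

static void sweep_arena(Arena *a)
{
  for (int i = 0; i < 64; i++) {
    /* Unmarked cells become free; the objects themselves are never read. */
    a->block[i] &= a->mark[i];
    a->mark[i] = 0;
  }
}

/* If any table could gain __gc at any time, the sweep would instead have to
 * inspect each unmarked object for a finalizer before freeing it, pulling
 * every dead object into cache. A dedicated __gc arena (tables created via
 * table.newgc()) confines that check to one place. */
```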

__gc behaviour generally matches 5.4: a finalizer will be called once, and only if the object is then permanently resurrected by being marked (even if it is temporarily resurrected multiple times to run other finalizers) can it be called again. However, finalizers are called in unspecified order.

Objects generally retain the same flags as before, but use a black-black-gray scheme instead of white-white-black. This does make barriers more expensive and makes it difficult to separate white/new from white/dead, but it allows all objects to change from black to white for free, and it also avoids pulling arena headers into cache during mutator execution.
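
One way to read the "black to white for free" property (this is a guess at the mechanism, not taken from the patch): with two black encodings, the collector can flip which value counts as "marked this cycle", so survivors are whitened without per-object work:

```c
#include <stdint.h>

/* Hypothetical polarity-flip sketch. 'cur_black' selects which of two black
 * flag values means "marked this cycle"; flipping it at cycle end makes every
 * previously black object logically white without touching it. */
#define FLAG_BLACK0 0x01
#define FLAG_BLACK1 0x02
#define FLAG_GRAY   0x04

typedef struct GCState { uint8_t cur_black; } GCState;

static int is_marked(const GCState *g, uint8_t flags)
{
  return (flags & g->cur_black) != 0;
}

static void finish_cycle(GCState *g)
{
  /* All objects carrying the old black flag are now logically white. */
  g->cur_black = (g->cur_black == FLAG_BLACK0) ? FLAG_BLACK1 : FLAG_BLACK0;
}
```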

Data colocation is no longer guaranteed, but is done when possible. It could still be guaranteed, at an allocation performance penalty, since the allocator would have to scan for a suitable slot. Non-colocated data is compacted when arena utilization is low and it is safe to do so. Small udata colocation is guaranteed, to support internal usage that assumes an attached payload.

Only some objects are converted. Strings aren't, because of spec requirements for stable pointers handed to C, and because the design of the string table makes the arena somewhat pointless. Threads and prototypes are unlikely to exist in quantities large enough to justify an arena. cdata probably requires an implementation that preserves their stable-pointer and always-colocated properties. Traces require immovable IR, but could do something like udata does to provide that.

Still needs non-x64 support. 32-bit platforms will have problems because of the layered bitmap design: a 32/32 layout naturally works out to a 16 kB arena size (1024 cells at 16-byte granularity), which is probably a little small.

Assumes an intrinsic tzcount() or equivalent. Also assumes a 256-bit SIMD intrinsic is available and that the makefile is changed accordingly, though that could be selected at runtime with a fallback scalar implementation.
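
For context, a minimal sketch (not from the patch) of the kind of intrinsic usage this implies: iterating set bits of a bitmap word with tzcnt, and testing 256 bits of bitmap at a time with AVX. Built with something like -mavx2 -mbmi; names and layout are illustrative.

```c
#include <stdint.h>
#include <immintrin.h>

/* Iterate over the set bits of one 64-bit bitmap word using tzcnt (BMI1). */
static void for_each_set_bit(uint64_t word, void (*fn)(unsigned bit))
{
  while (word) {
    unsigned bit = (unsigned)_tzcnt_u64(word);
    fn(bit);
    word &= word - 1;  /* clear the lowest set bit */
  }
}

/* Check whether any of 256 bitmap bits are set, four 64-bit words at once. */
static int any_bits_set256(const uint64_t bm[4])
{
  __m256i v = _mm256_loadu_si256((const __m256i *)bm);
  return !_mm256_testz_si256(v, v);  /* nonzero if any bit is set */
}
```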

@achaulk marked this pull request as ready for review September 6, 2023 05:46
@MarioSieg

What's the status here? @MikePall

@achaulk
Author

achaulk commented Dec 30, 2023

> What's the status here? @MikePall

Probably on indefinite hold

This can't land in 2.1. Aside from breaking non-x64 platforms as-is, it's a bad idea for a production branch to make a huge change like this. I'm keeping this branch rebased on top of 2.1 so long as there aren't major conflicts, but that's largely for my own use right now.

It could land in 3.0, if it existed, and people were going to do the other platform work. I'm not familiar enough with non-x86 assembly to do that confidently, and I don't have hardware to test anyway.

This is also about two orders of magnitude larger than Mike seems to want patches to be. It implements what is, as far as I can tell, a novel algorithm with no papers or examples to compare it to, the patch is quite complex overall, and I am effectively a rando who has never submitted a patch before, much less a huge one. So if it ever does get to the point of being reviewed, it will take a while.

And that assumes minimal disagreement about the various design changes I've made: data relocation, data non-colocation, SIMD/intrinsics usage, ephemeron tables, how to do __gc, the change of __gc to running in unspecified order, what to do about the algorithm being worse on 32-bit, etc.

@Frityet

Frityet commented Jan 17, 2024

How usable is this standalone? I have a performance-critical project that will only run on x86_64; is this usable?

@Brugarolas

It's usable. I have a custom LuaJIT runtime (with this GC, a CMake config, Mimalloc, some hand-crafted assembly memset & memcpy functions for Mimalloc, an event loop with Libuv, and 3-4 micro-optimizations) that I put together just because I was bored, and so far no problems.

But:

  1. I have not tested it in real battle, only with benchmarks.
  2. Don't expect a massive performance gain; in my tests it showed about a 5-6% improvement (with Mimalloc, etc.). Maybe in real applications the improvements are bigger, but not in benchmarks.
  3. I don't know if it's this new GC's fault or any of the other new stuff I've added, but you'll definitely want to disable some compiler optimization flags like loop unrolling, the loop vectorizer, or in general anything loop-related if you don't want random SIGSEGV crashes. I didn't find which flag was causing the error, but it was definitely something loop-related (LLVM; I did not test with GCC).
  4. You'll want to compile with the AVX, AVX2, BMI and (I'm not sure about the last one) BMI2 instruction sets enabled. They are mandatory. Though I guess you are already compiling with -march=native or something, so this shouldn't be a problem.

Now a couple of unsolicited tips:

  • Seriously, try Mimalloc and Folly's memset & memcpy; I suspect that most of my performance improvements come from there. Avoiding the PLT when calling glibc alone makes it worth a try.
  • In this PR's comments and on the author's page there are more optimizations you may want to check out.

@Frityet

Frityet commented Jan 18, 2024

Thank you! Is the source of your custom runtime available?

@achaulk
Author

achaulk commented Jan 24, 2024

A few points:

AVX, AVX2 and BMI are hard requirements for now. BMI2 is not, but aside from maybe some VIA chips there are no processors that have AVX2 and not BMI2, so why not.

In theory, bsf can be dropped in for tzcount in almost every place it's used, and SSE4.1 + popcount is likely sufficient for a 128-bit version, but that isn't implemented yet.
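
For what it's worth, a hedged sketch of the drop-in idea (not part of this patch): for non-zero inputs, bsf and tzcnt return the same bit index, so a compiler builtin can stand in where BMI1 is unavailable (GCC/Clang shown; names are illustrative):

```c
#include <stdint.h>
#if defined(__BMI__)
#include <immintrin.h>
#endif

/* Count trailing zeros of a non-zero 64-bit word. With BMI1 this is tzcnt;
 * without it, the builtin lowers to bsf, which gives the same answer as long
 * as x != 0, so callers must guarantee a non-zero argument. */
static inline unsigned ctz64(uint64_t x)
{
#if defined(__BMI__)
  return (unsigned)_tzcnt_u64(x);
#else
  return (unsigned)__builtin_ctzll(x);
#endif
}
```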

The maximum possible speedup is going to depend on how much time is already spent inside the allocator / GC (ignoring allocated-object address cache effects). The more garbage created and the higher the object count, the bigger the benefit, because allocation, traversal and deallocation are substantially cheaper now. On the other hand, very small object counts can potentially hurt: there's no difference between sweeping 1 object and 1000 objects, and even a linked-list traversal can be fast if it all fits in cache.

I do have other patches on my fork, but they are very unstable in every sense of the word.

This branch is stable-ish in that I do run tests and have fixed all the issues I've run into, but there could still be some I haven't seen. I generally test MSVC and GCC/Linux with default flags.

Anyway, the string patch is getting close; cdata is next.

@achaulk
Author

achaulk commented Mar 11, 2024

The string patch is mostly solid. Performance is great for large numbers of strings, mediocre for small numbers or when ephemeral strings vastly outnumber persistent strings; the fixed cost of sweep + free for two arenas probably outweighs the per-item savings for most benchmarks. At large counts, lookup/insert is broadly similar, allocation is faster, sweep is faster if most strings live, and removing a string is somewhat slower.

Arbitrary fixing of strings is still supported for all strings. Strings retain existing semantics: they are immovable, with the payload located at str+1. All strings are now guaranteed to be 16-byte aligned and sized.

Strings of 15 bytes or less are allocated from a bitmap like other types.
Strings of 16 bytes or more are allocated from essentially the arena design of the original proposal.
Strings at or above LJ_HUGE_STR_THRESHOLD are allocated one per huge arena.
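
A hedged sketch of the size-class split described above (LJ_HUGE_STR_THRESHOLD is named in the comment; everything else here is illustrative, not the patch's code):

```c
#include <stddef.h>

enum StrClass { STR_TINY, STR_MEDIUM, STR_HUGE };

/* 'huge_threshold' stands in for LJ_HUGE_STR_THRESHOLD. */
static enum StrClass str_class(size_t len, size_t huge_threshold)
{
  if (len <= 15) return STR_TINY;              /* bitmap arena, like other types */
  if (len < huge_threshold) return STR_MEDIUM; /* original-proposal arena design */
  return STR_HUGE;                             /* one string per huge arena */
}
```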

The stringtable has been rewritten. The new stringtable is a hash table of blocks that chain directly, instead of a single chain through string headers. Chained blocks are allocated from arenas. Each string contains a unique ID for its block, and the stringtable contains (hash, ptr) for each string. Each block occupies 3 cache lines (one of keys, two of data) and uses SIMD lookups - it's not really much slower to test 64 bytes than it is to test 4.
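
A rough sketch of what such a SIMD block probe could look like (the block layout and field names are guesses based on the "one keys, two data" description, not the patch's actual structures): one cache line of 16 packed 32-bit hashes compared with two AVX2 compares, yielding a mask of candidate slots.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical block: 64 bytes of hashes + 128 bytes of pointers = 3 cache lines. */
typedef struct StrBlock {
  uint32_t hash[16];  /* 1 cache line of keys */
  void    *str[16];   /* 2 cache lines of data (x64 pointers) */
} StrBlock;

/* Return a 16-bit mask of slots whose stored hash equals 'h'. */
static unsigned block_match(const StrBlock *b, uint32_t h)
{
  __m256i want = _mm256_set1_epi32((int)h);
  __m256i lo = _mm256_loadu_si256((const __m256i *)&b->hash[0]);
  __m256i hi = _mm256_loadu_si256((const __m256i *)&b->hash[8]);
  unsigned mlo = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(lo, want)));
  unsigned mhi = (unsigned)_mm256_movemask_ps(_mm256_castsi256_ps(_mm256_cmpeq_epi32(hi, want)));
  return mlo | (mhi << 8);
}

/* A caller would walk the mask with tzcnt and compare the candidate strings'
 * contents before declaring a hit. */
```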

Sweeping no longer iterates the table; instead it only removes entries for freed strings, based on arena bits. Tiny strings are generally collected lazily, on the theory that we delay touching object memory until we are going to write to it anyway. The downside of this design is that if many strings are freed relative to the total count, the hashtable work that was implicit and free in the list traversal becomes explicit here.

There is a theoretical maximum string count of about 2.7 billion - it was impractical to have anywhere near this many strings in the previous design, given the expected cache-miss-per-string traversal time. This could be raised by using the unused byte, but it's not worth the complexity.

There are a few optimizations relating to compacting the stringtable entries that aren't yet implemented.

Development
Successfully merging this pull request may close these issues: New Garbage Collector
5 participants