New GC #1058
base: v2.1
Conversation
force-pushed from 1bef7b0 to 000e4a2
force-pushed from 070812d to 4d28e93
force-pushed from a2560f6 to 592c652
What's the status here? @MikePall
Probably on indefinite hold. This can't land in 2.1: aside from breaking non-x64 platforms as-is, it's a bad idea for a production branch to take a huge change like this. I'm keeping this branch rebased on top of 2.1 so long as there aren't major conflicts, but that's largely for my own use right now.

It could land in 3.0, if that existed and people were willing to do the other platform work. I'm not familiar enough with non-x86 assembly to do that confidently, and I don't have hardware to test on anyway.

This is also about two orders of magnitude larger than Mike seems to want patches to be, it implements an (AFAICT) novel algorithm with no papers or examples to compare it to, the patch is quite complex overall, and I am effectively a rando who has never submitted a patch before, much less a huge one, so if it ever does get to the point of being reviewed it will take a while. And that assumes minimal disagreement about the various design changes I've made: data relocation, data non-colocation, SIMD/intrinsics usage, ephemeron tables, how to do __gc, the change of __gc to running in unspecified order, what to do about the algorithm being worse on 32-bit, etc.
How usable is this standalone? I have a project that will only run on x86_64 and is performance-critical; is this usable?
It's usable. I have a custom LuaJIT runtime (with this GC, a CMake config, mimalloc, some hand-crafted assembly memset & memcpy functions for mimalloc, an event loop with libuv, and 3-4 micro-optimizations) that I built just because I was bored, and so far no problems. But:
Now a couple of unsolicited tips:
Thank you! Is the source of your custom runtime available?
A few points:

- AVX, AVX2 and BMI are hard requirements for now. BMI2 is not, but aside from maybe a VIA there are no processors that have AVX2 and not BMI2, so why not. In theory bsf can be dropped in for tzcnt almost everywhere it's used, and SSE4.1 + popcnt is likely sufficient for a 128-bit version, but that's not implemented yet.
- The maximum speedup possible depends on how much time is already spent inside the allocator/GC (ignoring allocated-object address cache effects). The more garbage created and the higher the object count, the bigger the benefit, because allocation, traversal and deallocation are substantially cheaper now. On the other hand, very small object counts can potentially hurt: there's no difference between sweeping 1 object and 1000 objects, and even a linked-list traverse can be fast if it all fits in cache.
- I do have other patches on my fork, but they are very unstable in every sense of the word.
- This branch is stable-ish in that I do run tests and have fixed all issues I've run into, but there could still be some issues I haven't seen. I generally test MSVC and GCC/Linux with default flags.

Anyway, the string patch is getting close; cdata is next.
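On the bsf-for-tzcnt point above: a portable trailing-zero count is easy to sketch. This is not code from the patch, just an illustration of why the substitution is mostly mechanical; the only wrinkle is that bsf (and the builtin below) leaves the zero-input case undefined, whereas tzcnt returns the operand width.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative portable tzcnt: compiler builtin (which maps to tzcnt/bsf
   on x86) where available, plain bit-twiddling otherwise. The name
   tzcount64 is made up for this sketch. */
static int tzcount64(uint64_t x)
{
#if defined(__GNUC__) || defined(__clang__)
    /* __builtin_ctzll is undefined for 0, so handle that case like tzcnt. */
    return x ? __builtin_ctzll(x) : 64;
#else
    if (x == 0) return 64;
    int n = 0;
    while (!(x & 1)) { x >>= 1; n++; }
    return n;
#endif
}
```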
force-pushed from c6c9997 to 32a315e
The string patch is mostly solid. Performance is great for large numbers of strings, mediocre for small numbers or when ephemeral strings vastly outnumber persistent strings; the fixed cost of sweep + free for two arenas probably outweighs the savings on a per-item basis for most benchmarks. At large counts, lookup/insert is broadly similar, allocation is faster, sweep is faster if most strings live, and removing a string is somewhat slower.

Arbitrary fixing of strings is still supported for all strings. Strings retain the existing semantics: immovable, with the payload located at str+1. All strings are now guaranteed 16-byte aligned/sized; strings <=15 bytes are allocated from a bitmap like other types.
The stringtable has been rewritten. The new stringtable is a hashtable of blocks that chain directly, rather than a single chain through string headers. Chained blocks are allocated from arenas. Strings contain a unique ID for their block, and the stringtable holds (hash, ptr) for each string. Each block occupies 3 cachelines (one for keys, two for data) and uses SIMD lookups: it's not really much slower to test 64 bytes than it is to test 4.

Sweeping no longer iterates the table; instead it only removes entries for freed strings, based on the arena bits. Tiny strings are generally collected lazily, on the theory that we delay accessing object memory until we are going to write to it anyway. The downside of this design is that if many strings are freed relative to the total count, the hashtable work that was implicit and free in the list traversal becomes explicit here.

There is a theoretical maximum string count of about 2.7 billion; it was impractical to have anywhere near this many strings in the previous design, given the expected cache-miss-per-string traversal time. This could be raised by using the unused byte, but it's not worth the complexity. There are a few optimizations relating to compacting the stringtable entries that aren't yet implemented.
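The block layout can be sketched as follows. This is an illustration of the "3 cachelines, one keys, two data" shape described above, not the actual layout from the patch: here one cacheline of 1-byte hash tags is paired with two cachelines of 16-bit entries, and a scalar loop stands in for the single 256-bit SIMD compare the real lookup would use. All names and field choices are assumptions.

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_KEYS 64

/* Hypothetical chained stringtable block: 64 bytes of hash tags
   (one cacheline) plus 64 x 16-bit entries (two cachelines), with a
   chain pointer to the next block. */
typedef struct StrBlock {
    uint8_t  tag[BLOCK_KEYS];   /* low byte of each string's hash */
    uint16_t id[BLOCK_KEYS];    /* per-string data, e.g. an index/ID */
    struct StrBlock *next;      /* next block in the chain */
} StrBlock;

/* Scalar stand-in for the SIMD probe: scan all 64 tags, confirm on ID.
   A real implementation would compare all 64 tag bytes with two 256-bit
   vector compares and walk the resulting match mask. */
static int block_find(const StrBlock *b, uint8_t tag, uint16_t id)
{
    for (int i = 0; i < BLOCK_KEYS; i++)
        if (b->tag[i] == tag && b->id[i] == id)
            return i;
    return -1;  /* caller would continue with b->next */
}
```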
Fixes #38
This is a type-segregated, dense-bitmap, arena-based allocator.
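The core allocation step in a bitmap arena reduces to finding a set bit. A minimal sketch, with names and the single-level bitmap invented for illustration (the patch uses a layered bitmap, per the notes below):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical arena allocation: freemap has one bit per fixed-size cell,
   1 = free. Find the first free cell with a trailing-zero count and claim
   it by clearing that bit. Uses the GCC/Clang builtin for tzcnt. */
static int arena_alloc_cell(uint64_t *freemap, int nwords)
{
    for (int w = 0; w < nwords; w++) {
        if (freemap[w]) {
            int bit = __builtin_ctzll(freemap[w]);
            freemap[w] &= freemap[w] - 1;   /* clear lowest set bit */
            return w * 64 + bit;            /* cell index within arena */
        }
    }
    return -1;  /* arena full */
}
```

Freeing is the inverse (set the bit back), which is why deallocation is cheap: no list surgery, no touching the object's neighbors.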
Ephemeron tables are supported. __gc tables are also supported, for any table allocated by table.newgc(). There are non-trivial performance impacts from allowing any table to have __gc set, even more so if it can be set at any time, but it is easy enough to provide a special arena. The problem comes from the design simply not touching non-traversed objects at all: the header sweep marks them free, and if __gc could be set at any time they would all have to be inspected, just in case. setmetatable() could do the check, but would have to do it even when compiled.
__gc behaviour generally matches 5.4: a finalizer will be called once, unless the object is permanently resurrected by being marked (even if it is temporarily resurrected multiple times to run other finalizers), in which case it can be called again. However, finalizers are called in unspecified order.
Objects generally retain the same flags as before, but use a black-black-gray scheme instead of white-white-black. This does make barriers more expensive, and makes it difficult to separate white/new from white/dead, but it allows all objects to change from black to white for free; it also avoids pulling arena headers into cache during mutator execution.
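The "black to white for free" property falls out of keeping mark state in per-arena bitmaps rather than per-object headers. A sketch of one sweep step over a single bitmap word, under the assumption (not from the patch) of an allocated-bits word and a mark-bits word per group of cells:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical sweep of one bitmap word: block = allocated cells,
   mark = cells reached during tracing. Anything allocated but unmarked
   is freed without ever touching the object's memory; survivors keep
   their allocation bit, and zeroing the mark word flips every survivor
   from black back to white in one store. */
static uint64_t sweep_word(uint64_t *block, uint64_t *mark)
{
    uint64_t freed = *block & ~*mark;  /* allocated but not marked */
    *block = *mark;                    /* only survivors stay allocated */
    *mark = 0;                         /* all survivors: black -> white */
    return freed;
}
```

This is also why the text above notes that sweeping 1 object costs the same as sweeping 1000: the unit of sweep work is the bitmap word, not the object.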
Data colocation is no longer guaranteed, but is done when possible. It could still be guaranteed, at an allocation performance penalty, since the allocator would have to scan. Non-colocated data is compacted when arena utilization is low and it is safe to do so. Small udata colocation is guaranteed, to support internal usage that assumes an attached payload.
Only some objects are converted. Strings aren't, because of spec requirements for stable pointers handed to C, and because the design of the string table makes an arena somewhat pointless. Threads and prototypes are unlikely to exist in quantities large enough to justify an arena. cdata probably requires an implementation that lets it keep its stable-pointer and always-colocated properties. Traces require immovable IR, but could do something like udata does to provide that.
Non-x64 support is still needed. 32-bit platforms will have problems because of the layered bitmap design: 32/32 naturally works out to a 16 kB arena size, which is probably a little small.
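One plausible reading of the 32/32 arithmetic, purely as an illustration (the actual arena geometry isn't spelled out here): a 32-bit top-level word selects one of 32 second-level bitmap words, each second-level bit covers one 16-byte cell, giving 32 x 32 = 1024 cells of 16 bytes, i.e. 16 kB per arena.

```c
#include <assert.h>

/* Assumed geometry for the 32/32 layered bitmap on a 32-bit target:
   one top word -> 32 leaf words -> 32 cells each, 16 bytes per cell. */
enum {
    WORD_BITS   = 32,
    CELL_SIZE   = 16,                       /* bytes, min object granule */
    ARENA_CELLS = WORD_BITS * WORD_BITS,    /* 1024 cells */
    ARENA_SIZE  = ARENA_CELLS * CELL_SIZE   /* 16384 bytes = 16 kB */
};
```

On x64 the same two-level scheme with 64-bit words would cover 64 x 64 x 16 B = 64 kB, which shows why the 32-bit variant comes out comparatively small.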
An intrinsic tzcount() or equivalent is assumed. A 256-bit SIMD intrinsic is also assumed to be available, with the makefile changed accordingly, but that could be selected at runtime, with a fallback scalar implementation.
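The runtime-selection idea could look like the following on GCC/Clang. This is a generic function-pointer dispatch sketch, not code from the branch; `__builtin_cpu_supports` is a real GCC/Clang builtin for x86, and the sweep function names are made up.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*sweep_fn)(void);

/* Placeholder implementations; the real ones would be the 256-bit SIMD
   sweep and a portable scalar equivalent. */
static void sweep_avx2(void)   { /* 256-bit vectorized path */ }
static void sweep_scalar(void) { /* portable fallback path */ }

/* Pick the sweep implementation once at startup based on CPU features,
   so the rest of the GC just calls through the pointer. */
static sweep_fn pick_sweep(void)
{
#if (defined(__GNUC__) || defined(__clang__)) && defined(__x86_64__)
    if (__builtin_cpu_supports("avx2"))
        return sweep_avx2;
#endif
    return sweep_scalar;
}
```

The usual caveat is that an indirect call per sweep batch is cheap, but per-object dispatch would not be, so the pointer would sit at the granularity of whole arena sweeps.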