Practical reference compression and implementation #468
Replies: 4 comments 5 replies
-
OpenJ9 uses a carefully aligned pointer so that the compressed version is simply a shift by the right number of bits.
-
For Stack Allocation, we should split it into two cases:

- **Scalarization** is the better outcome and avoids needing to represent stack objects as objects at all (sidestepping the compressed refs issues). This is what HotSpot aims for with its Escape Analysis; it mostly avoids true stack allocation, partly because of the compressed refs issues and partly because it can't control the placement of the OS stacks, making it difficult to keep them in a suitable range of the heap.
- **True stack allocation** is what OpenJ9 uses, even with compressed refs, because it uses a separate Java stack and carefully controls the placement of those stacks (in the low 4GB of memory) so that the pointer compression scheme works. This also allows J9 to have stack-allocated objects that can freely point to both heap-allocated and other stack-allocated objects. True stack allocation (not scalarization) requires a lot of tradeoffs that ripple through and limit the rest of the design.

Are those limitations worth it given the target use cases? Can we get far enough with aggressive inlining and scalarization that stack allocation is unnecessary?
-
Azul is listed as the existence proof for using LLVM for a JVM. My understanding (and I wish I could find the link back) is that they have their own build of LLVM (called Orca?) that runs the standard LLVM optimizations on opaque pointers; then, once the pointers are converted into a more direct form, they limit the optimization passes that LLVM can run, as that representation is fragile. So the system groups optimizations into two sets of passes: the full set on opaque object pointers, and a minimal set on the fragile representation. We likely don't want to maintain and ship our own version of LLVM. Without a way to control the LLVM passes (which I think means a custom LLVM build with our own PassManager), we're somewhat constrained in how we can present pointers without having them "lost" to the GC.
-
I want to have a dedicated thread about reference/pointer compression in LLVM and how it might be accomplished.
At the time of writing, we're representing references as unsigned integer values. The actual value of a reference is presently just a pointer cast to an integer, so effectively we're using pointers as references (but with some extra complication, which may detrimentally impact LLVM optimizations).
The choice of using an integer type for references is based on an off-the-cuff assumption that it is necessary to do so, since there is no way to achieve a compressed pointer type in LLVM. So I'd like to explore two lines of thought: how far can we take the integer reference idea, and how feasible is it to establish support for compressed pointers on the LLVM side?
Integer references
There are (at least) two ways to look at integer references. In one view, it is an opaque set of bits that can be mathematically munged into a usable address, and vice-versa. In the other view, it is an index into one or more possible data structures in which any possible object is contained.
The current implementation could be viewed as an index into the array of possible addresses starting at address zero. This view is not well-supported by LLVM or by modern systems in general, which are generally a bit prickly about the semantics of things that are relative to address zero; thus, the reality is that we're using the "opaque bits" strategy along with `inttoptr`/`ptrtoint` - a strategy that has already caused some difficulties for us (and is in fact generally problematic; check out the llvm-dev discussions on the topic). The current implementation of integer references seems to have every disadvantage of every possible approach, with none of the advantages. Therefore, for the purposes of discussion, I'm going to ignore the current implementation beyond acknowledging that the code which supports this approach could potentially evolve into something better.

Something better
There are a few potentially interesting approaches to pointer compression in the wild. HotSpot generally uses a base address plus a bit-shifted smaller-than-a-pointer (32-bit in practice) offset. OpenJ9 uses a similar approach, but without a base pointer. The bit-shifting approach entails aligning all potential object addresses on a small power-of-two boundary (8 or 16, for example); practically speaking, this would be (at a minimum) the smallest power of two which is larger than the minimum size and alignment of an object. Using a larger shift may allow more objects on the heap before OOME, at the cost of more memory wastage due to excess padding between smaller or inconveniently-sized objects.

Implementing this algorithm using our current strategy is definitely possible. To get a usable pointer out of a reference value, we'd perform the arithmetic: zero-extend to the pointer size, bit-shift, and use various means to add to the base pointer (either convert the base pointer with `ptrtoint`, add, and convert back, or `getelementptr` against an `i8*` view of the base pointer). However, this suffers from many of the same problems we already have with relying on `inttoptr` and `ptrtoint`. It also does not consider stack allocations (which I'll cover later).

The bit shift of the compressed pointer is derived from the size of the object. In effect, it is an array index into an array of aligned, minimally-sized objects (where some of the objects may spill outside of their containers). LLVM appears to have better infrastructure for tracking array accesses than integer-to-pointer conversions and arithmetic; therefore, this strategy seems best employed as a `getelementptr` from the heap base, using the reference value as an index into the heap area.

Stack allocation
One of the more interesting ideas we've been discussing for a while now is stack allocation. Compressed reference values conflict with this idea in a few ways.
Firstly, the address range of a compressed pointer is significantly limited compared to a full 64-bit pointer. Assuming that compressed pointers are 32 bits, the absolute maximum number of objects is 2^32. However, once this is spread across multiple memory areas, the actual number could be significantly decreased. For example, if I had 10 threads each with 2MB stacks, the stacks alone would consume 2,621,440 potential object references (regardless of the depth of the current calling context), assuming they were perfectly packed together (i.e. no guard pages for stack overflow, among other things). Java applications frequently consist of dozens, hundreds, or even thousands of threads, contributing significantly to wastage.
Additionally, stacks must be allocated in close proximity to the heap and be aggressively recycled in order to maintain acceptable locality. Any unused memory in between stacks or the heap areas would consume reference space.
Overall, this wastage seems unreasonable given that a stack reference would necessarily be invisible outside of the current thread.
One possible strategy to avoid this kind of wastage could be to divide the reference index space into two sub-spaces: stack and heap. The simplest division would split the space in half. This would limit the maximum number of heap objects to 2^31; however, it would also allow for effectively unlimited stack objects (up to 2^31 stack objects per thread).
Dividing the space means that there would be two base pointers to consider: the global heap base pointer, and the thread-local stack base pointer. One bit of the reference value would be examined to determine which space to build from. On most CPUs, there is special circuitry for testing the high (sign) bit of a number, so this seems like a logical choice.
Pseudo-code for this calculation might look something like this:
Certain types might always be stack-allocated (or heap-allocated), thus this check could be simplified in those cases. In practice, if one of the two memory areas could be configured to grow downwards, the negation operation could be elided and negative numbers could be used for that area. CPUs with conditional instructions and integrated shifters could potentially optimize this computation down to three or maybe even two instructions.
LLVM side
Another possible approach would involve introducing some scheme of pointer compression on the LLVM side of things (for example, adding the ability to configure one or more address spaces to use a compression algorithm on pointers, such that pointers can be converted to and from that address space using `addrspacecast ... to`).

Other approaches
I'm interested to hear if anyone has other ideas for a novel approach to pointer compression. Any thoughts or feedback on these ideas are also very welcome.