Practical reference compression and implementation #468
Replies: 4 comments 5 replies
-
OpenJ9 uses a carefully aligned pointer so that the compressed version is simply a shift by the right number of bits.
-
For Stack Allocation, we should split it into two cases:

- **Scalarization** is the better outcome and avoids needing to represent stack objects as objects at all (sidestepping the compressed refs issues). This is what HotSpot aims for with its Escape Analysis; it mostly avoids true stack allocation, partly because of the compressed refs issues and partly because it can't control the placement of the OS stacks, making it difficult to keep them in a suitable range of the heap.
- **True stack allocation** is what OpenJ9 uses, even with compressed refs, because it uses a separate Java stack and carefully controls the placement of those stacks (in the low 4GB of memory) so that the pointer compression scheme works. This also allows J9 to have stack-allocated objects that can freely point to both heap-allocated and other stack-allocated objects. True stack allocation (not scalarization) requires a lot of tradeoffs that ripple through and limit the rest of the design.

Are those limitations worth it given the target use cases? Can we get far enough with aggressive inlining and scalarization that stack allocation is unnecessary?
-
Azul is listed as the existence proof for using LLVM for a JVM. My understanding (and I wish I could find the link back) is that they have their own build of LLVM (called Orca?) that runs the standard LLVM optimizations on opaque pointers; then, once the pointers are converted into a more direct form, they limit the optimization passes that LLVM can run, as that representation is fragile. So the system groups optimizations into two sets of passes: the full set on opaque object pointers, and a minimal set on the fragile representation. We likely don't want to maintain and ship our own version of LLVM. Without a way to control the LLVM passes (which I think means a custom LLVM build with our own PassManager), we're somewhat constrained in how we can present pointers without having them "lost" to the GC.
-
I want to have a dedicated thread about reference/pointer compression in LLVM and how it might be accomplished.
At the time of writing, we're representing references as unsigned integer values. The actual value of a reference is presently just a pointer cast to an integer, so effectively we're using pointers as references (but with some extra complication, which may detrimentally impact LLVM optimizations).
The choice of using an integer type for references is based on an off-the-cuff assumption that it is necessary to do so, since there is no way to achieve a compressed pointer type in LLVM. So I'd like to explore two lines of thought: how far can we take the integer reference idea, and how feasible is it to establish support for compressed pointers on the LLVM side?
Integer references
There are (at least) two ways to look at integer references. In one view, it is an opaque set of bits that can be mathematically munged into a usable address, and vice-versa. In the other view, it is an index into one or more possible data structures in which any possible object is contained.
The current implementation could be viewed as an index into the array of possible addresses starting at address zero. This view is not well-supported by LLVM or by modern systems in general, which are generally a bit prickly about the semantics of things that are relative to address zero; thus, the reality is that we're using the "opaque bits" strategy along with `inttoptr`/`ptrtoint` - a strategy that has already caused some difficulties for us (and is in fact generally problematic; check out the llvm-dev discussions on the topic). The current implementation of integer references seems to have every disadvantage of every possible approach, with none of the advantages. Therefore, for the purposes of discussion, I'm going to ignore the current implementation beyond acknowledging that the code which supports this approach could potentially evolve into something better.

Something better
There are a few potentially interesting approaches to pointer compression in the wild. HotSpot generally uses a base address plus a bit-shifted smaller-than-a-pointer (32-bit in practice) offset. OpenJ9 uses a similar approach, but without a base pointer. The bit-shifting approach entails aligning all potential object addresses on a small power-of-two boundary (8 or 16, for example); practically speaking, this would be (at a minimum) the smallest power of two which is larger than the minimum size and alignment of an object. Using a larger shift may allow more objects on the heap before OOME, at the cost of more memory wastage due to excess padding between smaller or inconveniently-sized objects.

Implementing this algorithm using our current strategy is definitely possible. To get a usable pointer out of a reference value, we'd perform the arithmetic: zero-extend to the pointer size, bit-shift, and use various means to add to the base pointer (either convert the base pointer with `ptrtoint`, add, and convert back, or `getelementptr` against an `i8*` view of the base pointer). However, this suffers from many of the same problems we already have with relying on `inttoptr` and `ptrtoint`. It also does not consider stack allocations (which I'll cover later).

The bit shift of the compressed pointer is derived from the size of the object. In effect, it is an array index into an array of aligned, minimally-sized objects (where some of the objects may spill outside of their containers). LLVM appears to have better infrastructure for tracking array accesses than integer-to-pointer conversions and arithmetic; therefore, this strategy seems best employed as a `getelementptr` from the heap base, using the reference value as an index into the heap area.

Stack allocation
One of the more interesting ideas we've been discussing for a while now is stack allocation. Compressed reference values conflict with this idea in a few ways.
Firstly, the address range of a compressed pointer is significantly limited compared to a full 64-bit pointer. Assuming that compressed pointers are 32 bits, the absolute maximum number of objects is 2^32. However, once this is spread across multiple memory areas, the actual number could be significantly decreased. For example, if I had 10 threads each with 2MB stacks, the stacks alone would consume 2,621,440 potential object references (regardless of the depth of the current calling context), assuming they were perfectly packed together (i.e. no guard pages for stack overflow, among other things). Java applications frequently consist of dozens, hundreds, or even thousands of threads, contributing significantly to wastage.
Additionally, stacks must be allocated in close proximity to the heap and be aggressively recycled in order to maintain acceptable locality. Any unused memory in between stacks or the heap areas would consume reference space.
Overall, this wastage seems unreasonable given that a stack reference would necessarily be invisible outside of the current thread.
One possible strategy to avoid this kind of wastage could be to divide the reference index space into two sub-spaces: stack and heap. The simplest division would split the space in half. This would limit the maximum number of heap objects to 2^31; however, it would also allow for effectively unlimited stack objects (up to 2^31 stack objects per thread).
Dividing the space means that there would be two base pointers to consider: the global heap base pointer, and the thread-local stack base pointer. One bit of the reference value would be examined to determine which space to build from. On most CPUs, there is special circuitry for testing the high (sign) bit of a number, so this seems like a logical choice.
Pseudo-code for this calculation might look something like this:
Certain types might always be stack-allocated (or heap-allocated), thus this check could be simplified in those cases. In practice, if one of the two memory areas could be configured to grow downwards, the negation operation could be elided and negative numbers could be used for that area. CPUs with conditional instructions and integrated shifters could potentially optimize this computation down to three or maybe even two instructions.
LLVM side
Another possible approach would involve introducing some scheme of pointer compression on the LLVM side of things (for example, adding the ability to configure one or more address spaces to use a compression algorithm on pointers, such that pointers can be converted to and from that address space using `addrspacecast ... to`).

Other approaches
I'm interested to hear if anyone has other ideas for a novel approach to pointer compression. Any thoughts or feedback on these ideas are also very welcome.