Research task: Heap & Garbage Collection design proposal #444
-
I've recently been researching the feasibility of using compressed references with an LLVM codegen that uses statepoints for generating stack maps. Unfortunately, there is one particularly nasty issue I've encountered that can cause incorrect stack maps when using the obvious solution for this. Imagine the following Java code compiled with a compression scheme that simply truncates pointers to 32 bits (it's a bit naive, but it makes the example easier to illustrate):

```java
public class Test {
    public static Object a;
    public static Object b;

    public static void bar() { ... }

    public static void foo() {
        Object c = a;
        bar();
        b = c;
    }
}
```

The obvious way to generate IR for `foo` would be:

```llvm
define void @foo() {
  %0 = load i32, i32* @a
  %1 = inttoptr i32 %0 to i8 addrspace(1)*
  call void @bar()
  %2 = ptrtoint i8 addrspace(1)* %1 to i32
  store i32 %2, i32* @b
}
```

However, the LLVM optimizer can interfere with this and legally eliminate the pointer held across the call to `bar`:

```llvm
define void @foo() {
  %0 = load i32, i32* @a
  call void @bar()
  store i32 %0, i32* @b
}
```

Since this transformation can occur before RS4GC would be run, the collected pointer will no longer be visible when stack maps are generated (after all, it now appears to be a plain integer).
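To make the scheme concrete, here is roughly what the naive truncation compression looks like at the runtime level. This is a C sketch under the assumption of a 64-bit process whose heap fits in the low 4 GiB; the function names are illustrative, not from the actual codebase. The round trip is nothing but integer casts, which is exactly why an optimizer is entitled to forward the loaded integer straight to the store:

```c
#include <stdint.h>

/* Naive compression: truncate the pointer to its low 32 bits. This only
 * round-trips correctly if every heap object lives below 4 GiB. */
static uint32_t compress_ref(void *p) {
    return (uint32_t)(uintptr_t)p;
}

static void *decompress_ref(uint32_t c) {
    return (void *)(uintptr_t)c;
}
```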
-
There's been an ongoing discussion (in LLVM chats) about …
-
While you could do that for decompression, to perform compression you'd need to perform pointer subtraction, which AFAIK unavoidably involves a `ptrtoint`.

The fundamental issue here is that the LLVM optimizer needs to treat the loads/stores as atomic operations returning/taking an uncompressed pointer.
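For reference, a heap-base-relative scheme (the kind that actually needs that pointer subtraction) might look like the following C sketch. The names and the null handling are illustrative assumptions, not taken from any existing implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical base-plus-offset compressed references. Compression is a
 * pointer subtraction (a ptrtoint in LLVM IR terms); decompression is an
 * addition. Offset 0 is reserved for null. */
static uintptr_t heap_base;

static uint32_t compress_ref(void *p) {
    return p == NULL ? 0 : (uint32_t)((uintptr_t)p - heap_base);
}

static void *decompress_ref(uint32_t c) {
    return c == 0 ? NULL : (void *)(heap_base + c);
}
```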
-
In your original example, neither …
-
I'm not quite sure I understand exactly what you're referring to (the value of …):

```java
public class Test {
    public static volatile Object a;

    public static void foo() {
        Object aOld = a;
        a = null;
        for (int i = 0; i < 1000000; i++) {
            System.out.println(new Object());
        }
        a = aOld;
    }

    public static void main(String[] args) {
        a = new Object();
        foo();
        System.out.println(a);
    }
}
```

In this case, the original value of `a` …
-
You're right, I read too quickly and misparsed the text, sorry about that.
I see, that makes sense.
-
So I've come up with a potential solution that will allow compressed references to work, although it's not an ideal solution by any means. The idea is to force LLVM to treat the reference loads/stores as being atomic with their respective decompression/compression sequences until just before RS4GC by using "intrinsic-like" functions. In the main bitcode file, two extra functions would be declared:

```llvm
attributes #0 = { nounwind argmemonly readonly "gc-leaf-function" }
declare i8 addrspace(1)* @qcc.loadref(i32* nocapture) #0

attributes #1 = { nounwind argmemonly writeonly "gc-leaf-function" }
declare void @qcc.storeref(i8 addrspace(1)* readnone, i32* nocapture) #1
```

These functions would be used in place of normal loads/stores and have their attributes set to allow LLVM to make as many optimizations as possible without allowing any of the unsafe optimizations that would allow …. Just before RS4GC is run, the declarations would be replaced by the following definitions:

```llvm
attributes #0 = { alwaysinline nounwind argmemonly readonly "gc-leaf-function" }
define linkonce i8 addrspace(1)* @qcc.loadref(i32* nocapture %ptr) #0 {
  %compressed = load i32, i32* %ptr
  %decompressed = inttoptr i32 %compressed to i8 addrspace(1)*
  ret i8 addrspace(1)* %decompressed
}

attributes #1 = { alwaysinline nounwind argmemonly writeonly "gc-leaf-function" }
define linkonce void @qcc.storeref(i8 addrspace(1)* readnone %decompressed, i32* nocapture %ptr) #1 {
  %compressed = ptrtoint i8 addrspace(1)* %decompressed to i32
  store i32 %compressed, i32* %ptr
  ret void
}
```

Importantly, these new definitions carry the additional `alwaysinline` attribute, so the compression/decompression sequences only become visible to LLVM once the unsafe optimizations can no longer run. Doing this seems to allow compressed references to be used while preventing LLVM from making unsafe optimizations that would result in stack maps missing collected references, and without interfering too heavily with its ability to optimize functions.
-
I don't have much LLVM experience, and definitely not with using it for a GC, but it seems to me that …. The example here looks very similar to the example in https://llvm.org/docs/Statepoints.html#explicit-representation. What am I missing?
-
Typically, the explicit representation of statepoints isn't used until after a number of other optimizations have run, as having the statepoints explicit restricts the optimizer's ability to perform transformations (e.g. inlining won't occur through an explicit `gc.statepoint` call).

The problem arises in how RS4GC transforms between the abstract and explicit representations. It assumes that all collected references will be pointers, and for the example GC strategy the rule is that all pointers in `addrspace(1)` are treated as collected references.
-
Thanks for the clarification @aviansie-ben
-
In the interests of simplicity - which is an implicit principle to which I would strongly adhere - I think the correct course is to disallow the combination of compressed references and mark & sweep GC until such time as LLVM has support for non-pointer reference objects. Pursuant to that, we have internal resources associated with the LLVM project that could potentially help facilitate the necessary contributions to implement this feature on the LLVM side; with luck, we would then be able to automatically enable the feature on LLVM 12 or 13 or whatever version first implements this capability. @aviansie-ben, if I coordinate a meeting with our internal LLVM resources, would you be able and willing to describe our requirements for LLVM in this area to support non-pointer references?
-
I've been reading LLVM's GC-related documents, but would like to understand what QCC needs to do. Are the following, at a minimum, the things QCC should do?
At runtime, a GC will use the stack map information to identify pointers.
-
I only have some of the answers needed here, but I will share what I know.

As far as I understand, yes, we will likely use ….

I don't know the answer to this one.

Yes and no. We should not compile the raw stack map information from LLVM into the final executable; instead, we should read it from the object file(s) after calling ….

I am concerned about using ….

I think we need to examine the requirements for GC independently of a proposed implementation to understand the true constraints. Only then can we be perfectly clear about the cost/benefit of particular approaches. This is something that we have not explicitly done yet, but I do not picture a good outcome without having this discussion.
-
Thanks for sharing the information and your thoughts. On …
-
@michihirohorie Apologies for the delay in replying here. Seems like the notification for this issue got lost in a sea of other notifications last week 😅.
Not all pointers will need ….

For the PlaceSafepoints pass to work, we also need to add a function called `gc.safepoint_poll`, whose body is inlined at each poll site.

As @dmlloyd mentioned, we should probably pre-process these stack maps, as they're not particularly efficient to traverse at runtime and are meant to be processed before use.
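As an illustration of that pre-processing step: the `__llvm_stackmaps` section begins with a small fixed header (format version 3: a one-byte version, three reserved bytes, then counts of functions, constants, and records). A build-time tool could start by decoding that header before re-emitting the records in a runtime-friendly form. This is only a sketch against the documented layout, not code from the project:

```c
#include <stdint.h>
#include <stddef.h>

/* Decoded form of the fixed-size header at the start of __llvm_stackmaps
 * (stack map format version 3, little-endian target assumed). */
struct stackmap_header {
    uint8_t version;
    uint32_t num_functions;
    uint32_t num_constants;
    uint32_t num_records;
};

static uint32_t read_u32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too small to hold a header. */
static int parse_stackmap_header(const uint8_t *buf, size_t len,
                                 struct stackmap_header *out) {
    if (len < 16) return -1;
    out->version = buf[0];          /* bytes 1-3 are reserved padding */
    out->num_functions = read_u32le(buf + 4);
    out->num_constants = read_u32le(buf + 8);
    out->num_records = read_u32le(buf + 12);
    return 0;
}
```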
-
I've been looking through the code to figure out what changes would need to be made to support basic GC maps, with Statepoint being my primary target here. The primary issue right now is that the QCC IL does not consistently keep reference type information intact through to the codegen. References are being treated as integers by the LLVM codegen when encountered, and lowering (particularly from the layout plugin) will start adding casts from reference types to pointer types that will no longer necessarily be valid once we start requiring GC maps. Both of these issues need to be fixed to make sure that reference types make their way through to the LLVM codegen, so that we know what's a GC reference and so that the optimizer doesn't do any funny business.

Firstly, we need to fix the LLVM codegen to avoid representing reference types as integer types and to use ….

Secondly, we would need to remove the lowering of operations involving reference types to operations involving normal pointer types. Unfortunately, treating references as pointers in this manner is unsafe, as it effectively throws out the information that the reference needs to be included in GC maps and violates Statepoint's view of references as "non-integral pointers", meaning that it could result in a situation where a reference exists across a GC point but does not appear in the stack map. Currently, I know this is happening in ….

Finally, we'd need to figure out what to do about allocating new objects. The current nogc plugin will simply …
-
Overall makes sense. On the specific point on allocation, the usual invariant is to not allow a GC safe point to sneak into the sensitive heart of the allocation sequence (after raw memory is allocated, before the object header is initialized and the fields zeroed).
-
What I had in mind was to prepare a new type representing the non-integral pointer type to distinguish it from the normal pointer type. But when we introduce compressed pointers in the future, the type hierarchy will become complex…
Currently, the PlaceSafepoints pass can only insert safepoints at method entry and loop backedges. Correct?
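In C-level terms, the entry/backedge placement being asked about amounts to something like the sketch below. `safepoint_poll` and `qcc_gc_slowpath` are made-up names standing in for whatever the runtime actually provides:

```c
/* Made-up runtime interface: a flag the GC raises when it wants all
 * threads to stop, and a slow path that parks the thread. */
static volatile int safepoint_requested;
static int slowpath_entries; /* instrumentation for this sketch only */

static void qcc_gc_slowpath(void) { slowpath_entries++; }

static void safepoint_poll(void) {
    if (safepoint_requested)
        qcc_gc_slowpath();
}

long sum(const int *a, int n) {
    safepoint_poll();            /* poll at method entry */
    long s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
        safepoint_poll();        /* poll on the loop back edge */
    }
    return s;
}
```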
-
@dgrove-oss This is also how I've seen it done and this would work in theory. The problem with this approach here comes down to trying to get LLVM to actually do that. We don't have direct control over where LLVM will inject the safepoints unless we're willing to modify LLVM. AFAICT, to avoid LLVM messing with the allocation sequence, we would need to encapsulate the allocation + initialization in a separate function that will not be inlined before safepoint insertion.
@michihirohorie That's correct, but there are 2 issues with relying on this to avoid GC safepoints occurring during object allocation.

Firstly, the LLVM optimizer is not required to keep instructions in the order we emit them in the initial LLVM IR. If there's a loop before the allocation that doesn't contain any instructions with side effects observable by other threads or by the allocator itself, the LLVM optimizer would technically be free to move that loop between allocation and object initialization, since doing so would not change observable program behaviour (most parts of the GC, including where safepoints end up and whether references are live, are not generally considered observable for optimization purposes).

Secondly, array allocations with computed sizes require a loop as part of object initialization to zero out the array contents. If the array were a reference array, this could cause a GC safepoint to occur while there's still garbage data in the array, which would be interpreted as references and would likely cause the GC to crash when scanning the object.
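To illustrate the second hazard: the initialization loop below must complete before any poll site is reachable, or a stack map taken mid-loop would let the GC scan uninitialized slots. All names here are hypothetical, and `malloc` stands in for the real bump allocator:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical reference-array allocation. In generated code the zeroing
 * loop must contain no safepoint poll, or the GC could observe stale
 * bytes in the array and treat them as references. */
void **alloc_ref_array(size_t elems) {
    void **arr = malloc(elems * sizeof(void *));
    if (arr == NULL)
        return NULL;
    for (size_t i = 0; i < elems; i++)
        arr[i] = NULL;   /* no poll may occur inside this loop */
    /* only past this point is it safe for a safepoint to trigger */
    return arr;
}
```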
-
If we don't have good control over the placement of safepoints, then our current strategy of doing a series of inline stores of …
-
I see what you mean, thanks @aviansie-ben.

I have a basic question: which of the LLVM side and the runtime has the responsibility to allocate space for objects? It sounds like LLVM has that responsibility, but is it possible to ask the runtime for the memory allocation? If we could let the runtime allocate the memory, we wouldn't encounter the problem around allocation and initialization. Maybe I need to understand which of the LLVM side and the runtime allocates the heap space first and then tells the other about the allocated heap, and how to associate the allocated heap space with …
-
I cannot escape the conclusion that we must place safepoint polls and statepoint-ify our calls explicitly, and should not use ….

I'd like to remove some of the obscurity around LLVM GC. I think we should be looking at the LLVM GC facilities as a series of atoms with concrete behavior which we can assemble in a way that suits us (and potentially modify or improve), rather than as a single black box with a more ambiguous and/or complex behavioral contract that we must serve in order to utilize GC at all. But in order to do so, we have to understand what each atom does (and why), and what (if any) minimal LLVM passes are required to make them function.

We cannot process LLVM bitcode output meaningfully without a large amount of additional investment, so any approach that would require this is still a non-starter in my view. Processing statepoint stack map output after LLVM compilation completes is, however, realistically very feasible, so it makes more sense to proceed with this in mind; this also supports the view of treating each intrinsic as a usable atom with independent behavior.
-
@dmlloyd I don't understand why you dislike the Statepoint approach so much. In my understanding, both approaches have pros and cons, and we are investigating them in detail.

Statepoint

Cons:
…

Explicit Representation

Cons:
…

What I really don't understand is why you insist on the Explicit Representation without waiting for the investigation Ben and Michi are doing now. If you foresee any show-stoppers, please share them with us. I would also like you to elaborate a bit more on how to deal with the cons of the Explicit Representation approach.
-
I agree the initial release won't need to support relocation, but what about the final version? Since GC support code is scattered across the codebase, we need a design that is "ready" for relocation. I think one of the factors in deciding the design is the target applications. I know Quarkus mainly focuses on cloud-native microservices, and such applications are often short-running or easily restartable. However, once this project is published as a Red Hat/IBM-led project, GBS teams will use QCC to build their customers' applications, regardless of whether they are short-running or not. You might think using Quarkus for long-running applications is a wrong use of this technology, but the customers won't care whether they use the technology in the "right" way; they only care whether it is useful for their business. So I think we should design GC support to be ready for object relocation.
-
These are good questions. From my perspective it's not about pros/cons; it's about feasibility.

By explicitly wrapping every call with a statepoint (and manually correlating call site nodes with statepoint IDs), we will have everything we need to build the IP-to-method tables we need in order to support stack walking - something that is needed for exception support and other things in the JDK. If we don't do this, then AFAICT there is no practical solution for correlating our call site nodes to call sites in the final executable, making stack walking a much more difficult problem to solve. Therefore I don't see how pushing statepoint placement to LLVM is possible for us as things stand at present, whereas explicit representation is definitely possible. If we want to try to use LLVM for this, then we need a concrete proposal for how we can force every method call into the stack map with a known ID that we can track back to method call nodes.

I'm not opposed to supporting relocation - quite the opposite, in fact. But I do think we need a clear understanding of exactly how the relocation intrinsic functions, if we are to proceed with it.
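The IP-to-method table mentioned above is conceptually just a sorted range lookup. A minimal C sketch (with invented types; the data would be built from whichever side ends up providing it, explicit statepoint IDs or post-processed stack maps):

```c
#include <stdint.h>
#include <stddef.h>

/* Invented build-time record: a half-open range of instruction addresses
 * belonging to one compiled method. The table must be sorted by ip_start. */
typedef struct {
    uintptr_t ip_start;
    uintptr_t ip_end;
    const char *method_name;
} ip_range;

/* Binary-search the table for the range containing ip; NULL if none. */
const char *method_for_ip(const ip_range *tab, size_t n, uintptr_t ip) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ip < tab[mid].ip_start)
            hi = mid;
        else if (ip >= tab[mid].ip_end)
            lo = mid + 1;
        else
            return tab[mid].method_name;
    }
    return NULL;
}
```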
-
Hmm... I'm afraid I have to say I don't understand your points. To be precise, I know you think the Explicit Representation is a feasible way to support GC with LLVM. However, what you have mentioned so far about the Abstract Machine Model is just that you don't know. I don't understand why you don't wait for Dan and Michi's investigation and compare the two approaches.

Further, I think we should distinguish between feasibility and practicality. For example, if a work item is known to be feasible but requires 20 person-years, then it is impractical for this team. We can't afford to work on such a big item at this moment (and maybe not in the future, either). If we take the Explicit Representation, we have to implement almost all optimization algorithms ourselves. How much extra workload will that require? When will we release Qbicc as an OSS project? If we find the Abstract Machine approach to be feasible, isn't it better to work on implementing static whole-program analyses rather than re-implementing traditional optimizations? That's why Ben and Michi are investigating the feasibility. What is the reason we should stop this investigation? If you know something in more detail, please share it.
-
I am open to comparing more than one approach as soon as a second approach can be defined. If the abstract machine model approach is feasible, then presumably it can be described at least in general terms now: how would we use the stack map information, and how would stack walking work? Would we be required to have the ability to process post-optimization LLVM bitcode? I do believe that the questions I've asked need answers regardless of approach.

I agree; impractical approaches are effectively infeasible for us right now.

I'm not convinced that we will be required to implement almost all optimization algorithms if we explicitly place statepoints. But we also know that we already cannot rely solely on LLVM for all optimizations, since LLVM simply doesn't have awareness of Java semantics. The answer seems unchanged to me: there are certain optimizations we must implement, some we should implement, some we can easily implement, and some we do not need to implement. Nobody has yet proposed a specific difference in the composition of these sets based on statepoint placement strategy; I think this is part of what needs to be discussed.

To answer the question "When will we release Qbicc as an OSS project?": we're working on that now; the current plan is to open it to the public as soon as possible (there are certain things that have to be done first, including completing the project rename, but these are not expected to take long). I would estimate that the time before we make the project public will be measured in weeks rather than months.
-
@dmlloyd I don't think using the explicit relocation form of Statepoint would be a good idea. Firstly, AFAIK the explicit relocation form is generally considered to be an internal implementation detail of LLVM (at least more so than the abstract machine representation), and how it works is subject to change quite significantly between LLVM versions. In fact, the high-level documentation for Statepoint seems to be somewhat out of date in how it represents the explicit relocations, as these intrinsics have been changed in more recent LLVM versions to use a named "gc-live" operand bundle [1].

Additionally, this would require that we implement a pass that makes relocations explicit in qbicc as well (yes, we could theoretically avoid this for non-relocating GCs, but we'll eventually want the GC to be able to relocate objects anyway). This requires some non-trivial and potentially error-prone global analysis. Implementing this ourselves seems to me to be completely unnecessary considering that LLVM already supports it, and having it done prior to running LLVM doesn't seem to have any significant benefits AFAICT.

In terms of how the abstract machine model could be made to work, here's an overview of how I'd personally imagine the compile process would work:

…

None of this requires any manual processing of LLVM bitcode files. All the transformations required to make the bitcode files work can be done through existing LLVM utilities. Alternatively, we could use the ….

As for placing safepoints ourselves prior to emitting the LLVM IR, that is certainly a possibility, and the LLVM documentation does advise doing that in certain cases (namely if safepoints have behaviour other than simply performing GC operations). Really, the only downside of that is potentially losing out on some LLVM loop optimizations due to inserting the safepoints early. These optimizations can be some of the most powerful LLVM has to offer (e.g. auto-vectorization of loops), and I don't think it'd be practical to implement many of them ourselves, as they can be very finicky to implement correctly and would require a good cost model to avoid degrading performance. If we decide to insert the safepoints early, we'd potentially be looking at giving up those optimizations entirely for the foreseeable future. Maybe this is a tradeoff we're willing to make, though?

[1] https://llvm.org/docs/LangRef.html#gc-live-operand-bundles
-
Hi All, I am also interested in this topic, and this is the discussion I am maintaining to show the information/code I have gathered: ballerina-platform/nballerina#298. Currently, I am trying to programmatically get the actual locations of the heap references. For that, we have done ….

I am having a problem getting these locations programmatically, because those offsets are given relative to the function's stack frame. To explain it more, please consider the code below.

…

If we try to get all heap references from …

Really appreciate your input on this.
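On the frame-relative offsets: at runtime each recorded location generally pairs a base register (frame or stack pointer) with a signed offset, so resolving a slot during a stack walk is a single addition once the walker knows that frame's base. A hedged C sketch (the struct is invented for illustration):

```c
#include <stdint.h>

/* Invented decoded form of one indirect stack map location:
 * the slot lives at (value of the base register) + offset. */
typedef struct {
    int32_t offset;  /* signed offset recorded in the stack map */
} frame_slot;

/* Resolve the absolute address of a reference slot for one frame,
 * given that frame's base (e.g. its frame pointer) from the stack walk. */
void **slot_address(uintptr_t frame_base, frame_slot s) {
    return (void **)(frame_base + (intptr_t)s.offset);
}
```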
-
Just a couple of loose ends I thought I might want to call out as I'm wrapping up my work here:

…
[1] https://llvm.org/docs/Statepoints.html#recording-on-stack-regions
-
We need an overall garbage collection design proposal. We have certain requirements:

…

And non-requirements:

…

In addition, the overall design should be able to support an "epsilon"/no-GC mode of operation.

This implies certain design parameters:

…
The garbage collector design proposal must include an SPI that can be implemented by multiple heap+GC implementations. In addition, this may require tight backend integration (i.e. LLVM). It should also include a description of allocation strategy and reference encoding and format, along with any requirements that would apply to object layout (such as object header bits and things of that nature).