Research task: Heap & Garbage Collection design proposal #444
-
I've recently been researching the feasibility of using compressed references with an LLVM codegen that uses statepoints for generating stack maps. Unfortunately, there is one particularly nasty issue I've encountered that can cause incorrect stack maps when using the obvious solution for this. Imagine the following Java code compiled with a compression scheme that simply truncates pointers to 32 bits (it's a bit naive, but it makes the example easier to illustrate):

```java
public class Test {
    public static Object a;
    public static Object b;

    public static void bar() { ... }

    public static void foo() {
        Object c = a;
        bar();
        b = c;
    }
}
```

The obvious way to generate IR for `foo` would be:

```llvm
define void @foo() {
  %0 = load i32, i32* @a
  %1 = inttoptr i32 %0 to i8 addrspace(1)*
  call void @bar()
  %2 = ptrtoint i8 addrspace(1)* %1 to i32
  store i32 %2, i32* @b
}
```

However, the LLVM optimizer can interfere with this and legally eliminate the pointer held across the call to `bar`:

```llvm
define void @foo() {
  %0 = load i32, i32* @a
  call void @bar()
  store i32 %0, i32* @b
}
```

Since this transformation can occur before RS4GC would be run, the collected pointer will no longer be visible when stack maps are generated (after all, it now appears to be a plain integer).
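To make the scheme concrete, here is roughly what the naive truncation compression looks like at the runtime level. This is a C sketch under the assumption of a 64-bit process whose heap fits in the low 4 GiB; the function names are illustrative, not from the actual codebase. The round trip is nothing but integer casts, which is exactly why an optimizer is entitled to forward the loaded integer straight to the store:

```c
#include <stdint.h>

/* Naive compression: truncate the pointer to its low 32 bits. This only
 * round-trips correctly if every heap object lives below 4 GiB. */
static uint32_t compress_ref(void *p) {
    return (uint32_t)(uintptr_t)p;
}

static void *decompress_ref(uint32_t c) {
    return (void *)(uintptr_t)c;
}
```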
-
There's been an ongoing discussion (in LLVM chats) about …
-
While you could do that for decompression, to perform compression you'd need to perform pointer subtraction, which AFAIK unavoidably involves a `ptrtoint`.

The fundamental issue here is that the LLVM optimizer needs to treat the loads/stores as atomic operations returning/taking an uncompressed pointer.
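For reference, a heap-base-relative scheme (the kind that actually needs that pointer subtraction) might look like the following C sketch. The names and the null handling are illustrative assumptions, not taken from any existing implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical base-plus-offset compressed references. Compression is a
 * pointer subtraction (a ptrtoint in LLVM IR terms); decompression is an
 * addition. Offset 0 is reserved for null. */
static uintptr_t heap_base;

static uint32_t compress_ref(void *p) {
    return p == NULL ? 0 : (uint32_t)((uintptr_t)p - heap_base);
}

static void *decompress_ref(uint32_t c) {
    return c == 0 ? NULL : (void *)(heap_base + c);
}
```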
-
In your original example, neither …
-
I'm not quite sure I understand exactly what you're referring to (the value of …):

```java
public class Test {
    public static volatile Object a;

    public static void foo() {
        Object aOld = a;
        a = null;
        for (int i = 0; i < 1000000; i++) {
            System.out.println(new Object());
        }
        a = aOld;
    }

    public static void main(String[] args) {
        a = new Object();
        foo();
        System.out.println(a);
    }
}
```

In this case, the original value of `a` …
-
You're right, I read too quickly and misparsed the text, sorry about that.
I see, that makes sense.
-
So I've come up with a potential solution that will allow compressed references to work, although it's not an ideal solution by any means. The idea is to force LLVM to treat the reference loads/stores as being atomic with their respective decompression/compression sequences until just before RS4GC by using "intrinsic-like" functions. In the main bitcode file, two extra functions would be declared:

```llvm
attributes #0 = { nounwind argmemonly readonly "gc-leaf-function" }
declare i8 addrspace(1)* @qcc.loadref(i32* nocapture) #0

attributes #1 = { nounwind argmemonly writeonly "gc-leaf-function" }
declare void @qcc.storeref(i8 addrspace(1)* readnone, i32* nocapture) #1
```

These functions would be used in place of normal loads/stores and have their attributes set to allow LLVM to make as many optimizations as possible without allowing any of the unsafe optimizations that would allow …. Just before RS4GC is run, the declarations would be replaced by the following definitions:

```llvm
attributes #0 = { alwaysinline nounwind argmemonly readonly "gc-leaf-function" }
define linkonce i8 addrspace(1)* @qcc.loadref(i32* nocapture %ptr) #0 {
  %compressed = load i32, i32* %ptr
  %decompressed = inttoptr i32 %compressed to i8 addrspace(1)*
  ret i8 addrspace(1)* %decompressed
}

attributes #1 = { alwaysinline nounwind argmemonly writeonly "gc-leaf-function" }
define linkonce void @qcc.storeref(i8 addrspace(1)* readnone %decompressed, i32* nocapture %ptr) #1 {
  %compressed = ptrtoint i8 addrspace(1)* %decompressed to i32
  store i32 %compressed, i32* %ptr
  ret void
}
```

Importantly, these new definitions carry the additional `alwaysinline` attribute, so the compression/decompression sequences only become visible to LLVM once the unsafe optimizations can no longer run. Doing this seems to allow compressed references to be used while preventing LLVM from making unsafe optimizations that would result in stack maps missing collected references, and without interfering too heavily with its ability to optimize functions.
-
I don't have much LLVM experience, and definitely not with using it for a GC, but it seems to me that …. The example here looks very similar to the example in https://llvm.org/docs/Statepoints.html#explicit-representation. What am I missing?
-
Typically, the explicit representation of statepoints isn't used until after a number of other optimizations have run, as having the statepoints explicit restricts the optimizer's ability to perform transformations (e.g. inlining won't occur through an explicit `gc.statepoint` call).

The problem arises in how RS4GC transforms between the abstract and explicit representations. It assumes that all collected references will be pointers, and for the example GC strategy the rule is that all pointers in `addrspace(1)` are treated as collected references.
-
Thanks for the clarification @aviansie-ben
-
In the interests of simplicity - which is an implicit principle to which I would strongly adhere - I think the correct course is to disallow the combination of compressed references and mark & sweep GC until such time as LLVM has support for non-pointer reference objects. Pursuant to that, we have internal resources associated with the LLVM project that could potentially help facilitate the necessary contributions to implement this feature on the LLVM side; with luck, we would then be able to automatically enable the feature on LLVM 12 or 13 or whatever version first implements this capability. @aviansie-ben, if I coordinate a meeting with our internal LLVM resources, would you be able and willing to describe our requirements for LLVM in this area to support non-pointer references?
-
I've been reading LLVM's GC-related documents, but would like to understand what QCC needs to do. Are the following, at a minimum, the things QCC should do?
At runtime, a GC will use the stack map information to identify pointers.
-
I only have some of the answers needed here, but I will share what I know.

As far as I understand, yes, we will likely use ….

I don't know the answer to this one.

Yes and no. We should not compile the raw stack map information from LLVM into the final executable; instead, we should read it from the object file(s) after calling ….

I am concerned about using ….

I think we need to examine the requirements for GC independently of a proposed implementation to understand the true constraints. Only then can we be perfectly clear about the cost/benefit of particular approaches. This is something that we have not explicitly done yet, but I do not picture a good outcome without having this discussion.
-
Thanks for sharing the information and your thoughts. On …
-
@michihirohorie Apologies for the delay in replying here. Seems like the notification for this issue got lost in a sea of other notifications last week 😅.
Not all pointers will need ….

For the PlaceSafepoints pass to work, we also need to add a function called `gc.safepoint_poll`, whose body is inlined at each poll site.

As @dmlloyd mentioned, we should probably pre-process these stack maps, as they're not particularly efficient to traverse at runtime and are meant to be processed before use.
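As an illustration of that pre-processing step: the `__llvm_stackmaps` section begins with a small fixed header (format version 3: a one-byte version, three reserved bytes, then counts of functions, constants, and records). A build-time tool could start by decoding that header before re-emitting the records in a runtime-friendly form. This is only a sketch against the documented layout, not code from the project:

```c
#include <stdint.h>
#include <stddef.h>

/* Decoded form of the fixed-size header at the start of __llvm_stackmaps
 * (stack map format version 3, little-endian target assumed). */
struct stackmap_header {
    uint8_t version;
    uint32_t num_functions;
    uint32_t num_constants;
    uint32_t num_records;
};

static uint32_t read_u32le(const uint8_t *p) {
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the buffer is too small to hold a header. */
static int parse_stackmap_header(const uint8_t *buf, size_t len,
                                 struct stackmap_header *out) {
    if (len < 16) return -1;
    out->version = buf[0];          /* bytes 1-3 are reserved padding */
    out->num_functions = read_u32le(buf + 4);
    out->num_constants = read_u32le(buf + 8);
    out->num_records = read_u32le(buf + 12);
    return 0;
}
```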
-
I've been looking through the code to figure out what changes would need to be made to support basic GC maps, with Statepoint being my primary target here. The primary issue right now is that the QCC IL does not consistently keep reference type information intact through to the codegen. References are being treated as integers by the LLVM codegen when encountered, and lowering (particularly from the layout plugin) will start adding casts from reference types to pointer types that will no longer necessarily be valid once we start requiring GC maps. Both of these issues need to be fixed to make sure that reference types make their way through to the LLVM codegen, so that we know what's a GC reference and so that the optimizer doesn't do any funny business.

Firstly, we need to fix the LLVM codegen to avoid representing reference types as integer types and to use ….

Secondly, we would need to remove the lowering of operations involving reference types to operations involving normal pointer types. Unfortunately, treating references as pointers in this manner is unsafe, as it effectively throws out the information that the reference needs to be included in GC maps and violates Statepoint's view of references as "non-integral pointers", meaning that it could result in a situation where a reference exists across a GC point but does not appear in the stack map. Currently, I know this is happening in ….

Finally, we'd need to figure out what to do about allocating new objects. The current nogc plugin will simply …
-
Overall makes sense. On the specific point on allocation, the usual invariant is to not allow a GC safe point to sneak into the sensitive heart of the allocation sequence (after raw memory is allocated, before the object header is initialized and the fields zeroed).
-
What I had in mind was to prepare a new type representing the non-integral pointer type to distinguish it from the normal pointer type. But when we introduce compressed pointers in the future, the type hierarchy will become complex…
Currently, the PlaceSafepoints pass can only insert safepoints at method entry and loop backedges. Correct?
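In C-level terms, the entry/backedge placement being asked about amounts to something like the sketch below. `safepoint_poll` and `qcc_gc_slowpath` are made-up names standing in for whatever the runtime actually provides:

```c
/* Made-up runtime interface: a flag the GC raises when it wants all
 * threads to stop, and a slow path that parks the thread. */
static volatile int safepoint_requested;
static int slowpath_entries; /* instrumentation for this sketch only */

static void qcc_gc_slowpath(void) { slowpath_entries++; }

static void safepoint_poll(void) {
    if (safepoint_requested)
        qcc_gc_slowpath();
}

long sum(const int *a, int n) {
    safepoint_poll();            /* poll at method entry */
    long s = 0;
    for (int i = 0; i < n; i++) {
        s += a[i];
        safepoint_poll();        /* poll on the loop back edge */
    }
    return s;
}
```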
-
@dgrove-oss This is also how I've seen it done and this would work in theory. The problem with this approach here comes down to trying to get LLVM to actually do that. We don't have direct control over where LLVM will inject the safepoints unless we're willing to modify LLVM. AFAICT, to avoid LLVM messing with the allocation sequence, we would need to encapsulate the allocation + initialization in a separate function that will not be inlined before safepoint insertion.
@michihirohorie That's correct, but there are 2 issues with relying on this to avoid GC safepoints occurring during object allocation.

Firstly, the LLVM optimizer is not required to keep instructions in the order we emit them in the initial LLVM IR. If there's a loop before the allocation that doesn't contain any instructions with side effects observable by other threads or by the allocator itself, the LLVM optimizer would technically be free to move that loop between allocation and object initialization, since doing so would not change observable program behaviour (most parts of the GC, including where safepoints end up and whether references are live, are not generally considered observable for optimization purposes).

Secondly, array allocations with computed sizes require a loop as part of object initialization to zero out the array contents. If the array were a reference array, this could cause a GC safepoint to occur while there's still garbage data in the array, which would be interpreted as references and would likely cause the GC to crash when scanning the object.
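To illustrate the second hazard: the initialization loop below must complete before any poll site is reachable, or a stack map taken mid-loop would let the GC scan uninitialized slots. All names here are hypothetical, and `malloc` stands in for the real bump allocator:

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical reference-array allocation. In generated code the zeroing
 * loop must contain no safepoint poll, or the GC could observe stale
 * bytes in the array and treat them as references. */
void **alloc_ref_array(size_t elems) {
    void **arr = malloc(elems * sizeof(void *));
    if (arr == NULL)
        return NULL;
    for (size_t i = 0; i < elems; i++)
        arr[i] = NULL;   /* no poll may occur inside this loop */
    /* only past this point is it safe for a safepoint to trigger */
    return arr;
}
```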
-
If we don't have good control over the placement of safepoints, then our current strategy of doing a series of inline stores of …
-
I see what you mean, thanks @aviansie-ben.

I have a basic question: which of the LLVM side and the runtime has the responsibility to allocate space for objects? It sounds like LLVM has that responsibility, but is it possible to ask the runtime for the memory allocation? If we could let the runtime allocate the memory, we wouldn't encounter the problem around allocation and initialization. Maybe I need to understand which of the LLVM side and the runtime allocates the heap space first and then tells the other about the allocated heap, and how to associate the allocated heap space with …
-
I cannot escape the conclusion that we must place safepoint polls and statepoint-ify our calls explicitly, and should not use ….

I'd like to remove some of the obscurity around LLVM GC. I think we should be looking at the LLVM GC facilities as a series of atoms with concrete behavior which we can assemble in a way that suits us (and potentially modify or improve), rather than as a single black box with a more ambiguous and/or complex behavioral contract that we must serve in order to utilize GC at all. But in order to do so, we have to understand what each atom does (and why), and what (if any) minimal LLVM passes are required to make them function.

We cannot process LLVM bitcode output meaningfully without a large amount of additional investment, so any approach that would require this is still a non-starter in my view. Processing statepoint stack map output after LLVM compilation completes is, however, realistically very feasible, so it makes more sense to proceed with this in mind; this also supports the view of treating each intrinsic as a usable atom with independent behavior.
-
@dmlloyd I don't understand why you dislike the Statepoint approach so much. In my understanding, both approaches have pros and cons, and we are investigating them in detail.

Statepoint

Cons:
…

Explicit Representation

Cons:
…

What I really don't understand is why you insist on the Explicit Representation without waiting for the investigation Ben and Michi are doing now. If you foresee any show-stoppers, please share them with us. I would also like you to elaborate a bit more on how to deal with the cons of the Explicit Representation approach.
-
I agree the initial release won't need to support relocation, but what about the final version? Since GC support code is scattered across the codebase, we need a design that is "ready" for relocation. I think one of the factors in deciding the design is the target applications. I know Quarkus mainly focuses on cloud-native microservices, and such applications are often short-running or easily restartable. However, once this project is published as a Red Hat/IBM-led project, GBS teams will use QCC to build their customers' applications, regardless of whether they are short-running or not. You might think using Quarkus for long-running applications is a wrong use of this technology, but the customers won't care whether they use the technology in the "right" way; they only care whether it is useful for their business. So I think we should design GC support to be ready for object relocation.
-
These are good questions. From my perspective it's not about pros/cons; it's about feasibility.

By explicitly wrapping every call with a statepoint (and manually correlating call site nodes with statepoint IDs), we will have everything we need to build the IP-to-method tables we need in order to support stack walking - something that is needed for exception support and other things in the JDK. If we don't do this, then AFAICT there is no practical solution for correlating our call site nodes to call sites in the final executable, making stack walking a much more difficult problem to solve. Therefore I don't see how pushing statepoint placement to LLVM is possible for us as things stand at present, whereas explicit representation is definitely possible. If we want to try to use LLVM for this, then we need a concrete proposal for how we can force every method call into the stack map with a known ID that we can track back to method call nodes.

I'm not opposed to supporting relocation - quite the opposite, in fact. But I do think we need a clear understanding of exactly how the relocation intrinsic functions, if we are to proceed with it.
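The IP-to-method table mentioned above is conceptually just a sorted range lookup. A minimal C sketch (with invented types; the data would be built from whichever side ends up providing it, explicit statepoint IDs or post-processed stack maps):

```c
#include <stdint.h>
#include <stddef.h>

/* Invented build-time record: a half-open range of instruction addresses
 * belonging to one compiled method. The table must be sorted by ip_start. */
typedef struct {
    uintptr_t ip_start;
    uintptr_t ip_end;
    const char *method_name;
} ip_range;

/* Binary-search the table for the range containing ip; NULL if none. */
const char *method_for_ip(const ip_range *tab, size_t n, uintptr_t ip) {
    size_t lo = 0, hi = n;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ip < tab[mid].ip_start)
            hi = mid;
        else if (ip >= tab[mid].ip_end)
            lo = mid + 1;
        else
            return tab[mid].method_name;
    }
    return NULL;
}
```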
-
Hmm... I'm afraid I have to say I don't understand your points. To be precise, I know you think the Explicit Representation is a feasible way to support GC with LLVM. However, what you have mentioned so far about the Abstract Machine Model is just that you don't know. I don't understand why you don't wait for Dan and Michi's investigation and compare the two approaches.

Further, I think we should distinguish between feasibility and practicality. For example, if a work item is known to be feasible but requires 20 person-years, then it is impractical for this team. We can't afford to work on such a big item at this moment (and maybe not in the future, either). If we take the Explicit Representation, we have to implement almost all optimization algorithms ourselves. How much extra workload will that require? When will we release Qbicc as an OSS project? If we find the Abstract Machine approach to be feasible, isn't it better to work on implementing static whole-program analyses rather than re-implementing traditional optimizations? That's why Ben and Michi are investigating the feasibility. What is the reason we should stop this investigation? If you know something in more detail, please share it.
-
I am open to comparing more than one approach as soon as a second approach can be defined. If the abstract machine model approach is feasible, then presumably it can be described at least in general terms now: how would we use the stack map information, and how would stack walking work? Would we be required to have the ability to process post-optimization LLVM bitcode? I do believe that the questions I've asked need answers regardless of approach.

I agree; impractical approaches are effectively infeasible for us right now.

I'm not convinced that we will be required to implement almost all optimization algorithms if we explicitly place statepoints. But we also know that we already cannot rely solely on LLVM for all optimizations, since LLVM simply doesn't have awareness of Java semantics. The answer seems unchanged to me: there are certain optimizations we must implement, some we should implement, some we can easily implement, and some we do not need to implement. Nobody has yet proposed a specific difference in the composition of these sets based on statepoint placement strategy; I think this is part of what needs to be discussed.

To answer the question "When will we release Qbicc as an OSS project?": we're working on that now; the current plan is to open it to the public as soon as possible (there are certain things that have to be done first, including completing the project rename, but these are not expected to take long). I would estimate that the time before we make the project public will be measured in weeks rather than months.
-
@dmlloyd I don't think using the explicit relocation form of Statepoint would be a good idea. Firstly, AFAIK the explicit relocation form is generally considered to be an internal implementation detail of LLVM (at least more so than the abstract machine representation), and how it works is subject to change quite significantly between LLVM versions. In fact, the high-level documentation for Statepoint seems to be somewhat out of date in how it represents the explicit relocations, as these intrinsics have been changed in more recent LLVM versions to use a named "gc-live" operand bundle [1].

Additionally, this would require that we implement a pass that makes relocations explicit in qbicc as well (yes, we could theoretically avoid this for non-relocating GCs, but we'll eventually want the GC to be able to relocate objects anyway). This requires some non-trivial and potentially error-prone global analysis. Implementing this ourselves seems to me to be completely unnecessary considering that LLVM already supports it, and having it done prior to running LLVM doesn't seem to have any significant benefits AFAICT.

In terms of how the abstract machine model could be made to work, here's an overview of how I'd personally imagine the compile process would work:

…

None of this requires any manual processing of LLVM bitcode files. All the transformations required to make the bitcode files work can be done through existing LLVM utilities. Alternatively, we could use the ….

As for placing safepoints ourselves prior to emitting the LLVM IR, that is certainly a possibility, and the LLVM documentation does advise doing that in certain cases (namely if safepoints have behaviour other than simply performing GC operations). Really, the only downside of that is potentially losing out on some LLVM loop optimizations due to inserting the safepoints early. These optimizations can be some of the most powerful LLVM has to offer (e.g. auto-vectorization of loops), and I don't think it'd be practical to implement many of them ourselves, as they can be very finicky to implement correctly and would require a good cost model to avoid degrading performance. If we decide to insert the safepoints early, we'd potentially be looking at giving up those optimizations entirely for the foreseeable future. Maybe this is a tradeoff we're willing to make, though?

[1] https://llvm.org/docs/LangRef.html#gc-live-operand-bundles
-
Hi All, I am also interested in this topic, and this is the discussion I am maintaining to show the information/code I have gathered: ballerina-platform/nballerina#298. Currently, I am trying to programmatically get the actual locations of the heap references. For that, we have done ….

I am having a problem getting these locations programmatically, because those offsets are given relative to the function's stack frame. To explain it more, please consider the code below.

…

If we try to get all heap references from …

Really appreciate your input on this.
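On the frame-relative offsets: at runtime each recorded location generally pairs a base register (frame or stack pointer) with a signed offset, so resolving a slot during a stack walk is a single addition once the walker knows that frame's base. A hedged C sketch (the struct is invented for illustration):

```c
#include <stdint.h>

/* Invented decoded form of one indirect stack map location:
 * the slot lives at (value of the base register) + offset. */
typedef struct {
    int32_t offset;  /* signed offset recorded in the stack map */
} frame_slot;

/* Resolve the absolute address of a reference slot for one frame,
 * given that frame's base (e.g. its frame pointer) from the stack walk. */
void **slot_address(uintptr_t frame_base, frame_slot s) {
    return (void **)(frame_base + (intptr_t)s.offset);
}
```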
-
Just a couple of loose ends I thought I might want to call out as I'm wrapping up my work here:

…
[1] https://llvm.org/docs/Statepoints.html#recording-on-stack-regions
-
We need an overall garbage collection design proposal. We have certain requirements:

…

And non-requirements:

…

In addition, the overall design should be able to support an "epsilon"/no-GC mode of operation.

This implies certain design parameters:

…
The garbage collector design proposal must include an SPI that can be implemented by multiple heap+GC implementations. In addition, this may require tight backend integration (i.e. LLVM). It should also include a description of allocation strategy and reference encoding and format, along with any requirements that would apply to object layout (such as object header bits and things of that nature).