Sorting out the object header, pinning, inflation, locks, and the GC contract #664

dmlloyd · 2021-09-30T17:36:13Z

dmlloyd
Sep 30, 2021
Maintainer

These topics are intertwined. Here are the facts as I see them:

Some objects must be pinned their entire life, like the initial Class objects that are deserialized into an array that is indexed by type ID and thus they cannot change or be moved, and are GC roots.
It is useful (and presently necessary) for some objects (j.l.Thread is the immediate example) to be directly allocated in an immovable generation.
It is inevitable that a generational GC will eventually recognize that scooting old objects around stops being profitable, and will eventually leave them in one place, and thus it seems reasonable to attempt to formalize this concept in the GC contract in two ways: first, by allowing objects to be permanently pinned later in life on an ad-hoc basis, and second, by allowing allocation requests to specify that an object be initially (and permanently) pinned.
Inflation techniques that involve expanding the size of the object necessarily require that the object move in memory, because the old location might not have enough space for the new data. Thus, objects which are expected to be permanently pinned but also may potentially be inflated in this way must be pre-inflated unless it can be statically proven that it will never be necessary. Alternatively, this kind of technique could be avoided in favor of other inflation solutions (including using external tables either for the additional information or for header information that is displaced by the additional information), however some solutions (like external tables) pose other GC complications.
Locks: Naively using POSIX mutexes is useful to an extent (they perform well and have a clear contract), but it requires significant space (on x86_64 Linux, a pthread_mutex_t/pthread_cond_t pair requires 88 bytes of contiguous space). This implies that using POSIX mutexes "in line", while giving potentially good locality with fewer indirections, will consume significant memory unless inflated lazily. Additionally, neither POSIX mutexes nor conditions may safely be moved in memory (by specification). This means that the object would have to be pinned after inflation (and will require initially pinned allocations to be pre-inflated), and further prevents techniques such as creating the mutex and condition separately on demand.
Locks: Naively using a Java type such as ReentrantLock for the monitor and associated condition means that, using in-line object storage for the monitor, the allocated object itself would only need to accommodate one or two reference values (for just the lock or the lock+condition, respectively) which would be a maximum of between 8 and 16 bytes total depending on the platform and whether reference compression is in effect. Using inflation is simpler for this kind of solution than (for example) in-line POSIX mutexes, since the object can be safely moved at any time even if the lock has been inflated, and is "GC friendly" in that the inflated references can be treated as normal reachable objects. Out of line techniques (e.g. a map) don't have even these minimal storage requirements but generally require more indirection and may pose GC complications.
Locks: Introducing a single Java Monitor (final) class which implements both lock and condition behavior, along with in-line inflation, would have the benefits of the above ReentrantLock solution while only requiring one reference value. Using ReentrantLock to implement the functionality within that class would be a pragmatic short-term solution, though it would not necessarily be optimal in the long term. Another potentially superior solution which is easy enough to be done in the near term but may also be viable in the long term (depending on how OS-portable it is) would be to inline POSIX mutex and condition structures into the Monitor object as value fields (which btw should work today). This solution would require instances of the Monitor object to always be pinned upon allocation as mentioned above. Acquisition of the Monitor for a given object instance could be accomplished using a getMonitor VM helper method which would read the hidden field, taking into account object inflation possibly at a later date. This would allow the monitor bytecodes to be implemented as a pair of method calls (get monitor, call lock/unlock) and the Object wait/notify methods to be implemented in pure Java (get monitor, call await/signal).
Identity hash code, at the risk of injury due to sudden change of direction: We've previously discussed how the identity hash code works in the context of moving objects, and the various states that arise from that: object with no IHC, object whose IHC has been queried but can be directly derived somehow from the object's state (e.g. location), and object whose IHC is explicitly specified on the object. We've discussed the idea that the IHC need not even be a full 32 bits in order to potentially perform well in a hash table. OTOH, OpenJDK's Project Lilliput has floated the idea that a longer IHC might for some reason be desirable, perhaps 64 or even 128 bits. In any event, some (maybe most) objects won't need an IHC (of any length) ever, and others (records for example) will always have an IHC that is derivable from the object's state regardless of where it is located or any other factor. Therefore IHC is another candidate for inflation, in the event that we do not end up with bits to spare in the header due to things like GC and lock inflation eating them up. In particular we wouldn't want to force a large object header because of IHC; for example if we can get away with 16 header bits and a 16 bit type ID, then stateless objects could have a minimum size of only 4 bytes on platforms which allow this minimal alignment, and never grow any larger in their lifetime.
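To make the single-Monitor point above a bit more concrete, here is a minimal sketch of the short-term variant that delegates to ReentrantLock. Everything named here is an assumption for illustration: `MonitoredObject` stands in for an arbitrary object with a hidden monitor slot, and its `getMonitor` method stands in for the VM helper, which in a real runtime would read a hidden field rather than an `AtomicReference`.

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch: one final class providing both lock and condition behavior.
final class Monitor {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition condition = lock.newCondition();

    void lock() { lock.lock(); }
    void unlock() { lock.unlock(); }
    boolean isHeldByCurrentThread() { return lock.isHeldByCurrentThread(); }

    // Object.wait() equivalent; the caller must hold the monitor.
    void await() throws InterruptedException { condition.await(); }
    // Object.notify() and notifyAll() equivalents; the caller must hold the monitor.
    void signal() { condition.signal(); }
    void signalAll() { condition.signalAll(); }
}

// Stand-in for an object whose hidden monitor slot starts out empty (uninflated).
final class MonitoredObject {
    private final AtomicReference<Monitor> monitorSlot = new AtomicReference<>();

    // Hypothetical equivalent of the getMonitor VM helper: inflate lazily, exactly once.
    Monitor getMonitor() {
        Monitor m = monitorSlot.get();
        if (m == null) {
            Monitor fresh = new Monitor();
            // If another thread won the race, use its monitor instead of ours.
            m = monitorSlot.compareAndSet(null, fresh) ? fresh : monitorSlot.get();
        }
        return m;
    }
}
```

With this shape, a monitorenter would become `obj.getMonitor().lock()`, a monitorexit would become `obj.getMonitor().unlock()`, and wait/notify would map to `await`/`signal` on the same monitor.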

If we use in-line inflation for both locks and IHC (which represents both a "best case" in that object size is potentially minimized as well as a "worst case" in that the GC needs awareness and we have to do some work to locate and/or create the extra fields), a potential approach presents itself.

Whereas naively placing extra data at the end of the object would solve the problem, the size of the object would have to be used as a part of the calculation as to where to find the extra data. It would be possible to avoid this issue by placing the inflated fields before the object in memory at negative offsets from the object's base pointer. If we only had one kind of inflated information, the calculation would be fairly simple: if the header bits indicate the data is not present, inflate the object; then, add a constant negative offset to the base pointer to get the correct data pointer.

With multiple data items that can be independently inflated, the offset computation becomes slightly more complex. Given a total order over the potential inflated data, the offset for a given item would depend (solely) on what other earlier items are present. The data should be sorted by size/alignment requirement in such a way as to minimize space wastage.
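As a rough illustration of that offset computation, the sketch below assumes two inflatable items, a total order with the monitor reference nearest the base pointer, one header bit per item, and item sizes of 8 and 4 bytes. All of those choices (and the names `InflatedLayout`, `offsetOf`) are made up for the example, not a proposed layout.

```java
// Hypothetical sketch of locating inflated data at negative offsets from the object base.
final class InflatedLayout {
    // Assumed total order, nearest to the base pointer first; sizes are illustrative only.
    static final int MONITOR = 0; // one 8-byte reference (assumption)
    static final int IHC = 1;     // one 4-byte hash code (assumption)
    static final int[] SIZE = { 8, 4 };

    // Bit i of headerBits is set when item i is present (inflated).
    static boolean isPresent(int headerBits, int item) {
        return (headerBits & (1 << item)) != 0;
    }

    // Returns the (negative) offset of the item from the base pointer.
    // Only the item itself and earlier items in the total order influence the result.
    static int offsetOf(int headerBits, int item) {
        if (!isPresent(headerBits, item)) {
            throw new IllegalStateException("item not inflated");
        }
        int offset = 0;
        for (int i = 0; i <= item; i++) {
            if (isPresent(headerBits, i)) {
                offset -= SIZE[i];
            }
        }
        return offset;
    }
}
```

Under these assumptions, and with an 8-byte-aligned base pointer, putting the larger item nearest the base keeps both items naturally aligned without padding: the monitor always lands at -8, and the IHC lands at -4 when alone or at -12 when the monitor is also present.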

Here are some visual representations I've cooked up just now. The images are not necessarily proportional. Word sizes are variable among platforms, and type ID size might also vary, which would alter the layout in various ways, so I've deliberately left off any kind of numerical offsets. But I think for purposes of illustration it should be pretty clear what is going on.

Uninflated object

Inflated object, just monitor

Inflated object, just IHC

Fully inflated object

dmlloyd · 2021-09-30T18:18:57Z

dmlloyd
Sep 30, 2021
Maintainer Author

I forgot to cover temporary pinning, where an object needs to be held at one address but only for the duration of some short operation. I believe that GraalVM fulfills this need by preventing safepoints but I could be wrong about that.

Also worth noting that ad-hoc permanent object pinning likely requires some kind of safepoint coordination.

0 replies

DanHeidinga · 2021-09-30T18:46:18Z

DanHeidinga
Sep 30, 2021
Collaborator

Thanks for doing the initial write up on this. I'm going to quibble with a couple of the initial points to ensure we all have the same view of the foundation we're building on.

I'll note that a lot of GC development has moved away from the idea of permanent generations as better algorithms have been developed that benefit from letting the GC have full(er) control and that there's a lot of research (and practice!) in this area we can learn from.

Some objects must be pinned their entire life, like the initial Class objects that are deserialized into an array that is indexed by type ID and thus they cannot change or be moved, and are GC roots.

Class objects - being immortal as we don't allow class unload - are probably the exception and thus are good candidates for a perm gen. That's not the only way we can do this though: our type ID indexed array could be an array of pointers to classes and treated as a gc root without requiring the classes to never move.

I'll also mention that perm gen / initial heap doesn't equal lower footprint for multiple deployments on the same machine. Any sharing will be quickly broken by the first write back to that heap segment. We should discount any footprint rationale on that basis.

It is useful (and presently necessary) for some objects (j.l.Thread is the immediate example) to be directly allocated in an immovable generation.

Allocating things in an immovable generation leads to fragmentation when the object dies and we can't compact the space. We should try to avoid relying on immovable gen as much as possible to avoid the dreaded OOM: PermGen problems.

The design @theresa-m and I had been talking about for this was to borrow from OpenJ9 and use a circular linked list of QThread structs that act as a GC root and provide the needed indirection to allow thread objects to be safely (indirectly) passed to native C functions. This kind of list of threads is needed anyway for cooperative safepoints and other async polling.

It is inevitable that a generational GC will eventually recognize that scooting old objects around stops being profitable, and will eventually leave them in one place, and thus it seems reasonable to attempt to formalize this concept in the GC contract in two ways: first, by allowing objects to be permanently pinned later in life on an ad-hoc basis, and second, by allowing allocation requests to specify that an object be initially (and permanently) pinned.

This is typically old gen or tenure space in a generational GC. These objects don't move often unless there is sufficient memory pressure that an old-gen collection is required, in which case being able to compact the space is critical to make effective use of the reclaimed memory.

Pinning complicates the GC's life. It's best to avoid it and use other tools - like Handles which provide a level of indirection - instead. Ripping pinning back out later is hard but usually necessary to get better GC efficiency later.

Do we want to base the design on extensive use of pinning which will limit our ability to use better GC tech in the future? I'd rather avoid introducing it if we can (and experience shows, we definitely can)
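For readers less familiar with the handle approach mentioned above, here is a minimal simulation of the indirection. Addresses are plain `long` values, and the `HandleTable` class and its method names are hypothetical; a real runtime would treat the table as a GC root and update it while forwarding pointers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: native code holds a stable handle, and only the
// table entry changes when the GC moves the object.
final class HandleTable {
    private final List<Long> addresses = new ArrayList<>();

    // Register an object's current (simulated) address and return a stable handle.
    int create(long address) {
        addresses.add(address);
        return addresses.size() - 1;
    }

    // Dereference the handle to the object's current address.
    long resolve(int handle) {
        return addresses.get(handle);
    }

    // Called by the (simulated) GC: every entry pointing at the old location is forwarded.
    void objectMoved(long from, long to) {
        for (int i = 0; i < addresses.size(); i++) {
            if (addresses.get(i) == from) {
                addresses.set(i, to);
            }
        }
    }
}
```

The cost is one extra dependent load per access through a handle, which is the trade made in exchange for not having to pin the object.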

9 replies

dmlloyd Oct 1, 2021
Maintainer Author

@DanHeidinga wrote:

We need the other complexity any way to support JNI.

Just want to throw this bomb out now before we get too far into the weeds - I don't intend to support JNI in qbicc, maybe ever. And I would push for JNI to be optional for static images in any definition of static images that get into the JVMS, the JLS, or the JDK as part of Leyden or anything else. By the time static images have gained traction, I think the superior Panama mechanisms will be available.

@DanHeidinga wrote:

Adding non-immortal objects to our permgen increases complexity of the overall system rather than avoiding it.

Permgen should only hold objects that are truly immortal and created at build time - so that includes Class objects, interned Strings (and their char[]) referred to by ConstantPool entries (ie: ldc <String>), maybe Objects held by static final fields (but not the other objects these roots point to unless they are also static final), ClassLoaders, maybe others?

Why include the interned strings and constants and things then? Why not propose pushing everything other than the classes to the tenured space?

I believe that at present, our run time keeps all of our reachable global references (which include static fields and constant values which are references to objects) in a single sequential section for ease of GC (or at least, that was what was discussed), but I think these objects could still be movable, even the constants. Maybe it's beneficial to keep the immortal objects in a compact perm gen off to the side since they can always be optimally compact? But then we'd have to change the serialization algorithm a little bit to put the provably immortal objects into a different section (probably the same one as the class array really), so that the non-immortal objects can still be loaded directly into memory that would (hopefully) become our tenured space. Of course all of those immortal objects would need to be pre-inflated...

@DanHeidinga wrote:

We would want to avoid the load though. If we can translate from a type ID to a class reference without reading memory, that seems like a win to me.

I'd want to look at the instruction sequences to see if that load matters or not. Most uses of the .class aren't going to be affected by an indirection to get the Class object and shouldn't cause pipeline stalls in the normal case. The ones we want to watch are the accesses to the header, and vtable / itable dispatch sequences, and instanceof/checkcast sequences. There we want to minimize the number of dependent loads.

Agreed. But it is very common AFAICT to use .class in instanceof/checkcast situations and as a type token, and thus it would seem to me that both cases would benefit from losing the extra load (I mean, in the end, not having the load is always better than having it, even if only by a little). In any event, it seems a feasible change to make the class array hold the actual objects rather than references, and then change the Load() to ReferenceOf(), which would eliminate the load for every non-refarray case.

DanHeidinga Oct 1, 2021
Collaborator

Just want to throw this bomb out now before we get too far into the weeds - I don't intend to support JNI in qbicc, maybe ever. And I would push for JNI to be optional for static images in any definition of static images that get into the JVMS, the JLS, or the JDK as part of Leyden or anything else. By the time static images have gained traction, I think the superior Panama mechanisms will be available.

Panama vs. JNI is orthogonal to the point. Both mechanisms need a level of indirection (JNI locals / handles / etc.) to be able to support passing object pointers through C APIs.

DanHeidinga Oct 1, 2021
Collaborator

Why include the interned strings and constants and things then? Why not propose pushing everything other than the classes to the tenured space?

If our Classes are immortal, then anything tied to their lifetime is also immortal. Interned Strings created from string literals at build time fit that profile. These are objects that can never die by fiat. It'd be fine if we didn't treat these extra objects as immortal and just let the GC rediscover their lifecycle. If we're going to have a permgen, then we may be able to find other immortal objects and benefit from that.

I believe that at present, our run time keeps all of our reachable global references (which include static fields and constant values which are references to objects) in a single sequential section for ease of GC (or at least, that was what was discussed), but I think these objects could still be movable, even the constants. Maybe it's beneficial to keep the immortal objects in a compact perm gen off to the side since they can always be optimally compact? But then we'd have to change the serialization algorithm a little bit to put the provably immortal objects into a different section (probably the same one as the class array really), so that the non-immortal objects can still be loaded directly into memory that would (hopefully) become our tenured space. Of course all of those immortal objects would need to be pre-inflated...

If we're counting on some objects never moving, then we'd be better to group them together and have an explicit space (ie: permgen) to codify that contract.

Tenure space objects can move due to compaction. So more care needs to be taken with pointers into tenure.

dgrove-oss Oct 1, 2021
Maintainer

We're writing the initial heap into a data segment of the executable. I think even if things in this segment become unreachable (garbage), we are very unlikely to bother trying to reuse the space for something else. It's not going to be organized in the same way as any of the dynamic memory that the GC is responsible for allocating/managing.

DanHeidinga Oct 1, 2021
Collaborator

We're not going to compact or move those objects but that's only for the ones created at build time. Which would mean we shouldn't rely on Thread objects for example being treated as permgen.

DanHeidinga · 2021-09-30T18:56:05Z

DanHeidinga
Sep 30, 2021
Collaborator

It's also worth re-reading the ObjectModel doc that @dgrove-oss put together as it lays out some of the design space and tradeoffs for locks and hashcodes: https://github.com/qbicc/qbicc/blob/main/docs/ObjectModel.adoc

1 reply

dmlloyd Oct 1, 2021
Maintainer Author

That's what got me thinking about this. We've made some progress and now some of the information in that doc is a bit outdated. Also, the fact that so many disparate yet interdependent things may require header bit coordination has made it a bit hard for me to imagine a good solution for a header bit allocation API. I've been thinking that it would be more practical to have a single point where everything is implemented, but that means we need to know (concretely) what we're doing in terms of locks and IHC as well as what we will need to reserve for GC. We're getting close to the point where we need these things working and so I'd love to see us come up with a plan that we can execute now as opposed to thinking what we might be able to do in the future.

dgrove-oss · 2021-10-01T14:03:14Z

dgrove-oss
Oct 1, 2021
Maintainer

One comment about inflation. As noted in the main description, inflation usually involves moving the object (to get the space one needs for the inflated object). That implies it needs to be done during a GC cycle, because you have to find and forward all of the pointers to the object to refer to the new location. So, inflation is a natural fit for dealing with identity hash codes (use the address of the object until the object is moved by GC, then inflate when copying to keep the original hash code value in the inflated form). It's a harder fit for locking since the need for a lock isn't tied to the GC cycle so directly.

That's why for locking I would tend to go for some variant of the lock nursery design (maybe not a great name, but it's what we called it 20 years ago). This avoids inline inflation for locking, by making fairly accurate guesses (based on presence of synchronized methods) whether or not a type is likely to be locked frequently.
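The identity hash code scheme described above (use the address until the GC moves the object, then inflate during the copy) can be simulated in a few lines. This is only a sketch: `SimObject`, the three state names, and the use of `Long.hashCode` to derive a hash from a simulated address are all assumptions made for the example.

```java
// Hypothetical simulation of address-based identity hash codes with inflation on move.
final class SimObject {
    enum HashState { UNHASHED, HASHED, HASHED_AND_MOVED }

    long address;                 // simulated current location
    HashState state = HashState.UNHASHED;
    int storedHash;               // only meaningful once the object has been inflated

    SimObject(long address) { this.address = address; }

    int identityHashCode() {
        if (state == HashState.HASHED_AND_MOVED) {
            return storedHash;    // read from the inflated field
        }
        state = HashState.HASHED; // remember that the address has been observed
        return Long.hashCode(address);
    }

    // Called by the (simulated) GC when it copies the object.
    void moveTo(long newAddress) {
        if (state == HashState.HASHED) {
            storedHash = Long.hashCode(address); // preserve the original value
            state = HashState.HASHED_AND_MOVED;  // the copy is the inflated form
        }
        address = newAddress;
    }
}
```

Objects that are never hashed stay in the `UNHASHED` state across any number of moves and never pay for the extra field, which is the property that makes this a good fit for GC-time inflation.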

6 replies

dgrove-oss Oct 1, 2021
Maintainer

Yeah, this is the key design decision. Once you decide to get the lock offset from the Class object, then you have a lot of flexibility in deciding which types get locks and which don't (and it is fairly easy to change).

dmlloyd Oct 1, 2021
Maintainer Author

@dgrove-oss wrote:

<...> It's a harder fit for locking since the need for a lock isn't tied to the GC cycle so directly.

That's why for locking I would tend to go for some variant of the lock nursery design (maybe not a great name, but it's what we called it 20 years ago). This avoids inline inflation for locking, by making fairly accurate guesses (based on presence of synchronized methods) whether or not a type is likely to be locked frequently.

I've added Space- and Time-Efficient Implementation of the Java Object Model to the big reading list.

@DanHeidinga wrote:

For lock nursery, I went through the J9 code recently for discussions around Project Lilliput and J9 basically puts a lockword in almost all objects by default these days. It uses the space that would otherwise be wasted due to alignment or minimum object size to hold the lock word. Arrays are a special case that always uses lock nursery though. To find the lockword, there's an offset stored in the j9class structure so it doesn't need to be at a fixed offset (ie: not in the header).

In this case how big is the lockword? Is it a thin lock that in turn can be inflated, or an index into a table, or...?

DanHeidinga Oct 1, 2021
Collaborator

In this case how big is the lockword? Is it a thin lock that in turn can be inflated, or an index into a table, or...?

In J9, it's one "slot" in the object. The "slot" size depends on whether the VM is 32- or 64-bit and whether compressed refs are in use. J9 uses a thin-lock and inflates it on contention to hold a pointer to the j9objectmonitor_t structure. There's a write up in [1] that's quite detailed but focused on lock reservation. The initial sections show the lockword bits and states.

[1] https://github.com/eclipse-openj9/openj9/blob/master/doc/compiler/ObjectLockword.md
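For anyone who hasn't looked at thin locks before, here is a generic sketch of the uncontended path. The word layout (owner ID in the upper bits, an 8-bit recursion count in the low bits) and the `ThinLock` name are assumptions for illustration and are not the actual J9 lockword layout, which is described in [1]. Inflation itself is not implemented; `tryLock` simply returns false where a real implementation would inflate.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical thin-lock word: 0 means unlocked, otherwise (ownerId << 8) | recursionCount.
// Thread IDs passed in must be positive so that a held word is never 0.
final class ThinLock {
    private static final int COUNT_BITS = 8;
    private static final long COUNT_MASK = (1L << COUNT_BITS) - 1;
    private final AtomicLong word = new AtomicLong();

    // Returns true if acquired thinly; false means the caller would have to inflate.
    boolean tryLock(long threadId) {
        long cur = word.get();
        if (cur == 0) {
            return word.compareAndSet(0, (threadId << COUNT_BITS) | 1);
        }
        if ((cur >>> COUNT_BITS) == threadId && (cur & COUNT_MASK) < COUNT_MASK) {
            word.set(cur + 1); // in this sketch only the owner writes a held word
            return true;
        }
        return false; // contended, or the recursion count would overflow
    }

    void unlock(long threadId) {
        long cur = word.get();
        if ((cur >>> COUNT_BITS) != threadId) {
            throw new IllegalMonitorStateException();
        }
        // Dropping the last recursion level clears the whole word.
        word.set((cur & COUNT_MASK) == 1 ? 0 : cur - 1);
    }
}
```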

dmlloyd Oct 1, 2021
Maintainer Author

In this case how big is the lockword? Is it a thin lock that in turn can be inflated, or an index into a table, or...?

In J9, it's one "slot" in the object. The "slot" size depends on whether the VM is 32- or 64-bit and whether compressed refs are in use. J9 uses a thin-lock and inflates it on contention to hold a pointer to the j9objectmonitor_t structure. There's a write up in [1] that's quite detailed but focused on lock reservation. The initial sections show the lockword bits and states.

[1] https://github.com/eclipse-openj9/openj9/blob/master/doc/compiler/ObjectLockword.md

As long as I have access to your brain 🙂 do you have any information on how often a lock stays thin vs having to be inflated in a typical application?

DanHeidinga Oct 1, 2021
Collaborator

As long as I have access to your brain 🙂 do you have any information on how often a lock stays thin vs having to be inflated in a typical application?

It's very application and access-pattern dependent. I don't have data for this but my gut says that most locks stay thin as they are only lightly contended between a small number of threads.

Inflation doesn't have to be permanent though. It's possible for the locks to transition (deflate) back to the thin-lock state. One way to do this is to have the GC reset them if the lock isn't owned when the object's being processed. Then pool the locks to save allocating new ones in the common case....

dmlloyd · 2021-10-04T14:51:39Z

dmlloyd
Oct 4, 2021
Maintainer Author

A note on POSIX thread mutexes. The pthread_mutex_t type does not have a way to determine whether the current thread owns the mutex even if the current thread is guaranteed to be a POSIX thread, nor is there a mechanism which is equivalent to or usable for interruption. This means that the actual locking mechanism will need additional fields in order to fulfill the complete contract of object monitors and will have to use the condition to implement the corresponding behaviors, making POSIX mutexes even less attractive as a potential solution for locking in terms of space requirements, in addition to being immovable in memory.
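To show what "additional fields" means in practice, here is a small sketch where a binary `Semaphore` stands in for an ownerless mutex such as pthread_mutex_t. The `OwnerTrackingMonitor` class and its field names are hypothetical, and the sketch covers only ownership and reentrancy, not interruption or wait/notify.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch: the raw primitive cannot say who holds it, so the
// monitor carries its own owner and recursion fields.
final class OwnerTrackingMonitor {
    private final Semaphore rawMutex = new Semaphore(1);
    private volatile Thread owner; // extra field: who holds the monitor
    private int recursion;         // extra field: reentrancy depth, touched only by the owner

    void enter() {
        Thread me = Thread.currentThread();
        if (owner == me) {         // a thread can only see itself here if it is the owner
            recursion++;
            return;
        }
        rawMutex.acquireUninterruptibly();
        owner = me;
        recursion = 1;
    }

    void exit() {
        if (owner != Thread.currentThread()) {
            throw new IllegalMonitorStateException();
        }
        if (--recursion == 0) {
            owner = null;          // clear ownership before releasing the raw mutex
            rawMutex.release();
        }
    }

    boolean isHeldByCurrentThread() {
        return owner == Thread.currentThread();
    }
}
```

Even in this reduced form the monitor needs a thread reference and a counter on top of the raw mutex, which adds to the 88 bytes already quoted for the mutex/condition pair.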

0 replies


Replies: 6 comments 16 replies
