Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patched eclipselink 2.7.6 Having blowups in ReadLockManager.removeReadLock(ReadLockManager.java:97) The implementation of CacheKey.equals(CacheKey.java:331 and CacheKey.HashCode are not safe #2114

Open
99sono opened this issue Apr 19, 2024 · 9 comments

Comments

@99sono
Copy link

99sono commented Apr 19, 2024

We’ve encountered a rare but critical bug in a project using Oracle-patched EclipseLink 2.7.6, which was modified to align with newer versions like 2.7.9. This bug is associated with the CacheKey class in EclipseLink, which lacks robust equals and hashCode methods. The issue arises when a CacheKey.key unexpectedly becomes null, leading to unforeseen complications.

Historically, we’ve addressed this bug in our internally maintained versions of EclipseLink, specifically versions 2.4.2, 2.6.4, 2.6.5, and 2.6.7. These versions, released with various WebLogic versions, have been patched to include concurrency manager improvements.

Currently, there’s a bugfix for these older versions of EclipseLink that isn’t present in any of the 2.7.X tags. This fix should likely be integrated into the official branches to prevent future occurrences.

Although this bug is infrequent, its potential impact necessitates proactive measures. Here’s a brief overview of the bug we’ve recently re-encountered.

The stack-trace is sliced out, on purpose, to only cover eclipselink code and obscure any private information.

The application became unresponsive with all manner of exceptions in eclipselink, but at the point of origin we could see exceptions with a stack trace like this:

Caused by: java.lang.NullPointerException
	at org.eclipse.persistence.internal.identitymaps.CacheKey.equals(CacheKey.java:331) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.identitymaps.CacheKey.equals(CacheKey.java:320) ~[eclipselink-2.7.6-with-patches.jar]
	at java.util.Vector.indexOf(Vector.java:431) ~[?:?]
	at java.util.Vector.indexOf(Vector.java:405) ~[?:?]
	at java.util.Vector.removeElement(Vector.java:668) ~[?:?]
	at java.util.Vector.remove(Vector.java:843) ~[?:?]
	at org.eclipse.persistence.internal.helper.ReadLockManager.removeReadLock(ReadLockManager.java:97) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.helper.ConcurrencyManager.removeReadLockFromReadLockManager(ConcurrencyManager.java:958) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseReadLock(ConcurrencyManager.java:691) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.identitymaps.CacheKey.releaseReadLock(CacheKey.java:482) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1103) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1014) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:211) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getFromIdentityMap(UnitOfWorkIdentityMapAccessor.java:139) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4093) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4016) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.mappings.CollectionMapping.buildElementUnitOfWorkClone(CollectionMapping.java:313) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.mappings.CollectionMapping.buildElementClone(CollectionMapping.java:326) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.queries.MapContainerPolicy.buildCloneForValue(MapContainerPolicy.java:237) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.queries.MapContainerPolicy.addNextValueFromIteratorInto(MapContainerPolicy.java:171) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.mappings.CollectionMapping.buildCloneForPartObject(CollectionMapping.java:228) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.indirection.UnitOfWorkQueryValueHolder.buildCloneFor(UnitOfWorkQueryValueHolder.java:62) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiateImpl(UnitOfWorkValueHolder.java:177) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.indirection.UnitOfWorkValueHolder.instantiate(UnitOfWorkValueHolder.java:238) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.internal.indirection.DatabaseValueHolder.getValue(DatabaseValueHolder.java:97) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.indirection.IndirectMap.buildDelegate(IndirectMap.java:129) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.indirection.IndirectMap.getDelegate(IndirectMap.java:416) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.indirection.IndirectMap$3.<init>(IndirectMap.java:1016) ~[eclipselink-2.7.6-with-patches.jar]
	at org.eclipse.persistence.indirection.IndirectMap.values(IndirectMap.java:1015) ~[eclipselink-2.7.6-with-patches.jar]
	at oucode.entity.SomeBusinessEntity.getSomeWhateverOneToManyRelationship(SomeBusinessEntity.java:316) ~[SomePrivateCode.jar:?]

The stack trace indicates that ReadLockManager.removeReadLock(ReadLockManager.java:97) encounters an unexpected scenario where it is possible for "null" to be added to ReadLockManager.
We do not know how it is possible, but facts are facts.
Crucial here, the CacheKey's equals method is not safeguarded against NullPointerException.
Specifically, the issue lies within this code segment:

https://github.com/eclipse-ee4j/eclipselink/blob/2.7.14/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/identitymaps/CacheKey.java

  public boolean equals(CacheKey key) {
        if (key.key == null || this.key == null) { <----- this is line 331, notice that if the input parameter key is null we will have a blow up.
            return false;
        }
        return this.key.equals(key.key);
    }

We have the patched method to look like this:

public boolean equals(CacheKey key) {
        
        if(key == null) { <-------- quite important to have there.  
            // Bugfix related to tracker TRK-25534
            // make sure the key we are comparing against is not the null value
            // NPE exception on equals was probably not at all that the key to compare against was null
            // but rather that this instance had as primary key the null value since we also saw blow ups in the has code method
            // but since it is impossible to tell what data was corrupted we make this code safe to compare against null as well
            return false;
        }

        if (this == key) {
            return true;
        }

        if (key.key == null) {
            return false;
        }

     // Bugfix related to tracker TRK-25534
        // we do not know how it is possible that a cache key existed in the system with primary key set to null
        // but we have seen a stack trace revealing a release acquired read lock piece locking blowing up
        // because the HackingEclipseReadLockManager.removeReadLock was dealing with a cache key with null primary key
        if(this.key != null) {
            return this.key.equals(key.key);
        } else {
            // we could just return false since the only way to cache keys have the same creation number is if they
            // are the same object instance which is already being checked above
            return this.getCacheKeyInstanceCreationNumber() == key.getCacheKeyInstanceCreationNumber(); <----- something special to be explained further bellow.
        }
    }

Notice that the method above will not blow up if the input parameter key is the null object.

The hashCode() method is another area that lacked safety. We have patched the implementation of hashCode() to address this issue. Previously, it was implemented as follows:

public int hashCode() {
    return this.key.hashCode();
}

The code above is prone to failure if this.key is null. While the exact cause is unknown, we are certain that this scenario can occur.

To enhance safety, we have revised the implementation to be more robust against null values. The new implementation is as follows:

@Override
    public int hashCode() {
        if(this.key != null) {
            return this.key.hashCode();
        } else {
            // relates to TRK-25534
            // We do not know how eclipselink managed to get itself into this situation but we have
            // seen null pointer exceptions taking place while trying to build a massive dump justification on account of
            // cache keys found in the read lock manager where the with primary key was set to null
            // so we need to make this code safe
            final int prime = 31;
            int result = 1;
            result = prime * result + (int) (cacheKeyInstanceCreationNumber ^ (cacheKeyInstanceCreationNumber >>> 32));
            return result;
        }
    }

We will provide an attachment containing the fully patched CacheKey object.

Additionally, we have made further enhancements to the metadata of the CacheKey object. Specifically, we have reintroduced a segment of code at the beginning of the class that was previously added to earlier versions of EclipseLink. This same code has now been incorporated into version 2.7.6 as follows:

 public static final AtomicLong CACHE_KEY_INSTANCE_CREATION_NUMBER_GENERATOR = new AtomicLong(1l);

    /**
     * A technical instanace creation number we use as "last-second" saving value
     * in case the cache key is found to be corrupted with Null Key value as see in {@code TRK-25534}
     * we use this value to ensure that the EQUALS and HASH CODE methods never blow up on account of a null corrupted primary key.
     */
    protected final long cacheKeyInstanceCreationNumber = CACHE_KEY_INSTANCE_CREATION_NUMBER_GENERATOR.getAndIncrement();

    /**
     * Value we want to set into the field
     * {@link #org.eclipse.persistence.internal.identitymaps.CacheKey.primaryKeyAtTimeOfCreation}
     * if the cache key instances is created with the no argument constructor.
     */
    protected static final String PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_NO_ARGUMENT_CONTRUCTOR_USED = "PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_NO_ARGUMENT_CONTRUCTOR_USED";
    /**
     *  Value we want to set into the field
     * {@link #org.eclipse.persistence.internal.identitymaps.CacheKey.primaryKeyAtTimeOfCreation}
     * if the cache key instances is created with a constructor that allows the primary key as an argument but the primary key is null..
     */
    protected static final String PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR = "PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR";
    /**
     * This field relates to an unexpected situation see in issue
     * {@code TRK-25534}
     * whereby the manipualted eclipselink code created to trace the acuisition and contention of concurrency manager
     * resources started blowing up during the release of read locks because for whatever reasons
     * the cache key bein released from reading managed to find itself with the primary key set to null,
     * THis is something we never saw before.
     * Even though we normally never see the key field set to null since there is no purpose to cache key without a key
     * the fact is that it hapened and methods like Equals and HashCode were completely broken.
     * One could not longer remove cache keys from Lists, Vectors or check if the cache keys existed in an hash map.
     *
     * So this field is meant to help us know what was itented purpose with which the key was originally created.
     *
     *
     */
    protected final Object primaryKeyAtTimeOfCreation;

The updated code introduces two new fields: primaryKeyAtTimeOfCreation and cacheKeyInstanceCreationNumber, along with associated metadata. This enhancement allows us to meticulously monitor the lifecycle of a cache key. We have added two immutable fields that capture the original key of the cache key at the moment of its creation.

Furthermore, we have implemented a unique instance creation number for each object. This distinctive number enables us to definitively determine in the equals and hashCode methods whether two objects are the same instance.

Below, you can observe how the new immutable field primaryKeyAtTimeOfCreation is assigned during the object's construction phase:

protected CacheKey(){
        // make sure we trace the primary key for which the cache key was created in an immutable field
        this.primaryKeyAtTimeOfCreation = PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_NO_ARGUMENT_CONTRUCTOR_USED;
    }

    public CacheKey(Object primaryKey) {
        this.key = primaryKey;
        // make sure we trace the primary key for which the cache key was created in an immutable field
        if(primaryKey == null) {
            this.primaryKeyAtTimeOfCreation = PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR;
        } else {
            this.primaryKeyAtTimeOfCreation = primaryKey;
        }
    }

    public CacheKey(Object primaryKey, Object object, Object lockValue) {
@@ -115,6 +163,13 @@ public class CacheKey extends ConcurrencyManager implements Cloneable {
        if (object != null) {
            setObject(object);
        }

        // make sure we trace the primary key for which the cache key was created in an immutable field
        if(primaryKey == null) {
            this.primaryKeyAtTimeOfCreation = PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR;
        } else {
            this.primaryKeyAtTimeOfCreation = primaryKey;
        }
    }

    public CacheKey(Object primaryKey, Object object, Object lockValue, long readTime, boolean isIsolated) {
@@ -126,11 +181,19 @@ public class CacheKey extends ConcurrencyManager implements Cloneable {
        }
        this.readTime = readTime;
        this.isIsolated = isIsolated;

     // make sure we trace the primary key for which the cache key was created in an immutable field
        if(primaryKey == null) {
            this.primaryKeyAtTimeOfCreation = PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR;
        } else {
            this.primaryKeyAtTimeOfCreation = primaryKey;
        }
    }
	

In summary, we have fortified the class by implementing the following improvements:

  1. Revised equals and hashCode Methods: We revisited the implementation of the equals and hashCode methods to ensure they behave as expected under normal circumstances and remain resilient when encountering null objects or keys.

  2. Enhanced CacheKey Metadata: We have augmented the CacheKey metadata to better trace its challenging lifecycle and to identify the origins of corrupted cache keys in extensive dumps after the key has become null. To achieve this, we introduced new fields to the CacheKey that are established at the time of construction and are immutable. Specifically, these fields are:

    • protected final Object primaryKeyAtTimeOfCreation;
    • protected final long cacheKeyInstanceCreationNumber;

Additionally, for the comprehensive dump analysis that reports on the status of the cache keys, we have added the method:

  • public Object getKeyEnsuredToBeNotNull() which returns this.key if it is not null, or this.getPrimaryKeyAtTimeOfCreation() if this.key is null.

This methodical approach ensures that the CacheKey class is robust and reliable throughout its lifecycle.

Our patching efforts extended beyond the CacheKey. In the past, we've encountered issues with massive dumps causing failures when executing the method:

src/main/java/org/eclipse/persistence/internal/helper/ConcurrencyUtil.java
public String createToStringExplainingOwnedCacheKey(ConcurrencyManager concurrencyManager)

This was due to the following line of code, which proved problematic:

Object primaryKey = cacheKey.getKey();

To resolve this, we've implemented a safer alternative:

Object primaryKey = cacheKey.getKeyEnsuredToBeNotNull();
Object primaryKeyAtTimeOfCreation = cacheKey.getPrimaryKeyAtTimeOfCreation();
long cacheKeyInstanceCreationNumber = cacheKey.getCacheKeyInstanceCreationNumber();

With these changes, the trace message now includes two additional pieces of metadata: cacheKeyInstanceCreationNumber and primaryKeyAtTimeOfCreation, enhancing the diagnostic capabilities as demonstrated in the code snippet above. This ensures a more robust and fail-safe approach in handling cache keys within our system.

     return TraceLocalization.buildMessage("concurrency_util_owned_cache_key_is_cache_key"
                    , new Object[] {
                            // 0, 1
                            canonicalName, primaryKey,
                            //2
                    String.valueOf(System.identityHashCode(cacheKeyObject)),

                    // 3,4
                    cacheKeyClass, String.valueOf(System.identityHashCode(cacheKey)),
                    // 5,6,7
                    activeThread, concurrencyManager.getNumberOfReaders(), concurrencyManagerId,
                    // 8
                    ConversionManager.getDefaultManager().convertObject(concurrencyManagerCreationDate, String.class).toString(),
                    // metadata of number of times the cache key suffered increases in number readers

                    // 9 metadata of number of times the cache key suffered increases in number readers
                    cacheKey.getTotalNumberOfKeysAcquiredForReading(),
                    // 10
                    cacheKey.getTotalNumberOfKeysReleasedForReading(),
                    // 11
                    cacheKey.getTotalNumberOfKeysReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero(),
                    concurrencyManager.getDepth()});
                    // 12
                    concurrencyManager.getDepth()
                    //
                    // parameters 13, 14
                    , cacheKeyInstanceCreatonNumber, primaryKeyAtTimeOfCreation

In summary, we strongly recommend adopting our enhancements. We plan to test this patched version of EclipseLink 2.7.6 on the current project to determine if their ongoing problems get resolved.

In the uploaded TRK-32315_patchedCode.zip
you will find our modified files ConcurrencyUtil.java, CacheKey.java and TraceLocalizationResource.java.

TRK-32315_patchedCode.zip

We will keep you posted on this issue, in particular when we make a breakthrough.

@99sono
Copy link
Author

99sono commented Apr 25, 2024

We received feedback today from the project team that our patching has improved things, and they no longer need to perform application server restarts due to these NPExceptions. Normally projects do not experience this problem, but this specific project for whatever reason was being heavily affected by this bug. It might still be too early to determine if this is the end of the story, but we have consistently patched this issue for every version from 2.4.2 to 2.6.7 in the past. The fixes we implemented there will certainly not cause harm, as they all enhance the robustness of the code. However, they do not explain the strange phenomenon of cache keys becoming null in the primary key. Unfortunately, we lack the expert knowledge and experience in EclipseLink to provide such an explanation.

@99sono
Copy link
Author

99sono commented Apr 29, 2024

I wanted to provide an update regarding our ongoing efforts to address the issue. Unfortunately, our fix is not yet complete.

Recently, the project encountered the following stack trace while utilizing our latest patch:

Caused by: java.lang.NullPointerException
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createStringWithSummaryOfReadLocksAcquiredByThread(ConcurrencyUtil.java:891) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createInformationAboutAllResourcesAcquiredAndDeferredByThread(ConcurrencyUtil.java:1263) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createInformationAboutAllResourcesAcquiredAndDeferredByAllThreads(ConcurrencyUtil.java:1152) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationStep02(ConcurrencyUtil.java:627) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationStep01(ConcurrencyUtil.java:590) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationIfAppropriate(ConcurrencyUtil.java:513) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.determineIfReleaseDeferredLockAppearsToBeDeadLocked(ConcurrencyUtil.java:178) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone(WriteLockManager.java:171) ~[eclipselink_2_7_6_hacked.jar]

Despite our efforts, it remains possible to encounter NPExceptions while attempting to create the massive dump of information.

I will keep you informed of any progress as we continue to patch and improve the situation.

@99sono
Copy link
Author

99sono commented Apr 29, 2024

I am quite certain that I understand why our ConcurrencyUtil logic encountered issues within the org.eclipse.persistence.internal.helper.ConcurrencyUtil.createStringWithSummaryOfReadLocksAcquiredByThread(ReadLockManager, String) method. The reason is straightforward, in my opinion. It stems from the fact that the data structure we are iterating over is not a deep copy. Specifically, the read lock manager, org.eclipse.persistence.internal.helper.ReadLockManager.getMapThreadToReadLockAcquisitionMetadata(), returns an unmodifiable map. However, the underlying map and its list values can be modified by concurrent threads behind the scenes. To address this, the org.eclipse.persistence.internal.helper.ReadLockManager.getMapThreadToReadLockAcquisitionMetadata() method should return a deep clone of the map and its contained entries. This adjustment now makes much more sense. I will show the improved org.eclipse.persistence.internal.helper.ReadLockManager.getMapThreadToReadLockAcquisitionMetadata() in a second.

@99sono
Copy link
Author

99sono commented Apr 29, 2024

The following is what unbkreable code looks like:

   /** Getter for {@link #mapThreadToReadLockAcquisitionMetadata} returns a deep clone */
    public synchronized Map<Long, List<ReadLockAcquisitionMetadata>> getMapThreadToReadLockAcquisitionMetadata() {
        // We cannot simply return an unmodifiable map here. There are two reasons for this:
        // 1. The code consuming this unmodifiable map might be surprised by changes to the map itself caused by this
        // read lock manager.
        // If threads continue to acquire and release read locks, it could impact this object.
        // 2. Additionally, the list values contained in the map could also be affected by threads that are reading and
        // releasing read locks.
        // Our approach should be to provide the ConcurrencyUtil with a safe deep clone of the data structure for its
        // massive dumps.

        // (a) Let's start by creating an independent result map that the caller can safely iterate over.
        Map<Long, List<ReadLockAcquisitionMetadata>> resultMap = new HashMap<>();

        // (b) depp clone the data strcuture
        for (Entry<Long, List<ReadLockAcquisitionMetadata>> currentEntry : mapThreadToReadLockAcquisitionMetadata
                .entrySet()) {
            ArrayList<ReadLockAcquisitionMetadata> deepCopyOfCurrentListOfMetadata = new ArrayList<>(currentEntry.getValue());
            resultMap.put(currentEntry.getKey(), unmodifiableList(deepCopyOfCurrentListOfMetadata) );
        }

        // (c) even if our result map is deep clone of our internal sate
        // we still want to return it as unmodifable so that callers do not have the illusion
        // they are able to hack the state of the read lock manager from the outside.
        // If any code tris to manipulate the returned clone they should get a blow up to be dispelled of any illiusions
        return Collections.unmodifiableMap(resultMap);
    }

The modification to the org.eclipse.persistence.internal.helper.ReadLockManager.getMapThreadToReadLockAcquisitionMetadata() method is crucial. It ensures that the org.eclipse.persistence.internal.helper.ConcurrencyUtil.createStringWithSummaryOfReadLocksAcquiredByThread(ReadLockManager, String) function is not affected by the fact that, while generating the massive dump, it might encounter threads manipulating the hash map it is iterating over behind the scenes.

I will still strength the code of both the read lock manager and the ConcurrencyUtil with further level of paranoia/safety-check code to ensure I never see again a NPexception in this code area.

I will provide the patch later today.

@99sono
Copy link
Author

99sono commented Apr 29, 2024

In this post, I am presenting the second patch for this bug, which builds upon the first patch previously provided.

The associated commit message for these changes is as follows:

TRK-32315: Addressing Concurrency Issues in ReadLockManager

Continuing our efforts to resolve issue #2114, we have identified the root cause of the bug. The crux of the matter lies in the `ReadLockManager` providing the `ConcurrencyUtil` access to its state without performing deep copies.

Specifically, the `ReadLockManager` exposes its state through unmodifiable hash maps and lists. However, beneath these unmodifiable wrappers, the true data structures of the class remain accessible.

During the process of generating massive dumps, the `ConcurrencyUtil` iterates over metadata. Simultaneously, the read lock manager's data structures could undergo additions and removals of cache keys. These lists behind the scenes are not inherently thread-safe since they are always accessed within synchronized methods. However, an exception occurs when the concurrency util itself iterates over them, leading to mysterious NULL entries in the lists.

Fundamentally, the code itself does not logically allow NULL values to be entered into the read lock manager. However, the concurrency between the read lock manager's internal logic and the massive dump generation could inadvertently corrupt the data structure, resulting in this phenomenon.

0001-TRK-32315-Addressing-Concurrency-Issues-in-ReadLockM.patch

and the literal files are as follows
issue_2114_commit_f6dca865.zip

@rfelcman
Copy link
Contributor

rfelcman commented May 2, 2024

About NPE primaryKey

    public CacheKey(Object primaryKey) {
        this.key = primaryKey;
        // make sure we trace the primary key for which the cache key was created in an immutable field
        if(primaryKey == null) {
            this.primaryKeyAtTimeOfCreation = PRIMARY_KEY_AT_TIME_OF_CONSTRUCTION_NULL_A_NULL_PRIMARY_KEY_WAS_PASSED_IN_TO_OBJECT_CONSTRUCTOR;
        } else {
            this.primaryKeyAtTimeOfCreation = primaryKey;
        }
    }

I'm not sure about this approach, because we silently accept incorrect value null. At least I'll add some logging there (WARNING level with some small message and FINEST with stack trace).
But my question is: Have You got some idea why it happens? Or which conditions should trigger it?

@99sono
Copy link
Author

99sono commented May 3, 2024

Hi,

Regarding the issue of CacheKey being null, I I have no explanation.
It appears to be a rather rare condition, as only one project in version 2.7.6 has encountered this problem. We also faced a similar situation in version 2.6.5 (e.g. I could be wrong on the older version) . It is not a common pattern across projects.

Now, on the topic of the NPE working with data from org.eclipse.persistence.internal.helper.ReadLockManager.mapThreadToReadLockAcquisitionMetadata. This list can potentially lead to either a NullPointerException (NPException) in the ConcurrencyUtil when logging a massive dump, as seen in the stack trace chunk above, or issues within the ReadLockManager itself while adding or removing cache keys.

Regarding this specific matter:

  • Illusion of Null Values: Contrary to appearances, I believe the list does not contain any null values. The metadata add logic never introduced any null metadata—I've triple-checked this. The illusion arises from the ReadLockManager's attempt to secure its state by returning an unmodifiable map to the outside world. Unfortunately, this approach falls short because multiple threads could be simultaneously adding and releasing read locks while a massive dump iterates over the affected lists due to additions and removals. In essence, the list is not thread-safe, and it's essential to avoid looping over its elements while another thread accesses it for adding or removing elements.

So, the situation regarding the read lock metadata is quite clear.
The org.eclipse.persistence.internal.helper.ReadLockManager.getMapThreadToReadLockAcquisitionMetadata() should have been protecting its internal state much safer. Those ConccurrencyUtil was essentially handling lists it should never had direct access to.

As for updates, I am awaiting feedback from the project. They have been operating for quite some time before encountering this issue. It may take several weeks to receive confirmation that improvements are indeed underway.

Thank you.

@99sono
Copy link
Author

99sono commented May 3, 2024

Let me summarize the types of null pointer exceptions that are patch should be addressing.

This pattern A:

Caused by: java.lang.NullPointerException
at org.eclipse.persistence.internal.identitymaps.CacheKey.equals(CacheKey.java:331) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.identitymaps.CacheKey.equals(CacheKey.java:320) ~[eclipselink_2_7_6_hacked.jar]
at java.util.Vector.indexOf(Vector.java:431) ~[?:?]
at java.util.Vector.indexOf(Vector.java:405) ~[?:?]
at java.util.Vector.removeElement(Vector.java:668) ~[?:?]
at java.util.Vector.remove(Vector.java:843) ~[?:?]
at org.eclipse.persistence.internal.helper.ReadLockManager.removeReadLock(ReadLockManager.java:97) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyManager.removeReadLockFromReadLockManager(ConcurrencyManager.java:958) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseReadLock(ConcurrencyManager.java:691) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.identitymaps.CacheKey.releaseReadLock(CacheKey.java:482) ~[eclipselink_2_7_6_hacked.jar]

This pattern B:

Caused by: java.lang.NullPointerException
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createStringWithSummaryOfReadLocksAcquiredByThread(ConcurrencyUtil.java:891) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createInformationAboutAllResourcesAcquiredAndDeferredByThread(ConcurrencyUtil.java:1263) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.createInformationAboutAllResourcesAcquiredAndDeferredByAllThreads(ConcurrencyUtil.java:1152) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationStep02(ConcurrencyUtil.java:627) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationStep01(ConcurrencyUtil.java:590) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.dumpConcurrencyManagerInformationIfAppropriate(ConcurrencyUtil.java:513) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyUtil.determineIfReleaseDeferredLockAppearsToBeDeadLocked(ConcurrencyUtil.java:178) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.WriteLockManager.acquireLocksForClone(WriteLockManager.java:171) ~[eclipselink_2_7_6_hacked.jar]

This pattern C:

Caused by: java.lang.NullPointerException
at org.eclipse.persistence.internal.helper.ReadLockManager.removeReadLock(ReadLockManager.java:84) ~[eclipselink_2_7_6_hacked.jar]

at org.eclipse.persistence.internal.helper.ConcurrencyManager.removeReadLockFromReadLockManager(ConcurrencyManager.java:958) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseReadLock(ConcurrencyManager.java:691) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.identitymaps.CacheKey.releaseReadLock(CacheKey.java:482) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1103) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.cloneAndRegisterObject(UnitOfWorkImpl.java:1014) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getAndCloneCacheKeyFromParent(UnitOfWorkIdentityMapAccessor.java:211) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkIdentityMapAccessor.getFromIdentityMap(UnitOfWorkIdentityMapAccessor.java:139) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4093) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.registerExistingObject(UnitOfWorkImpl.java:4016) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.mappings.CollectionMapping.buildElementUnitOfWorkClone(CollectionMapping.java:313) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.mappings.CollectionMapping.buildElementClone(CollectionMapping.java:326) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.internal.queries.ContainerPolicy.addNextValueFromIteratorInto(ContainerPolicy.java:217) ~[eclipselink_2_7_6_hacked.jar]
at org.eclipse.persistence.mappings.CollectionMapping.buildCloneForPartObject(CollectionMapping.java:228) ~[eclipselink_2_7_6_hacked.jar]

I do not have any NPExceptin for the cache key hashcode method.
But also that mehod should be made bullet-proof.

Pattern B and C are most likely heavily married to one another, it is that illusion of a null entry that I explained above.
Pattern A of how cache key can go Null is a mystery.
For the Pattern A I believe it is important to add that extra metadata to know what the CacheKey was like at its time of creation.

Thanks.

@99sono
Copy link
Author

99sono commented May 7, 2024

The project team tells me they are happy with the patched eclipselink, for the time being at least.
They have not encountered issues of this nature since 2024.04.29.
If the issue is not yet definitely resolved, you will hear from me again on this topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants