Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New eclipselink dead lock scenario discovered the logic of the sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal is not safe when it calls upon AbstractSession.getCacheKeyFromTargetSessionForMerge #2094

Open
99sono opened this issue Mar 13, 2024 · 5 comments

Comments

@99sono
Copy link

99sono commented Mar 13, 2024

We have discovered a completely new dead-lock pattern never encountered before.
Until now, we have been conviced that with the persistence.xml settings

 <!-- Object Building MUST ALWAYS use deferred locks. If cannot acquire it must let it go and continue forward -->
            <!-- waiting for whoever acquired the lock to arrive at the releaseDeferredLocks moment as well. -->
            <!-- The threads committing changes they automatically let go of the write locks when they fail to acquire them all -->
            <property name="eclipselink.cache.query-force-deferred-locks" value="true"/>
            <property name="eclipselink.cache.type.default" value="SoftWeak" />


            <!-- Other properties that we can use -->
            <!-- The properties we typically configure as system properties: -->
            <!-- org.eclipse.persistence.config.SystemProperties.CONCURRENCY_MANAGER_ACQUIRE_WAIT_TIME -->
            <!-- For example set JAVA_OPTIONS=%JAVA_OPTIONS% -Declipselink.concurrency.manager.waittime=10 -->

            <!-- They typically are also specified as pseristence unit properties: -->
            <!-- org.eclipse.persistence.config.PersistenceUnitProperties.CONCURRENCY_MANAGER_ACQUIRE_WAIT_TIME -->

            <!-- Persistence.xml properties that we typically configure with system property. -->
            <property name="eclipselink.concurrency.manager.waittime" value="10"/>
            <property name="eclipselink.concurrency.manager.maxsleeptime" value="10000"/>
            <property name="eclipselink.concurrency.manager.maxfrequencytodumptinymessage" value="5000"/>
            <property name="eclipselink.concurrency.manager.maxfrequencytodumpmassivemessage" value="20000"/>
            <property name="eclipselink.concurrency.manager.allow.interruptedexception" value="false"/>
            <property name="eclipselink.concurrency.manager.allow.exception" value="false"/>
            <property name="eclipselink.concurrency.manager.allow.concurrency.exception" value="false"/>
            <property name="eclipselink.concurrency.manager.allow.readlockstacktrace" value="false"/>

            <property name="eclipselink.concurrency.manager.object.building.semaphore" value="true"/>
            <property name="eclipselink.concurrency.manager.write.lock.manager.semaphore" value="true"/>
            <property name="eclipselink.concurrency.manager.object.building.no.threads" value="12"/>
            <property name="eclipselink.concurrency.manager.write.lock.manager.no.threads" value="2"/>
            <property name="eclipselink.concurrency.semaphore.max.time.permit" value="2000"/>
            <property name="eclipselink.concurrency.semaphore.log.timeout" value="2000"/>


And all the development to diagnose and prevent dead-locks that we would be out of the woods.
For the most part, this appears to be the case, but we have today discover a new scenario where dead-locks can indeed happen in the concurrency manager cache.

In this specific dead-lock, we have thread committing a transaction.
Thread Name: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)'

This thread is working on commmiting a business transaction and it is in commit step where it politely waits for th release deferred locks algorithm to comlete.

Here is what it looks like:

"[ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)'" 
   java.lang.Thread.State: TIMED_WAITING
        at java.base@11.0.16/java.lang.Thread.sleep(Native Method)
        at org.eclipse.persistence.internal.helper.ConcurrencyManager.releaseDeferredLock(ConcurrencyManager.java:653)
        at org.eclipse.persistence.internal.identitymaps.CacheKey.releaseDeferredLock(CacheKey.java:472)
        at org.eclipse.persistence.internal.helper.WriteLockManager.releaseAllAcquiredLocks(WriteLockManager.java:609)
        at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.mergeChangesIntoParent(UnitOfWorkImpl.java:3424)
        at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.mergeChangesIntoParent(RepeatableWriteUnitOfWork.java:381)
        at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.mergeClonesAfterCompletion(UnitOfWorkImpl.java:3527)
        at org.eclipse.persistence.transaction.AbstractSynchronizationListener.afterCompletion(AbstractSynchronizationListener.java:219)
        at org.eclipse.persistence.transaction.JTASynchronizationListener.afterCompletion(JTASynchronizationListener.java:81)
        at weblogic.transaction.internal.ServerSCInfo.doAfterCompletion(ServerSCInfo.java:1145)
        at weblogic.transaction.internal.ServerSCInfo.callAfterCompletionsForTier(ServerSCInfo.java:1121)
        at weblogic.transaction.internal.ServerSCInfo.callAfterCompletions(ServerSCInfo.java:1093)
        at weblogic.transaction.internal.ServerTransactionImpl.callAfterCompletions(ServerTransactionImpl.java:3773)
        at weblogic.transaction.internal.ServerTransactionImpl.afterCommittedStateHousekeeping(ServerTransactionImpl.java:3662)
        at weblogic.transaction.internal.ServerTransactionImpl.setCommitted(ServerTransactionImpl.java:3716)
        at weblogic.transaction.internal.ServerTransactionImpl.globalRetryCommit(ServerTransactionImpl.java:3432)
        at weblogic.transaction.internal.ServerTransactionImpl.globalCommit(ServerTransactionImpl.java:3327)
        at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:322)
        at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:270)
        at weblogic.ejb.container.internal.MDListener.execute(MDListener.java:465)
        at weblogic.ejb.container.internal.MDListener.transactionalOnMessage(MDListener.java:361)
        at weblogic.ejb.container.internal.MDListener.onMessage(MDListener.java:297)
        at weblogic.jms.client.JMSSession.onMessage(JMSSession.java:5141)
        at weblogic.jms.client.JMSSession.execute(JMSSession.java:4795)
        at weblogic.jms.client.JMSSession.executeMessage(JMSSession.java:4190)
        at weblogic.jms.client.JMSSession.access$000(JMSSession.java:132)
        at weblogic.jms.client.JMSSession$UseForRunnable.run(JMSSession.java:5666)
        at weblogic.work.SelfTuningWorkManagerImpl$WorkAdapterImpl.run(SelfTuningWorkManagerImpl.java:677)
        at weblogic.invocation.ComponentInvocationContextManager._runAs(ComponentInvocationContextManager.java:352)
        at weblogic.invocation.ComponentInvocationContextManager.runAs(ComponentInvocationContextManager.java:337)
        at weblogic.work.LivePartitionUtility.doRunWorkUnderContext(LivePartitionUtility.java:57)
        at weblogic.work.PartitionUtility.runWorkUnderContext(PartitionUtility.java:41)
        at weblogic.work.SelfTuningWorkManagerImpl.runWorkUnderContext(SelfTuningWorkManagerImpl.java:651)
        at weblogic.work.ExecuteThread.execute(ExecuteThread.java:420)
        at weblogic.work.ExecuteThread.run(ExecuteThread.java:360)

So far so good.
This thread 444 is doing what ti is supposed to do.

This thread is essentially getting burned/waiting for the thread 137 to finish its job.

Summary of deferred locks (could not be acquired and cause thread to wait for object building to complete) of thread [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)' Listing of all DEFERRED Locks.
Thread Name: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)' 
2 Deferred locks.
Deferred lock nr: 0 , Deferred cache key: --- CacheKey  (bla.bla.Product):  (primaryKey: 42,838) (object: Product[owner=BlaBla) (object hash code: 1477607487) (cacheKeyClass: org.eclipse.persistence.internal.identitymaps.HardCacheWeakIdentityMap.ReferenceCacheKey) (cacheKey hash code: 1464266971) (current cache key owner/activeThread: [ACTIVE] ExecuteThread: '137' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,898,140) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.77)  (totalNumberOfTimeCacheKeyAcquiredForReading:  104)  (totalNumberOfTimeCacheKeyReleasedForReading:  104)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 1 , Deferred cache key: --- CacheKey  (bla.bla.Product):  (primaryKey: 42,838) (object: Product[owner=BlaBla (object hash code: 1477607487) (cacheKeyClass: org.eclipse.persistence.internal.identitymaps.HardCacheWeakIdentityMap.ReferenceCacheKey) (cacheKey hash code: 1464266971) (current cache key owner/activeThread: [ACTIVE] ExecuteThread: '137' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,898,140) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.77)  (totalNumberOfTimeCacheKeyAcquiredForReading:  104)  (totalNumberOfTimeCacheKeyReleasedForReading:  104)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---


So up until this point, we can say nothing to worry about.
The system would not be on a route dead lock.

But now comes the trouble/dead lock.

We now move over to our thread 137.
Out thread 137, has so far in its business logic found itself deferring on a total of 11 cache key owned by thread 444.
Here is an obfuscated listé


Summary of deferred locks (could not be acquired and cause thread to wait for object building to complete) of thread [ACTIVE] ExecuteThread: '137' for queue: 'weblogic.kernel.Default (self-tuning)' Listing of all DEFERRED Locks.
Thread Name: [ACTIVE] ExecuteThread: '137' for queue: 'weblogic.kernel.Default (self-tuning)' 
11 Deferred locks.
Deferred lock nr: 0 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,591) (object: OjeacBlaBla01 [Whatever (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,165) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.611)  (totalNumberOfTimeCacheKeyAcquiredForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 1 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,591) (object: OjeacBlaBla01 [Whatever (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,165) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.611)  (totalNumberOfTimeCacheKeyAcquiredForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 2 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,591) (object: OjeacBlaBla01 [Whatever (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,165) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.611)  (totalNumberOfTimeCacheKeyAcquiredForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReading:  53)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 3 , Deferred cache key: --- CacheKey  (bla.blaProduct):  (primaryKey: 31,561) (object: Product[Irrelevant01=IrrelevaacheKey) (cacheKey hash code: 225549932) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,159) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.611)  (totalNumberOfTimeCacheKeyAcquiredForReading:  107)  (totalNumberOfTimeCacheKeyReleasedForReading:  107)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 4 , Deferred cache key: --- CacheKey  (bla.blaProduct):  (primaryKey: 31,561) (object: Product[Irrelevant01=IrrelevaacheKey) (cacheKey hash code: 225549932) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,159) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.611)  (totalNumberOfTimeCacheKeyAcquiredForReading:  107)  (totalNumberOfTimeCacheKeyReleasedForReading:  107)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 5 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,600) (object: OjeacBlaBla01 [510/818419628478) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,190) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.612)  (totalNumberOfTimeCacheKeyAcquiredForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 6 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,600) (object: OjeacBlaBla01 [510/818419628478) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,190) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.612)  (totalNumberOfTimeCacheKeyAcquiredForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 7 , Deferred cache key: --- CacheKey  (bla.blaOjeacBlaBla01):  (primaryKey: 31,600) (object: OjeacBlaBla01 [510/818419628478) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,190) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.612)  (totalNumberOfTimeCacheKeyAcquiredForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReading:  69)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 8 , Deferred cache key: --- CacheKey  (bla.blaProduct):  (primaryKey: 31,570) (object: Product[Irrelevant01=IrrelevaCacheKey) (cacheKey hash code: 1249278418) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,189) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.612)  (totalNumberOfTimeCacheKeyAcquiredForReading:  158)  (totalNumberOfTimeCacheKeyReleasedForReading:  158)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 9 , Deferred cache key: --- CacheKey  (bla.blaProduct):  (primaryKey: 31,570) (object: Product[Irrelevant01=IrrelevaCacheKey) (cacheKey hash code: 1249278418) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 81,890,189) (concurrencyManagerCreationDate: 2024-03-11 15:03:33.612)  (totalNumberOfTimeCacheKeyAcquiredForReading:  158)  (totalNumberOfTimeCacheKeyReleasedForReading:  158)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---

Deferred lock nr: 10 , Deferred cache key: --- CacheKey  (bla.blaProductAssignment):  (primaryKey: 39,873) (object: ProductAssignment d=7923694-0012]]) (object hash code: 1249549488) (cacheKeyClass: org.eclipse.persistence.internal.identitymaps.HardCacheWeakIdentityMap.ReferenceCacheKey) (cacheKey hash code: 1194157559) (current cache key Irrelevant01/activeThread: [ACTIVE] ExecuteThread: '444' for queue: 'weblogic.kernel.Default (self-tuning)') (getNumberOfReaders: 0)  (concurrencyManagerId: 54,908,544) (concurrencyManagerCreationDate: 2024-03-11 14:54:36.199)  (totalNumberOfTimeCacheKeyAcquiredForReading:  0)  (totalNumberOfTimeCacheKeyReleasedForReading:  0)  (totalNumberOfTimeCacheKeyReleasedForReadingBlewUpExceptionDueToCacheKeyHavingReachedCounterZero:  0)  (depth: 1) ---
 

And up until here we do not necessarily have a problem.
The thread 444 is waiting for cache keys of thread 137 to be released or otherwise for thread 137 to also evolve to releaseDeferredLocks.
And thread 137 wants eleven cache keys that are being actively owned by the thread 444 but has used teh defer approach on the cach key.

So far, one would say "nothing out of the ordinary".

But then it comes the problem.
What is thread 137 actually doing right now?

It is doing the following:


"[ACTIVE] ExecuteThread: '137' for queue: 'weblogic.kernel.Default (self-tuning)'" 
   java.lang.Thread.State: WAITING
        at java.base@11.0.16/java.lang.Object.wait(Native Method)
        at java.base@11.0.16/java.lang.Object.wait(Object.java:328)
        at org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMerge(AbstractSession.java:2858) 
		<-----  note, our thread 137 thread does not put itself passively  waiting and releasing 
		----- all of its locks. It aggressively defers on the lock it wishes to get and Waits forever without a timeout 
		----- this sort of logic seems to heavily contrast with the type of code we find in the
		----- org.eclipse.persistence.internal.helper.WriteLockManager.acquireRequiredLocksInternal(MergeManager, UnitOfWorkChangeSet)
		
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:782)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.CollectionMapping.mergeIntoObject(CollectionMapping.java:1643)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:825)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.ObjectReferenceMapping.mergeIntoObject(ObjectReferenceMapping.java:524)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:825)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.ObjectReferenceMapping.mergeIntoObject(ObjectReferenceMapping.java:524)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:825)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.CollectionMapping.mergeIntoObject(CollectionMapping.java:1643)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:825)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.CollectionMapping.mergeIntoObject(CollectionMapping.java:1643)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:847)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.CollectionMapping.mergeIntoObject(CollectionMapping.java:1643)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:847)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.mappings.CollectionMapping.mergeIntoObject(CollectionMapping.java:1643)
        at org.eclipse.persistence.internal.descriptors.ObjectBuilder.mergeIntoObject(ObjectBuilder.java:4202)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:847)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:716)
        at org.eclipse.persistence.internal.sessions.MergeManager.mergeChanges(MergeManager.java:320)
        at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.mergeChangesIntoParent(UnitOfWorkImpl.java:3394)
        at org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork.mergeChangesIntoParent(RepeatableWriteUnitOfWork.java:381)
        at org.eclipse.persistence.internal.sessions.UnitOfWorkImpl.mergeClonesAfterCompletion(UnitOfWorkImpl.java:3527)
        at org.eclipse.persistence.transaction.AbstractSynchronizationListener.afterCompletion(AbstractSynchronizationListener.java:219)
        at org.eclipse.persistence.transaction.JTASynchronizationListener.afterCompletion(JTASynchronizationListener.java:81)
        at weblogic.transaction.internal.ServerSCInfo.doAfterCompletion(ServerSCInfo.java:1145)
        at weblogic.transaction.internal.ServerSCInfo.callAfterCompletionsForTier(ServerSCInfo.java:1121)
        at weblogic.transaction.internal.ServerSCInfo.callAfterCompletions(ServerSCInfo.java:1093)
        at weblogic.transaction.internal.ServerTransactionImpl.callAfterCompletions(ServerTransactionImpl.java:3773)
        at weblogic.transaction.internal.ServerTransactionImpl.afterCommittedStateHousekeeping(ServerTransactionImpl.java:3662)
        at weblogic.transaction.internal.ServerTransactionImpl.setCommitted(ServerTransactionImpl.java:3716)
        at weblogic.transaction.internal.ServerTransactionImpl.globalRetryCommit(ServerTransactionImpl.java:3432)
        at weblogic.transaction.internal.ServerTransactionImpl.globalCommit(ServerTransactionImpl.java:3327)
        at weblogic.transaction.internal.ServerTransactionImpl.internalCommit(ServerTransactionImpl.java:322)
        at weblogic.transaction.internal.ServerTransactionImpl.commit(ServerTransactionImpl.java:270)
        at weblogic.ejb.container.internal.BaseLocalObject.postInvoke1(BaseLocalObject.java:316)
        at weblogic.ejb.container.internal.BaseLocalObject.__WL_postInvokeTxRetry(BaseLocalObject.java:195)
        at weblogic.ejb.container.internal.SessionLocalMethodInvoker.invoke(SessionLocalMethodInvoker.java:46)
        at bla.bla.ProductScreenFacade_87crhu_NoIntfViewImpl.updateProduct(Unknown Source)

The above stack trace shows an area of code that we have never manipulated to avoid dead locks.
It is the very first time i encoutner this specific dead lock pattern.

The thread 137 is not behaving properly.

It essentially as part of the merge process went into:

org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(Object, ObjectChangeSet, ClassDescriptor, AbstractSession, UnitOfWorkImpl)


  // If the cache key was missing check the cache.
        // This occurs in the old merge, or if a new or changed object references an existing object that needs to be merged.
        if (cacheKey == null) {
            // #2 - old merge, #3 - nested, #6 referenced objects, #7 detached objects, #8 protected cache
            cacheKey = targetSession.getCacheKeyFromTargetSessionForMerge(implementation, builder, descriptor, this); <------- It went in here
            if (cacheKey != null){
                original = cacheKey.getObject();
            }
        }

Following that, it went into this method:

org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMerge(Object, ObjectBuilder, ClassDescriptor, MergeManager)

protected CacheKey getCacheKeyFromTargetSessionForMerge(Object implementation, ObjectBuilder builder, ClassDescriptor descriptor, MergeManager mergeManager){
     Object original = null;
      Object primaryKey = builder.extractPrimaryKeyFromObject(implementation, this, true);
      if (primaryKey == null) {
          return null;
      }
      CacheKey cacheKey = null;
      if (mergeManager == null){
          cacheKey = getIdentityMapAccessorInstance().getCacheKeyForObject(primaryKey, implementation.getClass(), descriptor, true);
          if (cacheKey != null){
              cacheKey.checkReadLock();
          }
          return cacheKey;
      }

      cacheKey = getIdentityMapAccessorInstance().getCacheKeyForObject(primaryKey, implementation.getClass(), descriptor, true);
      if (cacheKey != null) {
          if (cacheKey.acquireReadLockNoWait()) {
              original = cacheKey.getObject();
              cacheKey.releaseReadLock();
          } else {
              if (!mergeManager.isTransitionedToDeferredLocks()) {
                  getIdentityMapAccessorInstance().getWriteLockManager().transitionToDeferredLocks(mergeManager);
              }
              cacheKey.acquireDeferredLock(); <------- (a) First, It defers on the lock it wishes to acquire.
---------- and then 
              original = cacheKey.getObject();
              if (original == null) {
                  synchronized (cacheKey) {
                      if (cacheKey.isAcquired()) {
                          try {
                              cacheKey.wait(); <---------- (b) It now greadely decides, I am quite convinced i am guaranteed
--------- To get what I need. I will just wait Forever, take however long it must take for the cache key owned by thread 444
-------- To be released to me.
-------- Well, the thread 444 is right now wanting two cache keys owned by the thread 137 to be released
------- So this is never going to happen. This wait is now effectively eternal
------- And it contrasts deeply with the code of  
------- org.eclipse.persistence.internal.helper.WriteLockManager.acquireRequiredLocksInternal(MergeManager, UnitOfWorkChangeSet)
------- Whereby the write lock manager when it is fighting to acquire the keys it needs, if it fails to acquire lets go of all of its acquired keys.

                          } catch (InterruptedException e) {
                              //ignore and return
                          }
                      }
                      original = cacheKey.getObject();
                  }
              }
              cacheKey.releaseDeferredLock();
          }
      }
      return cacheKey;
   }

I am not sure what is the right patch for this.

The very first obvious thing is that Eternal waits without a timeout are no go.

The second point I would have is that if feasible the logic of this method should somehow memic the logic we can find i the
org.eclipse.persistence.internal.helper.WriteLockManager.acquireRequiredLocksInternal(MergeManager, UnitOfWorkChangeSet)

The acquire required locks logic put the thread wanting its locks for writing behaving nicely.
If the thread is not getting what it wants that code wil then do

   // failed to acquire lock, release all acquired
                            // locks and place thread on waiting list
                            releaseAllAcquiredLocks(mergeManager);
                            // get cacheKey
@99sono
Copy link
Author

99sono commented Mar 13, 2024

Note:

This bug will be existing on every version of eclipse link 2.7.6 and above.

The version of eclipselink being used in this tracker is essential a 2.7.6 with patches.

Therefore, the closest eclipselink version to look at in terms of the stack traces presented is the 2.7.9:
https://github.com/eclipse-ee4j/eclipselink/blob/2.7.9/foundation/org.eclipse.persistence.core/src/org/eclipse/persistence/internal/helper/ConcurrencyUtil.java

That being said, I believe the core points of the stack traces of eclipselink for the interpreation of the analysis are all layed our bare in the analysis above.

If we need a one liner summary for this tracker is that:

We have a new pattern for eclipselnk deadlocks, whereby the org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(Object, ObjectChangeSet, ClassDescriptor, AbstractSession, UnitOfWorkImpl)
via the API cal to org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMerge(Object, ObjectBuilder, ClassDescriptor, MergeManager)

goes into a "all-or-nothing" agressive attempt to acquire cache key with "deferral + followup eternal wait".

This all-or-nothing bet that the thread will just eventually stop waiting is a losing bet, there is very real risk that somebody else is holding onto keys of the agressive thread, and if anyone is holding onto keys of the agreesive 137 thread, the it is game over.

Our thread 444 did everything correct.
Patiently trying to release its deferred locks.
The thread 137 did everyrthing wrong, not letting go of its cache keys and wanting more.

Not really a one liner, but it makes the picture of the dead lock super clear.

@99sono
Copy link
Author

99sono commented Apr 23, 2024

Today I have been provided some data of a new event of this sort in a different project. This data will be provided via a private channel to the eclipselink team.

@99sono
Copy link
Author

99sono commented Apr 27, 2024

I intend to share my current approach for minimizing the risk of encountering this deadlock pattern. However, I must emphasize that this solution falls short of my ideal vision.

The approach I'm pursuing involves a substantial rewrite of the method:

org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMerge(Object, ObjectBuilder, ClassDescriptor, MergeManager)

Here are the key modifications:

  1. Eliminating Eternal Waits: First and foremost, I've removed the eternal waits from the method. Instead, it now employs timed waits.

  2. Deadlock Recognition: The revised method is designed to recognize when it's in deep trouble and a deadlock may be occurring.

  3. Returning Cache Key Despite Null Object: If the method detects a potential deadlock, it will return the cache key even if the associated object is still null. Surprisingly, this seems acceptable because the parent caller—specifically:

    org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(MergeManager.java:782)
    

    —appears to be robust. During debugging sessions, deliberately setting the cacheKey object to null revealed that the method can still retrieve the object from the server session cache if necessary. It seems the caller of the abstract session has an arsenal of tricks to navigate this post-commit phase and merge clones into the session cache.

  4. Flagging the Cache Key as INVALID: To safeguard against cache corruption, I flag the cache key at the root of the problem as an INVALID cache key. This minor trade-off ensures the entire cache remains secure, even if the cache key isn't updated.

  5. Imperfect Solution: Admittedly, the solution I'm proposing is far from perfect or my ideal vision. In an ideal scenario, the merge manager code would simply declare:

    "I recognize that I'm in deep trouble trying to merge my committed changes into the cache. I don't have the necessary locks to place my clones into the cache. Fine, I'll abandon my cache update attempt and instead mark every object in my change set as invalid."

    This approach would be the fix I'd like to see implemented.

  6. Benefits of the Proposed Fix:

    • The cache remains uncorrupted.
    • The system avoids freezing due to a pure deadlock.
    • Ultimately, the deadlock is resolved because the thread responsible for merging releases all its locks, saving the day.
  7. Experience Gap: Unfortunately, I lack the experience to implement my ideal fix.

In summary, I hope we can surpass the current implementation by addressing this defect effectively.

@99sono
Copy link
Author

99sono commented Apr 27, 2024

Here, with this post, I am uploading my patch.

It gives an overall idea of what I changed in the code.

0001-TRK-32421-Implement-an-attempt-to-fix-for-issue.patch

I am also uploading all of the core files i have changed.

full_classes_related_to_patch_tempFilesToUpload.zip

Overview of Uploaded Files:
With the files you've uploaded, you'll find both the patch I'll be testing this Monday and the complete files themselves. Having access to the full files is essential, especially if the patch proves challenging to understand.

Key Method for Code Analysis:
The most critical method to examine is:

org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMergeScenarioMergeManagerNotNullAndCacheKeyOriginalStillNull(CacheKey, Object, ObjectBuilder, ClassDescriptor, MergeManager)

This method is positioned just above the older implementation, which has now been renamed to:

org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMergeDeprecated(Object, ObjectBuilder, ClassDescriptor, MergeManager)

Begin your code analysis here, as it holds significant importance in the context of your changes.

To execute this code, ensure that your persistence.xml file includes the following new persistence property:

<!--
    Possibilities are:
        "backwards-compatibility" or "wait-loop-with-time-limit"
    The option "backwards compatibility" is not recommended since it may lead to a deadlock pattern.
    For more details, refer to:
    GitHub Issue #2094
-->
<property name="eclipselink.concurrency.manager.allow.abstractsession.getCacheKeyFromTargetSessionForMerge.modeofoperation" value="wait-loop-with-time-limit"/>

This property configuration is crucial for managing concurrency, and it's advisable to avoid the "backwards compatibility" option due to the potential deadlock issue. For further context, please review the provided GitHub issue.

@99sono
Copy link
Author

99sono commented Apr 29, 2024

In order to check the new code, I have had to heavily manipulate my debugger to override local variables so that I could finally get into the:

org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMergeScenarioMergeManagerNotNullAndCacheKeyOriginalStillNull(CacheKey, Object, ObjectBuilder, ClassDescriptor, MergeManager)

Here is what the new massive dump with the new page 08 looks like.

massive_dump_generated_by_2.7.6-wls14_1_1_0v003_2024_04_09_attempt_001.txt

Refining Code Execution in Manipulated EclipseLink

While working with the manipulated JAR file of EclipseLink, I successfully exercised the new code. The specific version we're using is 2.7.6-wls14_1_1_0v003_2024_04_09_attempt_001.

This JAR includes not only the bugfix for the issue at hand but also integrates all other relevant bugfixes. You can find details about the issue on GitHub.

Unfortunately, I lack the means to reproduce the bug directly. Consequently, my approach involves strategically placing debug breakpoints during the merge phase of EclipseLink. Specifically, I focus on the following methods:

  1. org.eclipse.persistence.internal.sessions.MergeManager.mergeChangesOfWorkingCopyIntoOriginal(Object, ObjectChangeSet, ClassDescriptor, AbstractSession, UnitOfWorkImpl)
  2. org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMerge(Object, ObjectBuilder, ClassDescriptor, MergeManager)
  3. And finally, within the method:
    org.eclipse.persistence.internal.sessions.AbstractSession.getCacheKeyFromTargetSessionForMergeScenarioMergeManagerNotNullAndCacheKeyOriginalStillNull(CacheKey, Object, ObjectBuilder, ClassDescriptor, MergeManager)

By manipulating these critical areas, I can effectively exercise the new code.

I'll keep you informed if we manage to eliminate this new deadlock pattern. However, it's essential to acknowledge that our current approach to patching this bug falls short of the ideal solution.

I have already explained that the ideal solution here consists on the merge manager release 100% of its locks, stopping its attempt to update the server session cache, and flagging all cache keys as invalid / stale.

That is the proper way to go.

My approach my prove to not be sufficiently effective, we shall see.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant