Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: handle BerkeleyJE DB interruption [tp-tests] #4425

Merged
merged 1 commit into from May 14, 2024

Conversation

tien
Copy link
Contributor

@tien tien commented Apr 29, 2024

Currently, any interruption to BerkeleyJE DB will cause Janusgraph to stop working entirely & require a complete restart of Janusgraph. This PR will make it so interruption to BerkeleyJE DB will close & reopen the Berkeley's environment instead.

Fixes #2120

Copy link

linux-foundation-easycla bot commented Apr 29, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: tien / name: Tiến Nguyễn Khắc (4b86cbd)

@tien
Copy link
Contributor Author

tien commented Apr 29, 2024

@porunov @li-boxuan struggling on how to test this, the test that I copied from #3990 pass but when I check it actually doesn't do anything.

@tien tien force-pushed the fix/berkeley-db-interruption branch 6 times, most recently from d36b5de to c9e5d4d Compare April 30, 2024 10:41
@tien
Copy link
Contributor Author

tien commented Apr 30, 2024

Update: I was able to write a test to consistently trigger the error n prove that my change should work.

@tien tien force-pushed the fix/berkeley-db-interruption branch 2 times, most recently from be7dc33 to 88f00f5 Compare April 30, 2024 14:27
@tien
Copy link
Contributor Author

tien commented Apr 30, 2024

Have made a change to only re/initialize on interruption.

@janusgraph-bot janusgraph-bot added the cla: external Externally-managed CLA label Apr 30, 2024
@tien tien force-pushed the fix/berkeley-db-interruption branch 2 times, most recently from 912bc67 to 771d46e Compare April 30, 2024 23:12
@tien
Copy link
Contributor Author

tien commented Apr 30, 2024

Oops, forgot to enable my test since it works now, have done so.

@tien tien force-pushed the fix/berkeley-db-interruption branch from 771d46e to c7e7eca Compare May 1, 2024 04:16
@tien
Copy link
Contributor Author

tien commented May 1, 2024

@li-boxuan @porunov all tests passed, can you guys take a look 👀? Keen to get this merged soon 💪.

@li-boxuan li-boxuan requested a review from mad May 2, 2024 03:04
@li-boxuan
Copy link
Member

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

@tien
Copy link
Contributor Author

tien commented May 2, 2024

I am sorry but I seem to have missed some context. Do you know why @mad 's previous PR didn't pass the test, and what did you do differently to solve the problem?

@li-boxuan I'm not sure why that one didn't pass the test. I'm using a completely different approach in this PR that involves a lot less changes.

@mad
Copy link
Contributor

mad commented May 2, 2024

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

@tien tien force-pushed the fix/berkeley-db-interruption branch from c7e7eca to fb85ab0 Compare May 2, 2024 14:11
@tien
Copy link
Contributor Author

tien commented May 2, 2024

Thank you @mad, I have committed the suggested changes.

@tien
Copy link
Contributor Author

tien commented May 2, 2024

Thanks for your contribution!

For proper testing, you need to disable test filtering here, here and here

To be completely correct, you still need to enabld this (supportsInterruption)

@mad Look like doing this will actually fail the tests.

I think it's because BerkeleyJE DB technically still does not support interruption, as in it will throw an error upon being interrupted. We are only reopening the environment here so that future requests won't all fail.

@tien tien force-pushed the fix/berkeley-db-interruption branch from fb85ab0 to 2e7eaf3 Compare May 2, 2024 22:47
@tien
Copy link
Contributor Author

tien commented May 2, 2024

@mad @li-boxuan I have implemented the interruption retry logic.

@tien tien force-pushed the fix/berkeley-db-interruption branch from 98a9cbb to 62de78d Compare May 3, 2024 22:33
@tien
Copy link
Contributor Author

tien commented May 3, 2024

I've integrated some of the other changes from @mad, so the interruption test suite should pass now. cc @mad @li-boxuan

@tien tien requested a review from li-boxuan May 3, 2024 22:46
@tien tien force-pushed the fix/berkeley-db-interruption branch 3 times, most recently from 0015c00 to 68976e5 Compare May 3, 2024 23:25
@tien
Copy link
Contributor Author

tien commented May 4, 2024

Hmm, look like there was one last flaky test in the interruption test.

This test parameter in the TraversalInterruptionTest & TraversalInterruptionComputerTest is failing sometime locally for me, which look like it did fail on the CI:

{"g_E_properties", (Function<GraphTraversalSource, GraphTraversal<?,?>>) g -> g.E(), (UnaryOperator<GraphTraversal<?,?>>) t -> t.properties()}

Will investigate, getting close, just 1 case left that need to pass 💪.

@tien tien force-pushed the fix/berkeley-db-interruption branch from 68976e5 to ef3aa1f Compare May 4, 2024 10:27
@tien
Copy link
Contributor Author

tien commented May 4, 2024

Okay, I've made further changes, after which I haven't been able to reproduce any failure when running the test locally. Hopefully that means the flakiness has gone away 😅.

@tien tien force-pushed the fix/berkeley-db-interruption branch from ef3aa1f to 3a56be5 Compare May 4, 2024 11:46
@tien
Copy link
Contributor Author

tien commented May 4, 2024

Oops, forgot to check other tests beside interruption, made I mistake, have pushed a fix.

@tien tien force-pushed the fix/berkeley-db-interruption branch from 3a56be5 to 7c176ce Compare May 5, 2024 06:59
Co-authored-by: Pavel Ershov <owner.mad.epa@gmail.com>

Signed-off-by: Tiến Nguyễn Khắc <tien.nguyenkhac@icloud.com>
@tien tien force-pushed the fix/berkeley-db-interruption branch from 7c176ce to 4b86cbd Compare May 5, 2024 07:20
@tien
Copy link
Contributor Author

tien commented May 5, 2024

Have made the best effort implementation, so BerkeleyJE DB environment will be reset after interruption & also that race condition around thread are covered so that the interruption test from tp-test will always works.

@tien
Copy link
Contributor Author

tien commented May 5, 2024

@li-boxuan @porunov all tests passed :D

There's the coverage check, but I don't think I can get it any higher though, cuz a lot of the uncovered code are catch clauses. Which are impossible to trigger consistently.

@li-boxuan
Copy link
Member

Codecov sometimes seems buggy and we don't know why. Don't worry about that. Congrats on passing all tests! @mad would you be able to take another look?

Copy link
Contributor

@mad mad left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finishing the job!

@tien
Copy link
Contributor Author

tien commented May 7, 2024

Awesome, thanks for all the helps and supports guys 🙏

@li-boxuan
Copy link
Member

I am gonna merge this following lazy merge consensus. Thank you @tien for fixing this!

@li-boxuan li-boxuan added this to the Release v1.1.0 milestone May 14, 2024
@li-boxuan li-boxuan merged commit 90b9694 into JanusGraph:master May 14, 2024
118 of 119 checks passed
@janusgraph-automations
Copy link

💚 All backports created successfully

Status Branch Result
v1.0

Note: Successful backport PRs will be merged automatically after passing CI.

Questions ?

Please refer to the Backport tool documentation and see the Github Action logs for details

@porunov
Copy link
Member

porunov commented May 14, 2024

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

@tien
Copy link
Contributor Author

tien commented May 15, 2024

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610
But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

@porunov
Copy link
Member

porunov commented May 15, 2024

After merging this PR tp-tests are not failing in master branch for berkleyje.

Error:  Errors: 
Error:  org.apache.tinkerpop.gremlin.process.traversal.TraversalInterruptionTest.shouldRespectThreadInterruptionInVertexStep[expectInterruption(g_V_outE)]
[INFO]   Run 1: PASS
Error:    Run 2: TraversalInterruptionTest>AbstractGremlinTest.tearDown:155 » LockTimeout (JE 18.3.12) Lock expired. Locker 1229014061 2150_main_Txn: waited for lock on database=_jeNameMap LockAddr:617727493 LSN=0x0/0x1167 type=WRITE grant=WAIT_NEW timeoutMillis=500 startTime=1715693374163 endTime=1715693374663

I will try to re-run those tests again. If they don't pass we should revert this commit.

I'm pretty sure this is just a flaky error due to Berkeley DB can't handle many concurrent reads/writes, which now happen cuz we no longer exclude it from some of the tests.

Like I can see that this pass https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965523610 But this run got the time-out error https://github.com/JanusGraph/janusgraph/actions/runs/9075171201/job/24965524860

Look like a whole other issue: #1623

Yes, tests for berkeleyje with Java 8 passed after restart, but failed again for Java 11. I restarted the job again for Java 11.
We should probably increase those timeouts as I see timeoutMillis=500 which is probably isn't enough for berkeleyje in GitHub Actions sometimes.

@tien
Copy link
Contributor Author

tien commented May 15, 2024

@porunov I can see that it failed again :(

It's like a flaky race condition where the test try to create an entirely new Berkeley DB graph when the previous one is still invalid/re-intializing (Probably due to the lock issue). This isn't handled by this PR: the BerkeleyDB store doesn't have control over other instances created before or after it.

This won't happen in production because you'll most likely & should have only one Janusgraph with Berkeley DB instance up and running.

Maybe we should just continue what we did before and disable this particular test for BerkeleyDB?

Or push on with the flaky test, It has always pass for me when I run this locally :p

@tien
Copy link
Contributor Author

tien commented May 15, 2024

Though like you said @porunov increasing the lock time-out might just make this problem go away :p

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/v1.0 cla: external Externally-managed CLA
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JanusGraph image stop responding after query timeout
6 participants