Infinite Loop of Panic/Rollback (NewEpochFailure) #398
Is this mainnet? Are you upgrading from one version of the software to another?
Version?
Yes, mainnet. I built a new DB and it is working now with the same version, so I'm not sure what happened.
Closing this.
I've just hit this on the
The problem happens at epoch rollover.
@rhyslbw what is the git hash of the version you are running? The one @mmahut was using in #404 was commit 6187081, which is missing commit afe68e0, which contains changes to the ledger-state config. This issue is about it dying in ledger state related code, but that change should not make a difference on mainnet.
This is a HUGE pain in the neck to debug without a fix for #256.
@erikd I'm using the release tag commit
Did a LOT of work trying to recreate this issue, but it is not deterministic. I am currently running a version of this code that should better catch any errors (and abort immediately). I am hoping there is a chance of triggering this again on the next epoch boundary, which happens about 14 hours from now.
10 nodes, all of them crashed with this specific bug. The lstate files are:
and using git rev
Current hypothesis is that the ledger state gets corrupted at some point and that the corruption is only noticed at the epoch boundary.
Infinite loop NewEpochFailure
2 of 3 instances threw the
The snapshots of the instance not affected have been rolled out.
The fact that the same ledger state file, e.g.
If the cause is a corrupted ledger state, this and #405 are probably the same issue.
@SebastienGllmt Yes, that is possible, but I have not even had a chance to look at #405 yet.
2 out of 4 instances went down
(the other failing instance has different hashes). Healthy instance:
I tried to resync 10 different instances on the same commit. 2 out of 10 had a corrupted version of the ledger state files (different from the rest). The correct sums are:
I have also noticed an inconsistency in the files.
FYI, same issue again on the epoch 230 transition...
Ok, I know what is causing the problem, and fixing it is relatively simple. The fix will not require a DB resync unless the ledger state is already corrupt (which will be detected by the fixed version of the software). The problem is:
@erikd Is it possible to make this a config toggle between full checking and fast checking? I'd prefer to run everything in "safe" mode and use extra resources to make sure it stays up.
@CyberCyclone Once the hash is checked, there is nothing else that can go wrong with probability greater than the chance of a 256-bit hash collision. The hash should have been checked; I thought it was being checked. Once it is checked, there is no reason to do more checking.
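To make the discussion concrete, here is a minimal sketch of the kind of integrity check being described: hash the ledger state file on disk and compare against an expected digest before deserialising. This is illustrative only, not the project's actual code; it assumes the `cryptonite` library, and `verifyLedgerFile` is a hypothetical name.

```haskell
import qualified Data.ByteString as BS
import           Crypto.Hash (Digest, SHA256, hash)

-- Hypothetical integrity check: only hand the bytes to the
-- deserialiser if they hash to the expected 256-bit digest.
-- Past this point, undetected corruption would require a hash
-- collision, which is not a realistic failure mode.
verifyLedgerFile :: Digest SHA256 -> FilePath -> IO (Maybe BS.ByteString)
verifyLedgerFile expected path = do
  bytes <- BS.readFile path
  pure $ if hash bytes == expected
           then Just bytes   -- safe to deserialise
           else Nothing      -- corrupt: caller should resync
```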
Awesome, great to hear! The way it was worded sounded like there was a lot more going on. But yeah, hash collisions aren't anything to worry about. |
It should not be possible for it to roll back to the correct slot number but the wrong block. The chain sync protocol instructs the consumer to roll back to a specified point on the chain (a point being a slot plus a hash), and this point is guaranteed to exist on the consumer's chain. Yes, it is very sensible to check, but if this check were to fail then that indicates a logic bug somewhere. So I think this will need more investigation before we can call it fixed. Adding an assertion should detect the problem much more promptly, at the point where it occurs, rather than much later at the epoch boundary. Adding an assertion is not itself a fix, of course.
It is possible if the rollback only checks the slot number but not the hash.
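As a sketch of the stricter check being discussed (using hypothetical, simplified `Point` and `Block` types; the real chain-sync types are richer), a rollback that matches on both slot and hash fails loudly when the point is not on the local chain, instead of silently accepting a slot-only match from a different fork:

```haskell
import Data.ByteString (ByteString)
import Data.Word (Word64)

-- Hypothetical simplified types for illustration.
data Point = Point { pSlot :: Word64, pHash :: ByteString }
data Block = Block { bSlot :: Word64, bHash :: ByteString }

-- Roll a newest-first chain fragment back to the given point.
-- Checking the hash as well as the slot means a block from a
-- different fork at the same slot is rejected rather than kept,
-- which is the failure mode described above.
rollbackTo :: Point -> [Block] -> Maybe [Block]
rollbackTo p = go
  where
    go [] = Nothing  -- point not on our chain: logic bug upstream
    go chain@(b : rest)
      | bSlot b == pSlot p && bHash b == pHash p = Just chain
      | otherwise                                = go rest
```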
The logging has now produced this:
which is a little odd. Restarting it results in:
Need to check the code for this.
I have a temporary workaround fix for this. The workaround comes from my work-in-progress debugging branch, but has not been fully tested, QAed, or released. If anyone is running the … there are no database changes relative to …. However, running this version may detect an already corrupted ledger state (I am not even sure what that would look like), in which case a resync will be required.
After adding a bunch of debug code and then waiting for the problem to be triggered, it turns out this issue is a race condition. From the logs:
Basically what happens is:
The fix is to move the code that rolls back the ledger state from the write end of the queue to the read end.
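To illustrate the shape of that fix (a sketch under assumed names, not the actual patch): when the thread at the write end applies rollbacks to the ledger state directly, a rollback can race with `ApplyBlock` actions still sitting in the queue; making the rollback just another queued action forces the reader thread to apply everything in arrival order.

```haskell
import Control.Concurrent.STM (atomically)
import Control.Concurrent.STM.TBQueue (TBQueue, readTBQueue, writeTBQueue)

-- Hypothetical action type carried between the network thread
-- (write end) and the ledger thread (read end).
data DbAction blk pt
  = ApplyBlock blk
  | RollbackTo pt

-- The write end only enqueues; it never touches the ledger state,
-- so a rollback can no longer overtake blocks queued before it.
enqueue :: TBQueue (DbAction blk pt) -> DbAction blk pt -> IO ()
enqueue queue = atomically . writeTBQueue queue

-- The read end applies actions strictly in order.
consume :: TBQueue (DbAction blk pt)
        -> (blk -> IO ())  -- apply a block to the ledger state
        -> (pt  -> IO ())  -- roll the ledger state back
        -> IO ()
consume queue applyBlock rollback = loop
  where
    loop = do
      action <- atomically (readTBQueue queue)
      case action of
        ApplyBlock blk -> applyBlock blk
        RollbackTo pt  -> rollback pt
      loop
```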
Fixed on master in #413. There will also be