Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

state: fix locking issues during Commit #1102

Merged
merged 2 commits into from
Sep 20, 2023
Merged

state: fix locking issues during Commit #1102

merged 2 commits into from
Sep 20, 2023

Conversation

altergui
Copy link
Contributor

@altergui altergui commented Sep 13, 2023

state: fix locking issues during Commit

  • State.MainTreeView needs to respect v.tx.RLock()
  • State.Save now defers v.tx.Unlock() until the end

fix #1100 'while down, could not get value for key'
that was returned on any query to the state
during Commit of a block

  • e2etest: add raceDuringCommit to reproduce 'while down, could not get value for key'

@altergui altergui changed the title f/arbo while down test to reproduce "arbo while down" bug Sep 13, 2023
@altergui altergui self-assigned this Sep 13, 2023
@altergui altergui marked this pull request as draft September 13, 2023 11:05
@altergui altergui changed the title test to reproduce "arbo while down" bug test to reproduce "arbo while down" bug (WIP) Sep 13, 2023
@coveralls
Copy link

coveralls commented Sep 13, 2023

Pull Request Test Coverage Report for Build 6247858876

  • 71 of 84 (84.52%) changed or added relevant lines in 3 files are covered.
  • 29 unchanged lines in 12 files lost coverage.
  • Overall coverage decreased (-0.03%) to 57.414%

Changes Missing Coverage Covered Lines Changed/Added Lines %
cmd/end2endtest/race.go 68 81 83.95%
Files with Coverage Reduction New Missed Lines %
httprouter/apirest/apirest.go 1 74.67%
api/chain.go 2 31.48%
apiclient/blockchain.go 2 43.14%
tree/arbo/navigate.go 2 71.56%
tree/arbo/tree.go 2 60.92%
vochain/transaction/election_tx.go 2 45.93%
vochain/transaction/transaction.go 2 38.79%
vochain/vochaininfo/vochaininfo.go 2 77.83%
apiclient/helpers.go 3 67.43%
vochain/state/state.go 3 66.87%
Totals Coverage Status
Change from base Build 6199566912: -0.03%
Covered Lines: 12515
Relevant Lines: 21798

💛 - Coveralls

@altergui
Copy link
Contributor Author

altergui commented Sep 18, 2023

i went down a deep rabbit hole until ac6202a trying to reproduce the issue with go test, only to hit (and fix) a couple other concurrency issues, until i gave up due to #1106 and discarded that approach altogether

so i went the end2endtest way, and reproduced it by bruteforce 🥳
https://github.com/vocdoni/vocdoni-node/actions/runs/6222851817/job/16887584057?pr=1102#step:3:2970

testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.981Z DBG vochain/app.go:354 > deliver tx hash=ec186246564eab843af37f3dbbc64132ea1d4c404abeba38119e0edf31c51821 height=6 tx={"Payload":{"SetProcess":{"nonce":1,"processId":"xdJGAYb3gk927pAoDZUKYHyW369msbnI6pMCAAAAAAA=","status":2,"txtype":2}}} type=setProcess
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.982Z DBG ist/ist.go:124 > schedule IST action action=commit-results height=7 id=c5d2460186f7824f76ee90280d950a607c96dfaf66b1b9c8ea93020000000000
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.982Z DBG state/account.go:258 > burning fee for tx SET_PROCESS_STATUS with cost 1 from account 0x824f76Ee90280D950a607C96dFaf66b1b9C8ea93
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.982Z DBG state/account.go:202 > setAccount: address 0xFFfFfFffFFfffFFfFFfFFFFFffFFFffffFfFFFfF, nonce 0, infoURI , balance: 13, delegates: [], processIndex: 0
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.982Z DBG state/account.go:202 > setAccount: address 0x824f76Ee90280D950a607C96dFaf66b1b9C8ea93, nonce 2, infoURI ipfs://bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi, balance: 99989, delegates: [], processIndex: 1
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.982Z DBG offchaindatahandler/offchaindatahandler.go:151 > adding account info metadata ipfs://bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi to queue
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.99Z DBG httprouter/message.go:65 > response: (500) {"error":"election results are not yet available","code":5024}
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.99Z DBG middleware/logger.go:158 > "GET http://gateway0:9090/v2/elections/c5d2460186f7824f76ee90280d950a607c96dfaf66b1b9c8ea93020000000000/scrutiny HTTP/1.1" from 172.18.0.5:40342 - 500 63B in 165.331µs
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.991Z DBG indexer/indexer.go:326 > updated process processID=c5d2460186f7824f76ee90280d950a607c96dfaf66b1b9c8ea93020000000000
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.991Z INF offchaindatahandler/offchaindatahandler.go:79 > importing data type="election metadata" uri=ipfs://bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.991Z DBG downloader/downloader.go:121 > pining remote file uri=ipfs://bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.991Z DBG ipfs/ipfs.go:269 > retrieved file /ipfs/bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi from cache
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:33.991Z INF offchaindatahandler/metadata.go:17 > metadata downloaded successfully from ipfs://bafybeig6es56e5luv6mpletx2y6cwtokixsfzlqy3tzazxgcuzehdmygqi (139 bytes)
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:34.001Z DBG httprouter/message.go:65 > response: (500) {"error":"cannot fetch election: (c5d2460186f7824f76ee90280d950a607c96dfaf66b1b9c8ea93020000000000): while down, could not get value for key 5408cfc0908767c96f6d65bb346a54eec51f0c9751786022832ec3308303de23: key not found","code":5008}
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:34.001Z DBG middleware/logger.go:158 > "GET http://gateway0:9090/v2/elections/c5d2460186f7824f76ee90280d950a607c96dfaf66b1b9c8ea93020000000000/scrutiny HTTP/1.1" from 172.18.0.5:40342 - 500 235B in 152.881µs
testsuite_6222851817-gateway0-1    | 2023-09-18T13:00:34.001Z INF vochain/app.go:339 > commit block height=6 txs=1

@altergui
Copy link
Contributor Author

so definitely the race is during app.Commit
just before Commit starts, "election results are not yet available"
during Commit, "key not found"
after Commit, results

testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.647Z DBG httprouter/message.go:65 > response: (500) {"error":"election results are not yet available","code":5024}
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.647Z DBG middleware/logger.go:158 > "GET http://gateway0:9090/v2/elections/c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000/scrutiny HTTP/1.1" from 192.168.64.5:34370 - 500 63B in 150.921µs
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.653Z ERR vochain/app.go:325 > Commit: start height 7
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.653Z DBG indexer/vote.go:110 > finalize results height=7 processID=c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.654Z DBG indexer/indexer.go:326 > updated process processID=c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.658Z ERR pebbledb/pebledb.go:23 > pebble.ErrNotFound key 742f70918abcf7a86089eb2b40e05817f19674544042cf22b50aed7319c7a7961154
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.658Z ERR arbo/tree.go:546 > t.GetWithTx err while down, could not get value for key 70918abcf7a86089eb2b40e05817f19674544042cf22b50aed7319c7a7961154: key not found
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.658Z DBG httprouter/message.go:65 > response: (500) {"error":"cannot fetch election: (c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000): while down, could not get value for key 70918abcf7a86089eb2b40e05817f19674544042cf22b50aed7319c7a7961154: key not found","code":5008}
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.658Z DBG middleware/logger.go:158 > "GET http://gateway0:9090/v2/elections/c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000/scrutiny HTTP/1.1" from 192.168.64.5:34370 - 500 235B in 242.352µs
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.664Z ERR vochain/app.go:343 > Commit: done height 7 txs 0
testsuite_6225351155-gateway0-1    | 2023-09-18T16:31:48.669Z DBG httprouter/message.go:65 > response: (200) {"abiEncoded":"0xc5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc0200000000000000000000000000000000008a33bf69942145a4edd5c7f8d6a2fb1265c42adcca801295639bdab61157cf9a187b743aa54c7128cd6c0ba271e8951e0d68a7de000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000a0000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000200000000000000000000000000000000000000000000000000000000000000003000000000000000000000000000000000000000000000000000000000000003200000000000000000000000000000000000000000000000000000000000000320000000000000000000000000000000000000000000000000000000000000000","censusRoot":"ca801295639bdab61157cf9a187b743aa54c7128cd6c0ba271e8951e0d68a7de","electionId":"c5d2460186f78a33bf69942145a4edd5c7f8d6a2fb1265c42adc020000000000","organizationId":"8a33bf69942145a4edd5c7f8d6a2fb1265c42adc","results":[["50","50","0"]]}

@altergui
Copy link
Contributor Author

so, some progress, but not enough (yet)

on one hand, pau reminded me of #465 which i believe will be fixed by fc75e10

in theory my batch.mu shouldn't be necessary since all users of batch implement their own locking (as in treeTxWithMutex)
but this sometimes did not work as expected (like #581 or #1069)

@altergui
Copy link
Contributor Author

but on another hand, it looks like the Batch.Commit was not the culprit (it was just a coincidence?), after implementing the lock, the commit starts and ends ... AND THEN some ms after, the pebble.ErrNotFound still happens. so i'll keep digging 😓

https://github.com/vocdoni/vocdoni-node/actions/runs/6234524391/job/16921885853?pr=1102#step:4:7406

testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.499Z ERR pebbledb/pebledb.go:114 > tx.batch.Commit with len 1281
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.508Z ERR pebbledb/pebledb.go:116 > tx.batch.Commit end, stats {0s 0s 0s 0s 0s 0s 0s}
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:74 > WriteTx.Get 6d2f762f63757272656e74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:74 > WriteTx.Get 6d2f762f07000000
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:74 > WriteTx.Get 742f726f6f74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 742f05d34c6603ef0a263cb552c6d1b6d141007cd7d402f5ed2a5ac677e347360d55
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:74 > WriteTx.Get 742f726f6f74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 6d2f762f63757272656e74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:26 > pebble.ErrNotFound key 742f05d34c6603ef0a263cb552c6d1b6d141007cd7d402f5ed2a5ac677e347360d55
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 6d2f762f07000000
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR arbo/tree.go:543 > t.GetWithTx err while down, could not get value for key 05d34c6603ef0a263cb552c6d1b6d141007cd7d402f5ed2a5ac677e347360d55: key not found
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 742f726f6f74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 742fdd6cf959bc95bbb7a2526c775ec8db24719bb33330f2cb0cc2787aa307c7f66c
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z DBG httprouter/message.go:65 > response: (500) {"error":"cannot fetch election: (c5d2460186f7120773dcefe86c12965290aff8ef1f63940350aa020000000000): while down, could not get value for key 05d34c6603ef0a263cb552c6d1b6d141007cd7d402f5ed2a5ac677e347360d55: key not found","code":5008}
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 6d2f762f63757272656e74
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z DBG middleware/logger.go:158 > "GET http://gateway0:9090/v2/elections/c5d2460186f7120773dcefe86c12965290aff8ef1f63940350aa020000000000/scrutiny HTTP/1.1" from 172.30.0.5:46816 - 500 235B in 10.185635ms
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR pebbledb/pebledb.go:164 > PebbleDB.Get 6d2f762f07000000
testsuite_6234524391-gateway0-1    | 2023-09-19T10:46:52.509Z ERR vochain/app.go:343 > Commit: done height 7 txs 0

@altergui
Copy link
Contributor Author

mapping the rabbit hole:

GET http://gateway0:9090/v2/elections/c5d2460186f7ad518812f44e1358976412a5efeb7379935c021d020000000000/scrutiny goes down this path:

// @Router /elections/{electionID}/scrutiny [get]
func (a *API) electionScrutinyHandler(_ *apirest.APIdata, ctx *httprouter.HTTPContext) error {
electionID, err := hex.DecodeString(util.TrimHex(ctx.URLParam("electionID")))
if err != nil || electionID == nil {
return ErrCantParseElectionID.Withf("(%s): %v", ctx.URLParam("electionID"), err)
}
process, err := getElection(electionID, a.vocapp.State)

func getElection(electionID []byte, vs *state.State) (*models.Process, error) {
process, err := vs.Process(electionID, true)

note committed=true so v.tx.RLock is skipped in next func

func (v *State) Process(pid []byte, committed bool) (*models.Process, error) {
if !committed {
v.tx.RLock()
defer v.tx.RUnlock()
}
return getProcess(v.mainTreeViewer(committed), pid)

and mainTreeViewer returns a v.MainTreeView()

// mainTreeViewer returns the mainTree as a treeViewer.
// When committed is false, the mainTree returned is the not yet committed one
// from the currently open StateDB transaction.
// When committed is true, the mainTree returned is the last committed version.
func (v *State) mainTreeViewer(committed bool) statedb.TreeViewer {
if committed {
return v.MainTreeView()
}
return v.tx.AsTreeView()
}

which is, in turn, just a pointer

// MainTreeView is a thread-safe function to obtain a pointer to the last
// opened mainTree as a TreeView.
func (v *State) MainTreeView() *statedb.TreeView {
return v.mainTreeViewValue.Load()
}

that is only ever updated using setMainTreeView

// setMainTreeView is a thread-safe function to store a pointer to the last
// opened mainTree as TreeView.
func (v *State) setMainTreeView(treeView *statedb.TreeView) {
v.mainTreeViewValue.Store(treeView)
}

  • during NewState, simply init to a nil tree (sdb.TreeView(nil))
  • during State.Save(), at the very end of the func, after releasing v.tx.Unlock()

v.tx.Unlock()
if err != nil {
return nil, err
}
// Update the main state tree
mainTreeView, err := v.store.TreeView(nil)
if err != nil {
return nil, fmt.Errorf("cannot get statedb mainTreeView: %w", err)
}
v.setMainTreeView(mainTreeView)

so my understanding is that when the API endpoint is queried during Commit, getProcess waits until v.tx.Unlock is released, then manages to get a v.MainTreeView() pointing to the leftovers of the old tree (garbage, just got pruned), and a few milliseconds after, setMainTreeView points the value to the new fresh tree, so future queries are answered correctly.

@altergui
Copy link
Contributor Author

altergui commented Sep 19, 2023

so my understanding is that when the API endpoint is queried during Commit, getProcess waits until v.tx.Unlock is released, then manages to get a v.MainTreeView() pointing to the leftovers of the old tree (garbage, just got pruned), and a few milliseconds after, setMainTreeView points the value to the new fresh tree, so future queries are answered correctly.

well, not exactly. getProcess doesn't really waits until v.tx.Unlock because there's no mutex at all in that code path, so it immediately gets the v.MainTreeView() pointing to the tree-being-pruned. It does wait during the Get of that dead tree, but it's too late.

i believe the correct fix is to make MainTreeView respect the lock as well

func (v *State) MainTreeView() *statedb.TreeView {
v.tx.RLock()
defer v.tx.RUnlock()
return v.mainTreeViewValue.Load()
}

(in addition to the defer v.tx.Unlock() during Save, of course)

defer v.tx.Unlock()

with these two changes, the issue is gone for good (both in CI and in my local tests, consistently). WDYT @mvdan @p4u?

i understand this was no issue until now, since trees were never pruned, so the old trees were left intact and so Get would never fail.

@altergui altergui changed the base branch from main to f/ci-random-fixes September 19, 2023 16:11
@altergui
Copy link
Contributor Author

i keep it as "draft" since it's not meant to be merged as-is since it disables all e2etests, to force issue reproduction. but please review @p4u @mvdan

@altergui altergui force-pushed the f/ci-random-fixes branch 3 times, most recently from ad87898 to 2615e7f Compare September 20, 2023 10:09
@altergui altergui marked this pull request as ready for review September 20, 2023 11:12
@altergui altergui changed the title test to reproduce "arbo while down" bug (WIP) state: fix locking issues during Commit Sep 20, 2023
@altergui altergui requested review from p4u and mvdan September 20, 2023 11:12
@altergui altergui changed the base branch from f/ci-random-fixes to main September 20, 2023 11:14
* State.MainTreeView needs to respect v.tx.RLock()
* State.Save now defers v.tx.Unlock() until the end

fix #1100 'while down, could not get value for key'
that was returned on any query to the state
during Commit of a block
@altergui
Copy link
Contributor Author

altergui commented Sep 20, 2023

the new test raceDuringCommit triggers issue #1100 reproducibly (i triggered 3 re-runs, all of them failed)
https://github.com/vocdoni/vocdoni-node/actions/runs/6247856534

image

and after applying the fix 6bca6cd, no more failures (2 re-runs)
https://github.com/vocdoni/vocdoni-node/actions/runs/6247858876

image

🥳

@altergui
Copy link
Contributor Author

( Build and Test / Publish code coverage (pull_request) Failing after 14s failed due to #1109 , just ignore that check)

@p4u
Copy link
Member

p4u commented Sep 20, 2023

So, is this ready to merge @altergui ?

@p4u p4u merged commit 94d6149 into main Sep 20, 2023
14 of 15 checks passed
@p4u p4u deleted the f/arbo-while-down branch September 20, 2023 16:52
altergui added a commit that referenced this pull request Sep 20, 2023
PR #1102 was merged without realising it needed some changes
due to the merge of PR #1056 (one hour before)

this totally broke the `main` build, so here's a hotfix
altergui added a commit that referenced this pull request Sep 20, 2023
PR #1102 was merged without realising it needed some changes
due to the merge of PR #1056 (one hour before)

this totally broke the `main` build, so here's a hotfix
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

arbo: while down, could not get value for key
3 participants