solver: prevent edge merge to inactive states #4887

Open
wants to merge 3 commits into master

Conversation

@sipsma
Collaborator

sipsma commented Apr 26, 2024

Before this, it was possible for an edge merge to happen to a target
edge/state that is no longer in the actives map, e.g.

  1. Job A solves some edges
  2. Job B solves some edges that get merged with job A's edges
  3. Job A is discarded
  4. Job C solves some edges that get merged with job B's edges, which
    recursively end up merging to job A's, which no longer exist in the
    actives map.

While this doesn't always result in an error, given the right state and
order of operations it can result in `getState` or `getEdge` being
called on "stale" edges that are inactive, which produces an error. E.g. if a
stale dep transitions from desired state `cache-fast` to `cache-slow`,
the slow cache logic will call `getState` on it and return a `compute cache` error.

I also *suspect* this is the same root cause behind `inconsistent graph state`
errors still seen occasionally, but I have not repro'd that error
locally so can't be 100% sure yet.

The fix here updates `state.setEdge` to register all the jobs in the
source edge's state with the target edge's state (a rough sketch of the
idea follows the list below). This works out because:

  1. edges that are the target of a merge will not have their state
    removed from the actives map if only the original job creating them
    is discarded
  2. those edges are still removed eventually, but only when all jobs
    referencing them (now including jobs referencing them via edge
    merges) are discarded
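
Roughly, the idea looks like this; the `Job`, `state`, and `solver` types and the `mergeInto`/`discardJob` helpers below are illustrative stand-ins, not the actual buildkit code:

```go
package solver

// Illustrative stand-ins for buildkit's internal types; only the job
// bookkeeping matters here.
type Job struct{ id string }

type state struct {
	jobs map[*Job]struct{} // every job that still references this state
}

type solver struct {
	actives map[string]*state // vertex digest -> state
}

// mergeInto registers every job referencing the source edge's state on the
// target edge's state as well, so the target outlives the job that
// originally created it.
func mergeInto(target, src *state) {
	for j := range src.jobs {
		target.jobs[j] = struct{}{}
	}
}

// discardJob shows why this works: a state only leaves the actives map once
// no job (original or merged-in) references it anymore.
func (s *solver) discardJob(j *Job) {
	for dgst, st := range s.actives {
		delete(st.jobs, j)
		if len(st.jobs) == 0 {
			delete(s.actives, dgst)
		}
	}
}
```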

I previously was able to repro a `failed to compute cache key: failed to get state for index 0` error very consistently locally, but no longer can with this change. It's hard to integ test since it involves a very specific set of states and ordering, but I added a unit test for the change made here to not discard jobs that are still referenced via edge merges.


Also have some updates to logs that made them easier to parse: eab04dc

In case it's ever useful to anyone reading this, the code I wrote for parsing, filtering and displaying the scheduler debug logs is here: https://gist.github.com/sipsma/e67119d96cebe06e69e69f9ba9be2e78

  • Very far from polished or complete, but still ended up being extremely useful for debugging here when I had 1.4GB of logs to sift through 😄

Draft Until:

  • I noticed that buildkitd after this change had 5000+ goroutines left around stuck on progress-related code. Need to check if pre-existing issue or related to this change.
    • Seems to be a pre-existing issue, same thing happens without the fix here. Dagger also just went through a lot of changes with how progress is handled so it's entirely possible this doesn't even happen in vanilla buildkitd.

While debugging solver bugs with scheduler debug logs I ended up with so
many logs that I needed to write a program that parsed them and
extracted relevant information so it could be summarized more
comprehensibly.

That was greatly simplified by updating some of the old scheduler debug
logs to use logrus fields rather than one-off `fmt.Sprintf`s. All the
same information as before is present, just formatted more consistently
for easier parsability.
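
As a rough before/after illustration of that change in style (the field names, message, and call site below are made up for the example, not the exact ones touched in the commit):

```go
package main

import "github.com/sirupsen/logrus"

func main() {
	logrus.SetLevel(logrus.DebugLevel)

	vertexDigest := "sha256:abc123"
	jobID := "job-1"

	// Before: a one-off formatted message that is awkward to parse mechanically.
	logrus.Debugf("merging edge for vertex %s in job %s", vertexDigest, jobID)

	// After: the same information as discrete logrus fields, so a parser can
	// filter on vertex/job without running regexes over the message text.
	logrus.WithFields(logrus.Fields{
		"vertex": vertexDigest,
		"job":    jobID,
	}).Debug("merging edge")
}
```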

I also ended up removing one of the logs I recently added that printed
each job for a vertex in `loadUnlocked`. I realized that information was
parsable from the rest of the logs and thus was mostly just extra noise.

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@tonistiigi
Member

In the case where a job is released and it releases all of its states except the one that was the targetState for another job's merged edge, is it possible that this lone remaining state wants to access its parents and would still get an error because they have been released?

@sipsma
Collaborator Author

sipsma commented Apr 27, 2024

In the case where a job is released and it releases all of its states except the one that was the targetState for another job's merged edge, is it possible that this lone remaining state wants to access its parents and would still get an error because they have been released?

Oh yeah, good catch, that sounds highly plausible. Will fix by just adding jobs recursively to the state and its ancestors + cover it in a unit test.
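
Roughly, the recursive version would look something like this sketch; all the types and the `addJobs` body below are illustrative stand-ins, not the actual buildkit implementation:

```go
package solver

// Illustrative stand-ins for buildkit's internal types; only the recursion
// shape matters here.
type Job struct{ id string }

type Vertex interface {
	Digest() string
	Inputs() []Edge
}

type Edge struct{ Vertex Vertex }

type state struct {
	jobs   map[*Job]struct{}
	vtx    Vertex
	solver *solver
}

type solver struct {
	actives map[string]*state // vertex digest -> state
}

// addJobs registers every job from srcState on st and, recursively, on the
// states of all of st's ancestors; memo guards against revisiting shared
// ancestors in a diamond-shaped graph.
func (st *state) addJobs(srcState *state, memo map[*state]struct{}) {
	if _, ok := memo[st]; ok {
		return
	}
	memo[st] = struct{}{}

	for j := range srcState.jobs {
		st.jobs[j] = struct{}{}
	}

	for _, in := range st.vtx.Inputs() {
		if parent, ok := st.solver.actives[in.Vertex.Digest()]; ok {
			parent.addJobs(srcState, memo)
		}
	}
}
```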

Signed-off-by: Erik Sipsma <erik@sipsma.dev>
@sipsma
Collaborator Author

sipsma commented Apr 27, 2024

@tonistiigi Pushed a fix for that: ffdbd86

I realized in practice I'm not sure how that case could actually get hit since it would require that one edge is mergeable but it has ancestors that aren't mergeable. But I left the fix in anyway, since there's a high probability that either I'm not thinking of a case that's currently possible or future modifications will make it possible somehow, in which case we don't want to be on shaky ground.

@tonistiigi
Member

I realized in practice I'm not sure how that case could actually get hit since it would require that one edge is mergeable but it has ancestors that aren't mergeable.

If the merge happens based on content checksum (cache-slow case) then the parent chains can be completely different.

@tonistiigi
Member

If the merge happens based on content checksum (cache-slow case) then the parent chains can be completely different.

Although, in this case if the cache-slow has been computed that should mean the parent is already in completed state and likely will not be called into again.

But I don't think there is a guarantee the parent has merged even without cache-slow, even if the practical case is hard to think of for current LLB ops.

@jedevc
Member

jedevc commented Apr 29, 2024

I might be misunderstanding the recursive part of this here - but essentially we're taking all ancestors of a node and adding the job to each state here?

Isn't state.jobs also used for other things though? Just glancing through, it appears it's used as a SessionIterator (so maybe we could end up accidentally leaking sessions into ancestors that shouldn't have access to them? e.g. if job B merges to job A, should vertices executed in this context have access to the session from B - not sure, but I do think it's a subtle behavior change). It's also used in connectProgressFromState (which might end up connecting more progress than it should?).

Not quite sure about this, so take it with a grain of salt 😆

@sipsma
Collaborator Author

sipsma commented Apr 29, 2024

I might be misunderstanding the recursive part of this here - but essentially we're taking all ancestors of a node and adding the job to each state here?

Isn't state.jobs also used for other things though? Just glancing through, it appears it's used as a SessionIterator (so maybe we could end up accidentally leaking sessions into ancestors that shouldn't have access to them? e.g. if job B merges to job A, should vertices executed in this context have access to the session from B - not sure, but I do think it's a subtle behavior change). It's also used in connectProgressFromState (which might end up connecting more progress than it should?).

Not quite sure about this, so take it with a grain of salt 😆

Good question, on one hand I agree that attaching session resources to those ancestor states feels bad, on the other hand if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail... And even if we do go with the current approach it's reasonably unlikely the client of the source of the edge merge has all the right attachables, so still could fail.

Same idea for progress, but rather than failing you would end up being blocked on execution that you get no progress updates about.

I'm not seeing any good immediate answers here. It might feel reasonable to just never do an edge merge into an edge that has uncompleted vertices with session attachables in itself or in an ancestor, but I don't think there's any plumbing right now to identify vertices with session attachables on this level of the solver (yet).

Wdyt @tonistiigi ? Does tracking vertices with session attachables and updating the edge merge logic as described above feel reasonable to you?

I would personally also be okay with just going back to not including ancestors for now and thinking about this in followups. This issue is causing tons of problems for us now so getting the fix in ASAP is preferable even if it's not theoretically 100% comprehensive. It seems to solve the immediate problems for us as is, so it's still a step forward. But if we agree on an approach to deal with this and it's not enormously time consuming I can squash it now.

@jedevc
Member

jedevc commented Apr 29, 2024

if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail...

Hm, I hadn't thought of that side - I think I'd rather have something that "leaks" in a maybe slightly unexpected way, instead of erroring out, or not edge merging where we might expect.

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way, so this doesn't feel like a significantly larger issue to me (though in general, how we handle this does feel like it has implications for how we're using/planning to use sessions in dagger - but that's more a downstream kind of question 😆)

@sipsma
Collaborator Author

sipsma commented Apr 29, 2024

if we don't then you can get into a situation where those vertices execute and no longer have any sessions attached and thus are guaranteed to fail...

Hm, I hadn't thought of that side - I think I'd rather have something that "leaks" in a maybe slightly unexpected way, instead of erroring out, or not edge merging where we might expect.

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way, so this doesn't feel like a significantly larger issue to me (though in general, how we handle this does feel like it has implications for how we're using/planning to use sessions in dagger - but that's more a downstream kind of question 😆)

Generally agree, honestly if there was a "userspace" option on LLB vertexes to say "never merge this" that would solve a lot of these cases for us.

@tonistiigi
Member

I think as implemented, this looks right to me then - already, when edge merging, we merge together sessions in this way,

That's already how all the requests are combined in the graph; whether it is connected directly via LLB digest on load or later via
"merged edges" doesn't make a difference. A single buildkit instance combines requests and does not provide isolation between them.

@tonistiigi
Member

It's also used in connectProgressFromState (which might end up connecting more progress than it should?).

It should not connect progress from the parents, but yes it would get tricky if there are parents that are not fully loaded yet. Based on #4887 (comment) maybe we can assume that this can't happen in LLB atm.

It might feel reasonable to just never do an edge merge into an edge that has uncompleted vertices with session attachables in itself or in an ancestor, but I don't think there's any plumbing right now to identify vertices with session attachables on this level of the solver (yet).

Any estimation of how many merged edges would behave like this? Most of the cases come from multiple local sources tracking the same files. So these do have sessions associated with them, but all of them should be completed already because the cache key based on content has been computed.

Another approach: in the merged edge we could set the sharedOp from the discarded edge as a fallback to the new target, so that if the target fails (e.g. because it was cancelled and the session is gone) it would fall back to the original one.
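
A rough sketch of that fallback idea (hypothetical interface and struct, not the solver's real sharedOp API):

```go
package solver

import "context"

// Hypothetical shapes, not the solver's real sharedOp interfaces.
type cacheMapper interface {
	CacheMap(ctx context.Context, index int) (string, error)
}

// fallbackOp keeps the discarded source edge's op around and only consults it
// when the merge target's op can no longer complete (e.g. its job was
// cancelled and its session is gone).
type fallbackOp struct {
	target   cacheMapper // op of the edge everything was merged into
	fallback cacheMapper // op of the discarded source edge
}

func (f *fallbackOp) CacheMap(ctx context.Context, index int) (string, error) {
	key, err := f.target.CacheMap(ctx, index)
	if err == nil {
		return key, nil
	}
	// Target failed; retry against the original op before giving up.
	return f.fallback.CacheMap(ctx, index)
}
```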

@sipsma
Collaborator Author

sipsma commented Apr 29, 2024

Any estimation of how many merged edges would behave like this? Most of the cases come from multiple local sources tracking the same files. So these do have sessions associated with them, but all of them should be completed already because the cache key based on content has been computed.

ExecOps with secrets also come to mind, but I think they are in the same boat in that they should already have run if they are part of an edge merge (surely there's some case where that's not true somehow, but it's probably a small subset, and the only downside is that the slight optimization of edge merging doesn't happen).

Another approach: in the merged edge we could set the sharedOp from the discarded edge as a fallback to the new target, so that if the target fails (e.g. because it was cancelled and the session is gone) it would fall back to the original one.

To confirm I'm following: if an edge transitions to a state where it would actually execute the sharedOp but the original job is gone, "unmerge" and use a different src edge with a job still around?

That makes conceptual sense if so. TBH my only concern with it is that it seems like it would involve tricky intricate changes to the scheduler, which is hard to modify without breaking something else unintentionally. Maybe it's not as bad as I'm imagining though?

  • Side note: this reminded me of the long-standing issue where an op ends up lazy due to remote cache but then gets unlazied later, there's a failure pulling layers, and the whole build fails. Maybe completely separate, but it feels related somehow in that it's another case where you want the scheduler to decide to execute an op differently than it previously decided. Feel free to ignore if irrelevant.

@tonistiigi
Member

"unmerge" and use a different src edge with a job still around?

I don't think we would need "full unmerge". The possible issues could come from sharedOp when calling CacheMap() or Exec(), right? So we could just fall back to the old implementation of sharedOp for these.

For something like mounting secrets during exec I'm not sure if this is even possible. We just need to make sure that the job/session from JobB is added to the merged edge and when it runs it uses sessiongroup.Any so even if JobA is gone, it should not cause an issue.

If it is possible that the parent gets called then it would be different, but then we don't really have a way to "unmerge" anyway, because the step failing is not the merged step - it just happens to have a child step later that is merged. Not sure if this is possible in LLB though.

}
inputState.addJobs(srcState, memo)

// tricky case: if the inputState's edge was *already* merged we should
Member

If this case actually produces inputState != mergedInputState, is there even a need to bother with the inputState?

Collaborator Author

Provided that the whole codebase strictly uses getEdge/getState (and thus always uses the merge target if a merge happened), then yeah you could probably skip inputState.addJobs and just unconditionally call inputState.getEdge and use the return of that.

I'm not sure if that's true (haven't audited everything), do you happen to know if it is?

@sipsma
Collaborator Author

sipsma commented Apr 30, 2024

"unmerge" and use a different src edge with a job still around?

I don't think we would need "full unmerge". The possible issues could come from sharedOp when calling CacheMap() or Exec(), right? So we could just fall back to the old implementation of sharedOp for these.

For something like mounting secrets during exec I'm not sure if this is even possible. We just need to make sure that the job/session from JobB is added to the merged edge and when it runs it uses sessiongroup.Any so even if JobA is gone, it should not cause an issue.

If it is possible that the parent gets called then it would be different, but then we don't really have a way to "unmerge" anyway, because the step failing is not the merged step - it just happens to have a child step later that is merged. Not sure if this is possible in LLB though.

@tonistiigi I just looked into the details of this:

  1. The getState in CalcSlowCache that was returning the error for us is here: https://github.com/sipsma/buildkit/blob/ffdbd863b34ad0c6505949ebbda5e61722c939ad/solver/jobs.go#L934-L934
  2. It's using s.st.vtx, which means that s.st.vtx is not a source of a merge, it's the target.
    • Basically, this bug actually requires there to be two levels of stale states. s.st has to be not in the actives map and s.st.vtx.Inputs()[index] also has to not be in the actives map.

So AFAICT the fix would require that sharedOp track all the states that have been merged into it (as opposed to tracking one extra state for the state that it was merged into). I guess that's possible, though there are many usages of sharedOp.st that I'd need to look through and see if they should be impacted.

Is that what you were imagining?

ludamad added a commit to AztecProtocol/aztec-packages that referenced this pull request May 2, 2024
Until something like moby/buildkit#4887 lands upstream, looks like this can cause the cache to not build further
ludamad added a commit to AztecProtocol/aztec-packages that referenced this pull request May 2, 2024
Until something like moby/buildkit#4887 lands
upstream, looks like this can cause the cache to not build further
@tonistiigi added this to the v0.14.0 milestone May 2, 2024
@sipsma
Collaborator Author

sipsma commented May 4, 2024

ping @tonistiigi on the questions asked here #4887 (comment)

@tonistiigi
Member

So AFAICT the fix would require that sharedOp track all the states that have been merged into it (as opposed to tracking one extra state for the state that it was merged into). I guess that's possible, though there are many usages of sharedOp.st that I'd need to look through and see if they should be impacted.

I think we first need to figure out if this case is even possible. Otherwise, we might be complicating things unnecessarily. So are there cases where 1) an edge is merged with a target from jobA, 2) that target still needs to resolve its own inputs or call other sharedOp methods, and 3) when the original jobA cancels it would cause a sharedOp method to error as well (normally covered by sessionmanager.Any)?

For 3 to appear it would mean that different tasks from different sessions resolved to the same cache key (it needs to be a task that only works on one session and not the other for the Any to not work). But in that case it should be a content-based cache key and already resolved, so 2 should not apply.

If we have a case then we can look at what the constraints for "unmerge" are for that case. It could be possible, for example, that it can exist but only for roots, so the getState-on-input that you linked in your comment does not apply in that case.

@sipsma
Collaborator Author

sipsma commented May 7, 2024

One quick update, it does appear that some form of the more complex change (where we add jobs to all ancestor states) is needed due to the fact that walkProvenance looks at the actives map for all ancestors: https://github.com/dagger/buildkit/blob/63352b0fb96b6231686023fb83d4c708bd27d430/solver/jobs.go#L735-L735

Still looking through what cases are possible in practice and whether adding jobs to all ancestors is sufficient or if more is needed.

@tonistiigi
Member

That walkProvenance case is interesting indeed. In that case you would expect the original LLB that actually matches the vertex digest, but if it calls into op.Pin() then that op should be the one that actually ran (the merged target). A case to test that would be to do llb.Image(alpine:latest) and llb.Image(alpine:3.19) in parallel and check the provenance afterwards (this case cannot exist in Dockerfile because the FROM commands are resolved to a digest before the LLB is converted).
