Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal: New Healthy condition for claims and XRs #5643

Open
negz opened this issue May 1, 2024 · 4 comments
Open

Proposal: New Healthy condition for claims and XRs #5643

negz opened this issue May 1, 2024 · 4 comments

Comments

@negz
Copy link
Member

negz commented May 1, 2024

Executive summary

I propose we add a new Healthy status condition to claims and XRs. This condition would indicate whether the resource and all of its child resources were healthy - i.e. weren't currently encountering any errors.

This would make it possible to quickly determine whether a claim isn't ready because it's hitting an error, or because it just needs more time to become ready.

Importantly, the new condition wouldn't include the specific error(s) child resources were encountering. It would only indicate that there were or weren't errors.

What problem are you facing?

Claims, XRs, and MRs have the following status conditions:

  • Ready - Indicates that Crossplane believes the resource is ready to use
  • Synced - Indicates that Crossplane successfully synced the resource on the last reconcile

Ready doesn't mean "Crossplane is successfully reconciling this resource". It means "the thing this resource represents appears to be ready to use". For example if a claim represents a database, the Ready condition indicates that Crossplane believes the database is ready to use. It's ready for some application to connect to it and store data.

Ready is cumulative. A claim is only Ready when its underlying XR is Ready. The XR is only Ready when all its composed resources are Ready.

Synced does mean "Crossplane is successfully reconciling this resource". It isn't cumulative, though. It only means "this controller reconciled its most recent desired state successfully". For example an XR could reconcile its desired state by creating three misconfigured composed resources that would never become Synced or Ready. The XR is still Synced though, because its controller did its job. It successfully synced its desired state by creating composed resources.

I think these two conditions make sense. For example:

  • A resource could be Synced because Crossplane has successfully applied its desired state, but not yet Ready to use.
  • A resource could be Ready to use (as far as Crossplane knows), but no longer Synced because Crossplane is hitting an error.

There's an information gap though. Today if you look at a newly created claim you'll likely see:

  • Synced: True - The claim successfully created and is syncing with an XR
  • Ready: False - The conceptual resource the claim represents (e.g. a database) isn't ready to use yet

It's possible the XR controller, and all composed resource controllers are working fine but some MR isn't ready to use yet - e.g. the RDS API is reporting that a composed RDS instance isn't ready to use. It's also possible that one of the controllers is stuck due to an error.

Today there's no way to distinguish between these two cases without inspecting the next layer down.

Folks have been feeling the pain of this gap for some time, but I've been reluctant to address it because the proposed solution has always been "take the errors from lower level controllers and propagate them to the higher level controllers" - for example #3262.

@vfarcic explains why that isn't a great solution in https://www.youtube.com/watch?v=xAl3TAfFE_M. Claims, XRs, and MRs are all different layers of abstraction, with different intended audiences. An error that makes sense at one layer probably won't make sense at a higher layer of abstraction.

Consider for example an XR error due to an invalid Composition. I don't think it makes sense to copy that error verbatim to the claim. Assuming the claim author isn't writing their own Composition, the error isn't actionable for them.

One way to address this is to build a presentation layer that allows Composition authors to choose which errors were surfaced to the claim and how. That's what we're working on in #5402.

The downside of the presentation layer approach is that the information gap still exists by default. If the Composition author doesn't actively "present" errors to the claim, there's still no way to distinguish between a claim that's not ready due to an error, or a claim that's not ready because the cloud provider is still working on creating one of the composed resources.

How could Crossplane help solve your problem?

I propose we add a new Healthy status condition to claims and XRs.

A claim would be Healthy only when:

  • The claim controller was Synced
  • The underlying XR controller was Synced and Healthy

An XR would be Healthy only when:

  • The XR controller was Synced
  • All composed resources were Synced, and if applicable (e.g. the composed resource is an XR) Healthy

The Healthy condition would in effect be a cumulative version of the Synced condition. I'm proposing a new condition instead of proposing that we change the meaning of the Synced condition because changing the meaning would be a significant behavior change.

Importantly, I'm not proposing that we copy error messages from child to parent resources. For example if a claim wasn't Healthy I'd expect the status condition message to say something generic like "N/M composed resources are not synced" or just "Some resources encountered errors".

Under this proposal a claim status might look like this:

status:
  conditions:
  - type: "Ready"
    status: "False"
    reason: "BindCompositeResource"
    message: "The composite resource is not yet ready"
  - type: "Healthy"
    status: "False"
    reason: "UnhealthyCompositeResource"
    message: "The composite resource is not healthy"
  - type: "Synced"
    status: "True"
    reason: "ReconcileSuccess"

An XR status might look like this:

status:
  conditions:
  - type: "Ready"
    status: "False"
    reason: "Creating"
    message: "Unready resources: some-composed-resource, another-composed-resource"
  - type: "Healthy"
    status: "False"
    reason: "UnhealthyComposedResources"
    message: "Unhealthy resources: some-composed-resource"
  - type: "Synced"
    status: "True"
    reason: "ReconcileSuccess"
@negz
Copy link
Member Author

negz commented May 1, 2024

CC @dalton-hill-0 @TerjeLafton @jbw976 - Just an FYI as this is tangentially related to designs you've all been working on.

@birapjr
Copy link

birapjr commented May 2, 2024

Hi @negz Very good feature. It will help a lot to a fast investigation on issues. We can see a bit of the Health status via crossplane beta trace cli command, but its not clear when you get status only of one resource.

@dariozachow
Copy link

I think this would be very helpful, struggling with this atm.

@negz
Copy link
Member Author

negz commented May 22, 2024

If anyone from the community wants to give this a shot, I'd be supportive.

I think it would require a little experience with Crossplane though, and the update needs to be made in a few places (crossplane-runtime, providers, core, etc).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants