
All Kuberhealthy checks creation failing after installing latest v2.6.0 release #988

Closed
AshutoshNirkhe opened this issue Jul 21, 2021 · 10 comments
Assignees
Labels
bug Something isn't working Stale

Comments

@AshutoshNirkhe
Contributor

AshutoshNirkhe commented Jul 21, 2021

Describe the bug
After installing the latest (v2.6.0) Kuberhealthy release, the new CRDs got installed, but creation of all khchecks and khstates started failing with the error:
"invalid: spec.LastRun: Invalid value: "null": spec.LastRun in body must be of type string: "null""

Steps To Reproduce

  • Install v2.6 Kuberhealthy
  • Enable a few upstream checks such as deployment, dns-internal, etc.

Expected behavior
All khcheck pods and khstates should have been created properly.


Logs from the Kuberhealthy backend pods:

time="2021-07-20T15:08:19Z" level=info msg="Starting check: synthetic / dns-status-internal"
time="2021-07-20T15:08:19Z" level=info msg="Running check: dns-status-internal"
time="2021-07-20T15:08:19Z" level=debug msg="Generated new UUID for external check: e1f45822-0d5e-40a2-8420-6228f53eaced"
time="2021-07-20T15:08:19Z" level=info msg="e1f45822-0d5e-40a2-8420-6228f53eaced synthetic/dns-status-internal: [Setting expected UUID to: e1f45822-0d5e-40a2-8420-6228f53ea
ced]"
time="2021-07-20T15:08:19Z" level=info msg="Starting check: synthetic / deployment"
time="2021-07-20T15:08:19Z" level=info msg="Running check: deployment"
time="2021-07-20T15:08:19Z" level=debug msg="Generated new UUID for external check: 689c1ec7-45a0-49a6-9d9c-24300f63ebf6"
time="2021-07-20T15:08:19Z" level=info msg="689c1ec7-45a0-49a6-9d9c-24300f63ebf6 synthetic/deployment: [Setting expected UUID to: 689c1ec7-45a0-49a6-9d9c-24300f63ebf6]"
time="2021-07-20T15:08:20Z" level=info msg="689c1ec7-45a0-49a6-9d9c-24300f63ebf6 synthetic/deployment: [khstate did not exist, so a default object will be created]"
time="2021-07-20T15:08:20Z" level=info msg="689c1ec7-45a0-49a6-9d9c-24300f63ebf6 synthetic/deployment: [Creating khstate deployment synthetic because it did not exist]"
time="2021-07-20T15:08:20Z" level=info msg="e1f45822-0d5e-40a2-8420-6228f53eaced synthetic/dns-status-internal: [khstate did not exist, so a default object will be created]
"
time="2021-07-20T15:08:20Z" level=info msg="e1f45822-0d5e-40a2-8420-6228f53eaced synthetic/dns-status-internal: [Creating khstate dns-status-internal synthetic because it d
id not exist]"
time="2021-07-20T15:08:20Z" level=info msg="e1f45822-0d5e-40a2-8420-6228f53eaced synthetic/dns-status-internal: [failed to create a khstate after finding that it did not ex
ist: KuberhealthyState.comcast.github.io \"dns-status-internal\" is invalid: spec.LastRun: Invalid value: \"null\": spec.LastRun in body must be of type string: \"null\"]"
time="2021-07-20T15:08:20Z" level=error msg="Error running check: dns-status-internal in namespace synthetic: KuberhealthyState.comcast.github.io \"dns-status-internal\" is
 invalid: spec.LastRun: Invalid value: \"null\": spec.LastRun in body must be of type string: \"null\""
time="2021-07-20T15:08:20Z" level=debug msg="Checking existence of custom resource: dns-status-internal"

Versions

  • Cluster OS: [e.g. CoreOS 1800.1.0]
  • Kubernetes Version: 1.20.7
  • Kuberhealthy Release v2.6.0

Additional context
All checks work perfectly fine with the v2.5.0 KH release, but fail with a fresh v2.6.0 installation.
The LastRun field seems to come from the khstate spec:
https://github.com/kuberhealthy/kuberhealthy/pull/937/files#diff-5e5ae41d19dd83d75f4a16a51db94409dbebeac54e7cd0b9454b84eb1b74741eR38
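
For reference, here is a minimal sketch of how the field was probably declared before the fix (an assumption based on the error message and the struct posted later in this thread, not the actual repository source). A value-typed metav1.Time marshals to JSON null when it holds its zero value, while the generated CRD schema types spec.LastRun as a plain string with no nullable marker, so the API server rejects the create:

// Sketch only -- assumed pre-fix shape of the khstate spec, not verbatim source.
package khstatev1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type WorkloadDetails struct {
	// ... other fields elided ...

	// LastRun is a non-pointer metav1.Time; its zero value serializes as JSON null,
	// which a CRD schema of "type: string" without "nullable: true" rejects.
	LastRun metav1.Time `json:"LastRun" yaml:"LastRun"`
}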

Another question I have: when I tried upgrading from v2.5.0 to v2.6.0, the CRDs were not updated automatically, and hence the checks continued to run fine. I guess helm upgrade doesn't handle this. Do we need to remove the older CRDs and install the new ones manually before the chart upgrade? How are we supposed to handle existing custom resources (checks/jobs) in that case?

@AshutoshNirkhe AshutoshNirkhe added the bug Something isn't working label Jul 21, 2021
@integrii
Collaborator

Thanks for the report @AshutoshNirkhe - we will work on reproducing this one. We haven't seen this in our end to end tests or in our clusters, which is a concern. @jonnydawg is checking this out.

@jonnydawg
Collaborator

Looks like this might be an issue related to how we generated our CRDs -- leaving this here as a note for later reference:

https://github.com/kubernetes/kubernetes/issues/58311
https://github.com/kubernetes/kubernetes/issues/66899
https://github.com/kubernetes/kubernetes/issues/86811
https://github.com/kubernetes/code-generator/tree/master/examples/crd/apis/example/v1
https://github.com/kubernetes-sigs/cluster-api/blob/bf790fc2a53614ff5d3405c83c0de0dd3303bb1f/api/v1alpha2/common_types.go#L46-L67

@jonnydawg jonnydawg mentioned this issue Jul 30, 2021
@integrii
Collaborator

integrii commented Aug 2, 2021

This should be fixed in v2.7.0.

@integrii integrii closed this as completed Aug 2, 2021
@AshutoshNirkhe
Contributor Author

This should be fixed in v2.7.0.

@integrii @jonnydawg Thanks for fixing this. I had one query: do we need to manually drop the older CRDs and install the new ones?
As it stands, helm upgrade to v2.6/v2.7 doesn't really apply the new CRDs.

@jonnydawg
Collaborator

Yes! You would have to manually wipe the CRDs unfortunately.

The CRD changes I made from v2.6.0 -> v2.7.0 update the kuberhealthy CRDs to different k8s objects. Instead of apiextensions.k8s.io/v1beta1 the objects now live under apiextensions.k8s.io/v1 -- this means that they are different objects under different API groups.

This is what I am seeing in this scenario if you are going from v2.6.0 -> v2.7.0:

  • Applying and installing v2.7.0 CRDs will create new objects under apiextensions.k8s.io/v1
  • Older CRDs from v2.6.0 will not get updated as they are considered different objects (though with the same name)

This creates a possible scenario where two groups of CRDs could exist -- or a failure to apply / install.
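
For anyone doing the wipe by hand, the usual route is kubectl delete crd on the Kuberhealthy CRDs before re-applying the v2.7.0 manifests. Below is a rough client-go sketch of the same step; the CRD names (khchecks/khstates/khjobs under comcast.github.io) and the kubeconfig path are assumptions, so verify them against your install first. Note that deleting a CRD also deletes all of its existing custom resources (khchecks, khstates, khjobs):

// Sketch: remove the old Kuberhealthy CRDs before applying the v2.7.0 manifests.
package main

import (
	"context"
	"log"

	apiextclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a local kubeconfig at the default location (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := apiextclient.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Assumed CRD names -- verify with `kubectl get crd | grep comcast.github.io`.
	for _, name := range []string{
		"khchecks.comcast.github.io",
		"khstates.comcast.github.io",
		"khjobs.comcast.github.io",
	} {
		// The old CRDs were installed as apiextensions.k8s.io/v1beta1 objects,
		// so delete them through the v1beta1 client before applying the v1 versions.
		err := client.ApiextensionsV1beta1().CustomResourceDefinitions().
			Delete(context.TODO(), name, metav1.DeleteOptions{})
		if err != nil {
			log.Printf("delete %s: %v", name, err)
		}
	}
}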

@jonnydawg jonnydawg reopened this Aug 25, 2021
@jonnydawg
Collaborator

Leaving this one open for awareness.

@AshutoshNirkhe
Contributor Author

AshutoshNirkhe commented Aug 26, 2021

Yes! You would have to manually wipe the CRDs unfortunately.

The CRD changes I made from v2.6.0 -> v2.7.0 update the kuberhealthy CRDs to different k8s objects. Instead of apiextensions.k8s.io/v1beta1 the objects now live under apiextensions.k8s.io/v1 -- this means that they are different objects under different API groups.

This is what I am seeing in this scenario if you are going from v2.6.0 -> v2.7.0:

  • Applying and installing v2.7.0 CRDs will create new objects under apiextensions.k8s.io/v1
  • Older CRDs from v2.6.0 will not get updated as they are considered different objects (though with the same name)

This creates a possible scenario where two groups of CRDs could exist -- or a failure to apply / install.

Thanks for confirming @jonnydawg. I was upgrading from v2.5 to v2.6 initially, and now to v2.7, but the results were the same in either case. Maybe you can add this as a warning in the release notes section?

@integrii integrii pinned this issue Feb 25, 2022
@Hungrylion2019
Collaborator

Hungrylion2019 commented Feb 28, 2022

I have run into this issue many times, but not with the LastRun field.

How I deal with this issue:

  1. Modify the khstate types.go file to add the "+nullable" annotation to the fields of the WorkloadDetails struct (see the sketch after the current struct below).
  2. Run the script /scripts/generate.sh to regenerate the CRDs.
  3. Apply the new crd.yaml in the cluster.
  4. Apply kuberhealthy.yaml and checker.yaml.

Shortcomings
The present code cannot cover every situation if the API server's validation becomes stricter.

The key code path where validation fails:

func (ext *Checker) setUUID(uuid string) error {
	ext.log("Setting expected UUID to:", uuid)
	checkState, err := ext.getKHState()

	// if the fetch operation had an error, but it wasn't 'not found', we return here
	if err != nil && !(k8sErrors.IsNotFound(err) || strings.Contains(err.Error(), "not found")) {
		return fmt.Errorf("error setting uuid for check %s %w", ext.CheckName, err)
	}

	// if the check was not found, we create a fresh one and start there
	if err != nil && (k8sErrors.IsNotFound(err) || strings.Contains(err.Error(), "not found")) {
		ext.log("khstate did not exist, so a default object will be created")
		details := khstatev1.NewWorkloadDetails(ext.KHWorkload)
		details.Namespace = ext.CheckNamespace()
		// the "AuthoritativePod" field is empty at this point (ext.hostname == "")
		details.AuthoritativePod = ext.hostname
		details.OK = true
		// the "CurrentUUID" field is empty at this point
		details.RunDuration = time.Duration(0).String()
		// the "Node" field is empty at this point
		newState := khstatev1.NewKuberhealthyState(ext.CheckName, details)
		newState.Namespace = ext.Namespace
		ext.log("Creating khstate", newState.Name, newState.Namespace, "because it did not exist")
		_, err = ext.KHStateClient.KuberhealthyStates(ext.CheckNamespace()).Create(&newState)
		if err != nil {
			ext.log("failed to create a khstate after finding that it did not exist:", err)
			return err
		}
		// ... (remainder of the function elided)

The current types.go for khstate only marks "LastRun" and "khWorkload" as nullable:

type WorkloadDetails struct {
	OK          bool     `json:"OK" yaml:"OK"`                   // true or false status of the khWorkload, whether or not it completed successfully
	Errors      []string `json:"Errors" yaml:"Errors"`           // the list of errors reported from the khWorkload run
	RunDuration string   `json:"RunDuration" yaml:"RunDuration"` // the time it took for the khWorkload to complete
	Namespace   string   `json:"Namespace" yaml:"Namespace"`     // the namespace the khWorkload was run in
	Node        string   `json:"Node" yaml:"Node"`               // the node the khWorkload ran on
	// +nullable
	LastRun          *metav1.Time `json:"LastRun,omitempty" yaml:"LastRun,omitempty"` // the time the khWorkload was last run
	AuthoritativePod string       `json:"AuthoritativePod" yaml:"AuthoritativePod"`   // the main kuberhealthy pod creating and updating the khstate
	CurrentUUID      string       `json:"uuid" yaml:"uuid"`                           // the UUID that is authorized to report statuses into the kuberhealthy endpoint
	// +nullable
	khWorkload *KHWorkload `json:"khWorkload,omitempty" yaml:"khWorkload,omitempty"`
}
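
A sketch of the workaround from the steps above (an illustration of the idea, not the upstream source): mark the fields that can legitimately be empty at creation time as nullable/omitempty, then regenerate the CRD with /scripts/generate.sh:

// Sketch of the workaround described above -- illustration only, not upstream source.
// Marking the optional fields nullable (and omitting them when empty) keeps a
// stricter API server from rejecting the default khstate object.
type WorkloadDetails struct {
	OK          bool     `json:"OK" yaml:"OK"`
	Errors      []string `json:"Errors" yaml:"Errors"`
	RunDuration string   `json:"RunDuration" yaml:"RunDuration"`
	Namespace   string   `json:"Namespace" yaml:"Namespace"`
	// +nullable
	Node string `json:"Node,omitempty" yaml:"Node,omitempty"` // empty until the workload has run
	// +nullable
	LastRun *metav1.Time `json:"LastRun,omitempty" yaml:"LastRun,omitempty"`
	// +nullable
	AuthoritativePod string `json:"AuthoritativePod,omitempty" yaml:"AuthoritativePod,omitempty"` // empty at creation time
	// +nullable
	CurrentUUID string `json:"uuid,omitempty" yaml:"uuid,omitempty"` // empty at creation time
	// +nullable
	khWorkload *KHWorkload `json:"khWorkload,omitempty" yaml:"khWorkload,omitempty"`
}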


This issue is stale because it has been open 30 days with no activity. Remove stale label or comment on the issue or this will be closed in 15 days.

@github-actions github-actions bot added the Stale label Jan 28, 2024

This issue was closed because it has been stalled for 15 days with no activity. Please reopen and comment on the issue if you believe it should stay open.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Feb 19, 2024