During the update from 1.8 to 1.9 the operator started crashing #14222
@traceroute42 It seems like you are using Rook to connect to an external Ceph cluster (RHCS). I'm wondering why you have the mds/osd pods running in the k8s env; those daemons should just remain on the external Ceph cluster side and not run in the k8s Rook cluster.
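A quick way to check which Ceph daemon pods are actually running on the Kubernetes side (a sketch, assuming the default app labels Rook applies and the rook-ceph / rook-ceph-external namespaces used later in this thread):
kubectl -n rook-ceph get pods -l 'app in (rook-ceph-mon,rook-ceph-osd,rook-ceph-mds)'
kubectl -n rook-ceph-external get pods -l 'app in (rook-ceph-mon,rook-ceph-osd,rook-ceph-mds)'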
@parth-gr The K8s ceph cluster has And external
So do you have 2 CephClusters? Can you show the different CephClusters running?
And the external Ceph cluster on HDD was updated successfully, but the internal cluster on NVMe is blocked?
kg cephclusters.ceph.rook.io -n rook-ceph
and with the operator running, after a restart and before the crash:
kg cephclusters.ceph.rook.io -n rook-ceph-external
but I tried a few times and it threw an error too
At the moment we're updating rook-ceph. We did the update from 1.7 -> 1.8 and it went flawlessly, and we're trying 1.8 -> 1.9 because it still supports Ceph version 15.2. Then we're going to update the Ceph cluster to 16, but after updating rook-ceph from 1.8 -> 1.9 the operator is crashing, so we haven't updated Ceph from 15 -> 16 on either cluster.
There are some crash conditions; I suggest restarting the Rook operator pod and then sharing the operator logs. Maybe there was network latency during the updates. PS: Also provide the output of:
kubectl get secrets rook-csi-rbd-provisioner -n rook-ceph-external
kubectl get cm -n rook-ceph
Logs from the operator at INFO and DEBUG level:
Can you delete the rook-operator pod?
I see the rook-csi-rbd-provisioner secret exists, but in the reconcile some context got cancelled somehow and it's stuck there.
These logs above are from directly after deleting the operator pod.
So, in conclusion, the internal cluster is failing
Looks like the context is nil and is still passed here; we might need to improve this
And the external cluster
Looks like a network issue
The logs are full of
Do you mean the logs after an operator reboot? Then after it crashes without deleting it? Application pods in the cluster can access both the internal and external cluster storage.
It failed here
I did disable monitoring on the external cluster.
Logs after deleting the operator, and after the first crash without deleting it:
I am out of ideas on this.
@travisn do you have any idea?
The crash is happening in the goroutine that runs the mon health check. As a workaround, try disabling the health check for mon failover. See this topic. This should do it:
healthCheck:
  daemonHealth:
    mon:
      disabled: true
Then, once you can continue upgrading to a newer version of Rook, you can re-enable the health checks. If you're still seeing an issue on a newer version, we can look into a fix.
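For context, a minimal sketch of where that block sits in the CephCluster CR (assuming the default rook-ceph name and namespace; the same setting would also go in the external cluster's CR):
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # ...existing spec fields unchanged...
  healthCheck:
    daemonHealth:
      mon:
        disabled: true
Or as a one-off patch (cluster name and namespace assumed, adjust to your setup):
kubectl -n rook-ceph patch cephcluster rook-ceph --type merge -p '{"spec":{"healthCheck":{"daemonHealth":{"mon":{"disabled":true}}}}}'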
I have some general suggestions that might also help, if they apply.
If you aren't on the highest .z version (e.g., vX.y.z) of Kubernetes, you might try updating k8s. I recall a while back there was a k8s issue that affected configmaps that could result in
Also, be sure you are following the upgrade guides carefully when doing the upgrades. Some of the upgrades require manual steps, and we have seen users miss them by accident. In particular, make sure to update
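For example (assuming the step in question is updating the common resources and CRDs, which the v1.9 upgrade guide lists as a manual step run from the release's deploy/examples directory):
git clone --single-branch --depth=1 --branch v1.9.13 https://github.com/rook/rook.git
cd rook/deploy/examples
kubectl apply -f common.yaml -f crds.yaml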
I did that on both the internal and external clusters.
internal:
external:
but it's still crashing
We're on 1.26.9 right now. I see it's possible to update to 1.26.15-1.1, so I might try.
Edit: after updating the master nodes from 1.26.9 to 1.26.15-1.1, it's still crashing.
Looks like the same problem,
and there seems to be no option to disable telemetry.
Strange,
@travisn can we suggest updating to 1.10?
It requires updating the Ceph cluster from 15.2 to at least 16.x.
This would require an update of both the external and internal clusters; can that even be done if the rook-operator is crashing?
@traceroute42 Any luck or any more clues? I'm not sure what will help here. I wonder whether you could upgrade Ceph if you first downgraded Rook back to 1.8. Downgrades aren't tested so I would hesitate, but if the operator is failing anyway it might be worth trying.
During the update, the operator started crashing. The rook-ceph-crashcollector deployments were updated to rook-version=v1.9.13, while the osd, mon, and mds deployments remained at rook-version=v1.8.10.
The API server returns the issues posted in the log below.
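One way to see which daemons have picked up the new version (a sketch, assuming the default rook_cluster=rook-ceph label that Rook puts on its deployments):
kubectl -n rook-ceph get deployments -l rook_cluster=rook-ceph \
  -o jsonpath='{range .items[*]}{.metadata.name}{" rook-version="}{.metadata.labels.rook-version}{"\n"}{end}'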
Is this a bug report or feature request?
Deviation from expected behavior:
Expected behavior:
Operator starts normally
How to reproduce it (minimal and precise):
Follow the steps from https://rook.io/docs/rook/v1.9/ceph-upgrade.html#csi-version to upgrade from 1.8 to 1.9 (simple install).
With the env flag added to the operator:
ROOK_DISABLE_ADMISSION_CONTROLLER=true
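(A sketch of how that flag is typically set, as an env entry on the rook-ceph-operator container in operator.yaml; names follow the stock manifests:)
env:
- name: ROOK_DISABLE_ADMISSION_CONTROLLER
  value: "true"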
File(s) to submit:
cluster.txt
Logs to submit:
There are also logs from the API server with timeouts when getting resources, but rbac
Cluster Status to submit:
Inside the rook-ceph-tools pod:
but from the command line:
kubectl rook-ceph ceph status
Error: . failed to run command. unable to upgrade connection: container not found ("rook-ceph-operator")%!(EXTRA string=failed to get rook version)
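If the kubectl rook-ceph plugin can't reach the operator, running the status check directly in the toolbox usually still works (assuming the default toolbox deployment name):
kubectl -n rook-ceph exec -it deploy/rook-ceph-tools -- ceph status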
Environment:
Rook version (rook version inside a Rook pod): rook: v1.9.13 / go: go1.17.13
Ceph version (ceph -v): the operator pod says ceph version 16.2.10 (45fa1a083152e41a408d15505f594ec5f1b4fe17) pacific (stable), but the daemons are still on 15.2
Kubernetes version (kubectl version): 1.26