
Release "prometheus-operator" failed: rpc error: code = Canceled #6130

Closed
rnkhouse opened this issue Jul 31, 2019 · 71 comments
Closed

Release "prometheus-operator" failed: rpc error: code = Canceled #6130

rnkhouse opened this issue Jul 31, 2019 · 71 comments

Comments

@rnkhouse
Copy link

Describe the bug
When I try to install the Prometheus Operator on AKS with helm install stable/prometheus-operator --name prometheus-operator -f prometheus-operator-values.yaml, I get this error:

Release "prometheus-operator" failed: rpc error: code = Canceled

I checked the release history:

helm history prometheus-operator -o yaml
- chart: prometheus-operator-6.3.0
  description: 'Release "prometheus-operator" failed: rpc error: code = Canceled desc
    = grpc: the client connection is closing'
  revision: 1
  status: FAILED
  updated: Tue Jul 30 12:36:52 2019

Chart
[stable/prometheus-operator]

Additional Info
I am using the following commands before deploying the chart:

kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheus.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheusrule.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/servicemonitor.crd.yaml

In the values file, createCustomResource is set to false.

Output of helm version:
Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}

Output of kubectl version:
Client Version: version.Info{Major:"1", Minor:"10", GitVersion:"v1.10.4", GitCommit:"5ca598b4ba5abb89bb773071ce452e33fb66339d", GitTreeState:"clean", BuildDate:"2018-06-06T08:13:03Z", GoVersion:"go1.9.3", Compiler:"gc", Platform:"windows/amd64"}
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.7", GitCommit:"4683545293d792934a7a7e12f2cc47d20b2dd01b", GitTreeState:"clean", BuildDate:"2019-06-06T01:39:30Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

Cloud Provider/Platform (AKS, GKE, Minikube etc.):
AKS

@janvdvegt

We have the same issue on minikube, so it does not seem to be specific to AKS.

@robinelfrink

We have the same issue on kubespray-deployed clusters.

@DLV111

DLV111 commented Sep 2, 2019

I'm also seeing the issue on both k8s 1.12.x and 1.13.x kubespray-deployed clusters in our automated pipeline, with a 100% failure rate. The previous version of prometheus-operator (0.30.1) works without issues.
The funny thing is that if I run the command manually instead of via the CD pipeline it works, so I'm a little confused as to what the cause would be.
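
One variable worth ruling out in a CD pipeline is the client-side wait timeout; this is only a guess, not a confirmed fix. Helm 2's --timeout is in seconds (default 300) and also applies to hook jobs:

helm install stable/prometheus-operator --name prometheus-operator --timeout 600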

@DLV111

DLV111 commented Sep 2, 2019

Saw there was an update to the prometheus chart today. I bumped it to

NAME                            CHART VERSION   APP VERSION
stable/prometheus-operator      6.8.0           0.32.0     

and I'm no longer seeing the issue.

@hickeyma
Contributor

hickeyma commented Sep 2, 2019

@rnkhouse Can you check with the latest chart version as mentioned by @dlevene1 in #6130 (comment)?

@PaulusTM

PaulusTM commented Sep 2, 2019

I have this same issue with version 6.8.1 on AKS.

NAME                      	CHART VERSION	APP VERSION
stable/prometheus-operator	6.8.1        	0.32.0
❯ helm version 
Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
 ❯ helm install -f prd.yaml --name prometheus --namespace monitoring stable/prometheus-operator 
Error: release prometheus failed: grpc: the client connection is closing
>>> elapsed time 1m56s

@zarvd

zarvd commented Sep 4, 2019

We have the same issue on kubespray-deployed clusters.

Kubernetes version: v1.4.1
Helm version:

Client: &version.Version{SemVer:"v2.14.3", GitCommit:"0e7f3b6637f7af8fcfddb3d2941fcc7cbebb0085", GitTreeState:"clean"}
Server: &version.Version{SemVer:"v2.14.0", GitCommit:"05811b84a3f93603dd6c2fcfe57944dfa7ab7fd0", GitTreeState:"clean"}

Prometheus-operator version:

NAME                            CHART VERSION   APP VERSION
stable/prometheus-operator      6.8.1           0.32.0  

@will-beta

I have the same issue on AKS.

@bacongobbler
Member

Can anyone reproduce this issue in Helm 3, or does it propagate as a different error? My assumption is that with the removal of tiller this should no longer be an issue.

@will-beta

@bacongobbler This is still an issue in Helm 3.

bash$ helm install r-prometheus-operator stable/prometheus-operator --version 6.8.2 -f prometheus-operator/helm/prometheus-operator.yaml

manifest_sorter.go:179: info: skipping unknown hook: "crd-install"
Error: apiVersion "monitoring.coreos.com/v1" in prometheus-operator/templates/exporters/kube-controller-manager/servicemonitor.yaml is not available

@bacongobbler
Member

bacongobbler commented Sep 7, 2019

That seems to be a different issue from the one raised by the OP, though.

description: 'Release "prometheus-operator" failed: rpc error: code = Canceled desc
= grpc: the client connection is closing'

Can you check and see if you're using the latest beta release as well? That error was seemingly addressed in #6332, which was released in 3.0.0-beta.3. If not, can you open a new issue?

@will-beta

@bacongobbler I'm using the latest Helm v3.0.0-beta.3.

@ghost

ghost commented Sep 8, 2019

I had to go back to --version 6.7.3 to get it to install properly.

@robinelfrink

Our workaround is to keep the prometheus-operator image on v0.31.1.
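
For anyone trying the same pin via the chart, something like this should work; note that the values key prometheusOperator.image.tag is an assumption here, so check your chart version's values.yaml for the exact name:

helm upgrade prometheus-operator stable/prometheus-operator --set prometheusOperator.image.tag=v0.31.1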

@pyadminn

pyadminn commented Sep 10, 2019

helm.log
Also just encountered this issue on a Docker EE Kubernetes install.

After some fiddling with install options, --debug and such, I am now getting:

Error: release prom failed: context canceled

Edit: May try updating my helm version, currently at v2.12.3
Edit 2: Updated to 2.14.3 and still problematic:
grpc: the client connection is closing
Edit 3: Installed version 6.7.3 per the above suggestions to get things going again
Edit 4: Attached the tiller log for a failed install as helm.log

related: helm/charts#15977

@vsliouniaev

vsliouniaev commented Sep 12, 2019

After doing some digging with @cyp3d, it appears that the issue could be caused by a helm delete timeout that's too short for some clusters. I cannot reproduce the issue anywhere, so if someone who is experiencing this could validate a potential fix in the linked pull request branch, I would much appreciate it!

helm/charts#17090
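
For anyone willing to test it: one way to install the chart straight from the PR branch is to fetch GitHub's PR ref into a local checkout and install from the path (a sketch; the release name prom-test and the local branch name are placeholders):

git clone https://github.com/helm/charts.git && cd charts
git fetch origin pull/17090/head:pr-17090 && git checkout pr-17090
helm dependency update stable/prometheus-operator
helm install ./stable/prometheus-operator --name prom-test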

@xvzf
Contributor

xvzf commented Sep 13, 2019

Same here on several clusters created with kops on AWS.
No issues when running on K3s, though.

@vsliouniaev

@xvzf

Could you try the potential fix in this PR? helm/charts#17090

@pyadminn

I gave the PR a run-through and still get the same Error: release prom failed: context canceled
tiller.log

@xvzf
Contributor

xvzf commented Sep 13, 2019

@vsliouniaev Nope, does not fix the issue here

@vsliouniaev

Thanks for checking @xvzf and @pyadminn. I have made another change in the same PR. Could you see if this helps?

@pyadminn

pyadminn commented Sep 16, 2019

Just checked the updated PR; still seeing the following on our infra: Error: release prom failed: rpc error: code = Canceled desc = grpc: the client connection is closing

FYI we are on Kubernetes 1.14.3
Helm v2.14.3

@quantumhype

quantumhype commented Sep 20, 2019

I was able to get around this issue by following the 'Helm fails to create CRDs' section in readme.md. I'm not sure how they're related, but it worked.

Step 1: Manually create the CRDS

kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheus.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheusrule.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/servicemonitor.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/podmonitor.crd.yaml

Step 2:
Wait for CRDs to be created, which should only take a few seconds

Step 3:
Install the chart, but disable the CRD provisioning by setting prometheusOperator.createCustomResource=false

$ helm install --name my-release stable/prometheus-operator --set prometheusOperator.createCustomResource=false
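
If the short wait in step 2 ever proves flaky in automation, a more deterministic variant is to block until the API server reports each CRD as Established (a sketch using the CRD names from the manifests above):

for crd in alertmanagers podmonitors prometheuses prometheusrules servicemonitors; do
  kubectl wait --for condition=established --timeout=60s crd/${crd}.monitoring.coreos.com
done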

@xvzf
Contributor

xvzf commented Sep 23, 2019

@vsliouniaev Still the same issue! The workaround from lethalwire works, though.

@pyadminn

The lethalwire workaround resolved it for me as well.

@Typositoire

So, 4 days apart, the workaround worked and then stopped working; I had to use the CRD files from 0.32.0, not master.

@waynekhan

I tried this on chart v8.2.4: if prometheusOperator.admissionWebhooks=false, set prometheus.tlsProxy.enabled=false too.

Also, like vsliouniaev said, what do --debug and --dry-run say?
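
For reference, a dry run that renders the templates without touching the cluster looks like this in Helm 2 (matching the versions reported above):

helm install stable/prometheus-operator --name prometheus-operator --debug --dry-run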

@vsliouniaev

@truealex81 Since helm3 is meant to give more information about this, can you please post verbose logs from the install process?

@bacongobbler bacongobbler reopened this Nov 28, 2019
@sschne

sschne commented Nov 29, 2019

I am seeing the same issue deploying 8.2.4 on Azure AKS.

Helm Version:
version.BuildInfo{Version:"v3.0.0", GitCommit:"e29ce2a54e96cd02ccfce88bee4f58bb6e2a28b6", GitTreeState:"clean", GoVersion:"go1.13.4"}

Helm --debug produces this output:

install.go:148: [debug] Original chart version: ""
install.go:165: [debug] CHART PATH: /root/.cache/helm/repository/prometheus-operator-8.2.4.tgz
client.go:87: [debug] creating 1 resource(s)
client.go:87: [debug] creating 1 resource(s)
client.go:87: [debug] creating 1 resource(s)
client.go:87: [debug] creating 1 resource(s)
client.go:87: [debug] creating 1 resource(s)
install.go:139: [debug] Clearing discovery cache
wait.go:51: [debug] beginning wait for 5 resources with timeout of 1m0s
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ServiceAccount
client.go:245: [debug] serviceaccounts "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" PodSecurityPolicy
client.go:245: [debug] podsecuritypolicies.policy "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" RoleBinding
client.go:245: [debug] rolebindings.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" Role
client.go:245: [debug] roles.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRoleBinding
client.go:245: [debug] clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRole
client.go:245: [debug] clusterroles.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission-create" Job
client.go:245: [debug] jobs.batch "prometheus-operator-admission-create" not found
client.go:87: [debug] creating 1 resource(s)
client.go:420: [debug] Watching for changes to Job prometheus-operator-admission-create with timeout of 5m0s
client.go:445: [debug] Add/Modify event for prometheus-operator-admission-create: MODIFIED
client.go:484: [debug] prometheus-operator-admission-create: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:445: [debug] Add/Modify event for prometheus-operator-admission-create: MODIFIED
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ServiceAccount
client.go:220: [debug] Starting delete for "prometheus-operator-admission" PodSecurityPolicy
client.go:220: [debug] Starting delete for "prometheus-operator-admission" RoleBinding
client.go:220: [debug] Starting delete for "prometheus-operator-admission" Role
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRoleBinding
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRole
client.go:220: [debug] Starting delete for "prometheus-operator-admission-create" Job
client.go:87: [debug] creating 120 resource(s)
Error: context canceled

I can reproduce this reliably. If there is a way to get more verbose logs, please let me know and I'll post the output here.

@vsliouniaev

@pather87 thanks a lot!

Here's the order of what's meant to happen in the chart:

  1. CRDs are provisioned
  2. There is a pre-install;pre-upgrade job which runs a container to create a secret with certificates for the admission hooks. This job and its resources are cleaned up on success
  3. All the resources are created
  4. There is a post-install;post-upgrade job that runs a container to patch the created validatingwebhookconfiguration and mutatingwebhookconfiguration with the CA from the certificates created in step 2. This job and its resources are cleaned up on success

Could you please check if you have any failed jobs still present? From the logs it reads like you shouldn't because they were all successful.

Are there any other resources present in the cluster after the Error: context canceled happens?
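
A quick way to answer both questions (a sketch; adjust the namespace to wherever the release went):

kubectl get jobs -n monitoring | grep admission
kubectl get validatingwebhookconfigurations,mutatingwebhookconfigurations | grep prometheus-operator
kubectl get all -n monitoring -l release=prometheus-operator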

@willsilvano

Same here when installing prometheus-operator:

helm install prometheus-operator stable/prometheus-operator \
  --namespace=monitoring \
  --values=values.yaml

Error: rpc error: code = Canceled desc = grpc: the client connection is closing

@sschne

sschne commented Nov 29, 2019

@vsliouniaev thanks for your answer!

  1. There are no jobs lying around after the deployment.
  2. Deployments and services are present in the cluster after the deployment; see kubectl output:

kubectl get all -lrelease=prometheus-operator

NAME                                                     READY   STATUS    RESTARTS   AGE
pod/prometheus-operator-grafana-59d489899-4b5kd          2/2     Running   0          3m56s
pod/prometheus-operator-operator-8549bcd687-4kb2x        2/2     Running   0          3m56s
pod/prometheus-operator-prometheus-node-exporter-4km6x   1/1     Running   0          3m56s
pod/prometheus-operator-prometheus-node-exporter-7dgn6   1/1     Running   0          3m56s

NAME                                                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)            AGE
service/prometheus-operator-alertmanager               ClusterIP   xxx   <none>        9093/TCP           3m57s
service/prometheus-operator-grafana                    ClusterIP   xxx   <none>        80/TCP             3m57s
service/prometheus-operator-operator                   ClusterIP   xxx     <none>        8080/TCP,443/TCP   3m57s
service/prometheus-operator-prometheus                 ClusterIP   xxx   <none>        9090/TCP           3m57s
service/prometheus-operator-prometheus-node-exporter   ClusterIP   xxx    <none>        9100/TCP           3m57s

NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/prometheus-operator-prometheus-node-exporter   2         2         2       2            2           <none>          3m57s

NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/prometheus-operator-grafana    1/1     1            1           3m57s
deployment.apps/prometheus-operator-operator   1/1     1            1           3m57s

NAME                                                      DESIRED   CURRENT   READY   AGE
replicaset.apps/prometheus-operator-grafana-59d489899     1         1         1       3m57s
replicaset.apps/prometheus-operator-operator-8549bcd687   1         1         1       3m57s

NAME                                                             READY   AGE
statefulset.apps/alertmanager-prometheus-operator-alertmanager   1/1     3m44s
statefulset.apps/prometheus-prometheus-operator-prometheus       1/1     3m34s

@willsilvano

Installation with debug:

client.go:87: [debug] creating 1 resource(s)
install.go:126: [debug] CRD alertmanagers.monitoring.coreos.com is already present. Skipping.
client.go:87: [debug] creating 1 resource(s)
install.go:126: [debug] CRD podmonitors.monitoring.coreos.com is already present. Skipping.
client.go:87: [debug] creating 1 resource(s)
install.go:126: [debug] CRD prometheuses.monitoring.coreos.com is already present. Skipping.
client.go:87: [debug] creating 1 resource(s)
install.go:126: [debug] CRD prometheusrules.monitoring.coreos.com is already present. Skipping.
client.go:87: [debug] creating 1 resource(s)
install.go:126: [debug] CRD servicemonitors.monitoring.coreos.com is already present. Skipping.
install.go:139: [debug] Clearing discovery cache
wait.go:51: [debug] beginning wait for 0 resources with timeout of 1m0s
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRoleBinding
client.go:245: [debug] clusterrolebindings.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" RoleBinding
client.go:245: [debug] rolebindings.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRole
client.go:245: [debug] clusterroles.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ServiceAccount
client.go:245: [debug] serviceaccounts "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" PodSecurityPolicy
client.go:245: [debug] podsecuritypolicies.policy "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission" Role
client.go:245: [debug] roles.rbac.authorization.k8s.io "prometheus-operator-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prometheus-operator-admission-create" Job
client.go:245: [debug] jobs.batch "prometheus-operator-admission-create" not found
client.go:87: [debug] creating 1 resource(s)
client.go:420: [debug] Watching for changes to Job prometheus-operator-admission-create with timeout of 5m0s
client.go:445: [debug] Add/Modify event for prometheus-operator-admission-create: MODIFIED
client.go:484: [debug] prometheus-operator-admission-create: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:445: [debug] Add/Modify event for prometheus-operator-admission-create: MODIFIED
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRoleBinding
client.go:220: [debug] Starting delete for "prometheus-operator-admission" RoleBinding
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ClusterRole
client.go:220: [debug] Starting delete for "prometheus-operator-admission" ServiceAccount
client.go:220: [debug] Starting delete for "prometheus-operator-admission" PodSecurityPolicy
client.go:220: [debug] Starting delete for "prometheus-operator-admission" Role
client.go:220: [debug] Starting delete for "prometheus-operator-admission-create" Job
client.go:87: [debug] creating 122 resource(s)
Error: context canceled
helm.go:76: [debug] context canceled

Afterwards, I execute: kubectl get all -lrelease=prometheus-operator -A

NAMESPACE    NAME                                                     READY   STATUS    RESTARTS   AGE
monitoring   pod/prometheus-operator-grafana-d6676b794-r6cg9          2/2     Running   0          2m45s
monitoring   pod/prometheus-operator-operator-6584f4b5f5-wdkrx        2/2     Running   0          2m45s
monitoring   pod/prometheus-operator-prometheus-node-exporter-2g4tg   1/1     Running   0          2m45s
monitoring   pod/prometheus-operator-prometheus-node-exporter-798p5   1/1     Running   0          2m45s
monitoring   pod/prometheus-operator-prometheus-node-exporter-pvk5t   1/1     Running   0          2m45s
monitoring   pod/prometheus-operator-prometheus-node-exporter-r9j2r   1/1     Running   0          2m45s

NAMESPACE     NAME                                                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)            AGE
kube-system   service/prometheus-operator-coredns                    ClusterIP   None           <none>        9153/TCP           2m46s
kube-system   service/prometheus-operator-kube-controller-manager    ClusterIP   None           <none>        10252/TCP          2m46s
kube-system   service/prometheus-operator-kube-etcd                  ClusterIP   None           <none>        2379/TCP           2m46s
kube-system   service/prometheus-operator-kube-proxy                 ClusterIP   None           <none>        10249/TCP          2m46s
kube-system   service/prometheus-operator-kube-scheduler             ClusterIP   None           <none>        10251/TCP          2m46s
monitoring    service/prometheus-operator-alertmanager               ClusterIP   10.0.238.102   <none>        9093/TCP           2m46s
monitoring    service/prometheus-operator-grafana                    ClusterIP   10.0.16.19     <none>        80/TCP             2m46s
monitoring    service/prometheus-operator-operator                   ClusterIP   10.0.97.114    <none>        8080/TCP,443/TCP   2m45s
monitoring    service/prometheus-operator-prometheus                 ClusterIP   10.0.57.153    <none>        9090/TCP           2m46s
monitoring    service/prometheus-operator-prometheus-node-exporter   ClusterIP   10.0.83.30     <none>        9100/TCP           2m46s

NAMESPACE    NAME                                                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
monitoring   daemonset.apps/prometheus-operator-prometheus-node-exporter   4         4         4       4            4           <none>          2m46s

NAMESPACE    NAME                                           READY   UP-TO-DATE   AVAILABLE   AGE
monitoring   deployment.apps/prometheus-operator-grafana    1/1     1            1           2m46s
monitoring   deployment.apps/prometheus-operator-operator   1/1     1            1           2m46s

NAMESPACE    NAME                                                      DESIRED   CURRENT   READY   AGE
monitoring   replicaset.apps/prometheus-operator-grafana-d6676b794     1         1         1       2m46s
monitoring   replicaset.apps/prometheus-operator-operator-6584f4b5f5   1         1         1       2m46s

NAMESPACE    NAME                                                             READY   AGE
monitoring   statefulset.apps/alertmanager-prometheus-operator-alertmanager   1/1     2m40s
monitoring   statefulset.apps/prometheus-prometheus-operator-prometheus       1/1     2m30s

@sschne

sschne commented Nov 29, 2019

What I've also discovered while trying to work around this: the issue persists if I delete the chart and the CRDs afterwards and install the chart again, but it does not persist if I do not delete the CRDs.

I also tried installing the CRDs beforehand and doing a helm install --skip-crds, but the issue still persists. This is somewhat confusing.
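
For completeness, deleting the CRDs between attempts looks like this (names taken from the debug log above); note that this also deletes any custom resources of those kinds:

kubectl delete crd alertmanagers.monitoring.coreos.com podmonitors.monitoring.coreos.com \
  prometheuses.monitoring.coreos.com prometheusrules.monitoring.coreos.com \
  servicemonitors.monitoring.coreos.com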

@vsliouniaev

The next log line I would expect after this is about the post-install,post-upgrade hooks, but it does not appear in your case. I'm not certain what helm is waiting on here:

...
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" RoleBinding
client.go:245: [debug] rolebindings.rbac.authorization.k8s.io "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" Role
client.go:245: [debug] roles.rbac.authorization.k8s.io "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ClusterRole
client.go:245: [debug] clusterroles.rbac.authorization.k8s.io "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ServiceAccount
client.go:245: [debug] serviceaccounts "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ClusterRoleBinding
client.go:245: [debug] clusterrolebindings.rbac.authorization.k8s.io "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" PodSecurityPolicy
client.go:245: [debug] podsecuritypolicies.policy "prom-op-prometheus-operato-admission" not found
client.go:87: [debug] creating 1 resource(s)
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission-patch" Job
client.go:245: [debug] jobs.batch "prom-op-prometheus-operato-admission-patch" not found
client.go:87: [debug] creating 1 resource(s)
client.go:420: [debug] Watching for changes to Job prom-op-prometheus-operato-admission-patch with timeout of 5m0s
client.go:445: [debug] Add/Modify event for prom-op-prometheus-operato-admission-patch: MODIFIED
client.go:484: [debug] prom-op-prometheus-operato-admission-patch: Jobs active: 1, jobs failed: 0, jobs succeeded: 0
client.go:445: [debug] Add/Modify event for prom-op-prometheus-operato-admission-patch: MODIFIED
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" RoleBinding
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" Role
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ClusterRole
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ServiceAccount
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" ClusterRoleBinding
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission" PodSecurityPolicy
client.go:220: [debug] Starting delete for "prom-op-prometheus-operato-admission-patch" Job

@truealex81

Manual CRD creation helps, at least on Azure.
First create the CRDs from this link: https://github.com/coreos/prometheus-operator/tree/release-0.34/example/prometheus-operator-crd
("kubectl create -f alertmanager.crd.yaml" and so on for all files.)
Then:
helm install prometheus-operator stable/prometheus-operator --namespace monitoring --version 8.2.4 --set prometheusOperator.createCustomResource=false
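
Spelling out the "and so on", the full set can be created in one loop (a sketch, assuming the *.crd.yaml file names in that release-0.34 directory):

for f in alertmanager podmonitor prometheus prometheusrule servicemonitor; do
  kubectl create -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.34/example/prometheus-operator-crd/${f}.crd.yaml
done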

@willsilvano

Thanks @truealex81! That works on Azure.

@bierhov

bierhov commented Dec 5, 2019

My env:
k8s 1.11.2, helm 2.13.1, tiller 2.13.1
prometheus-operator-5.5 (APP VERSION 0.29) is OK!

but:
prometheus-operator-8 (APP VERSION 0.32) has the same problem:
"context canceled" or "grpc: the client connection is closing"!

I guess the latest version of prometheus-operator is not compatible?!

@vsliouniaev

@bierhov please can you post the resources in the namespace after a failure?

@bierhov

bierhov commented Dec 5, 2019

Yes!
When I run "helm ls" I can see my prometheus-operator release status is "failed", but the namespace where I installed prometheus-operator has all the prometheus-operator resources.
However, the prometheus web UI can't get any data!

@vsliouniaev

Can you please post the resources though?

@bierhov

bierhov commented Dec 5, 2019

Can you please post the resources though?

Sorry, I can't reproduce it unless I remove my stable helm env and do it again!

@vsliouniaev

@bierhov do you have any failed jobs left after the install?

@bierhov

bierhov commented Dec 5, 2019

@bierhov do you have any failed jobs left after the install?

My k8s version is 1.11.2, helm and tiller version is 2.13.1.
If I install prometheus-operator version 8.x and run "helm ls", the job status is failed,
but if I install prometheus-operator version 5.x and run "helm ls", the job status is deployed!

@zomarg

zomarg commented Dec 12, 2019

Not reproducible using:

Kubernetes version: v1.13.12
Kubectl version: v1.16.2
Helm version: 3.0.1
Prometheus-operator version: 8.3.3

  1. Install CRDs manually:
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/v0.34.0/example/prometheus-operator-crd/prometheus.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/v0.34.0/example/prometheus-operator-crd/prometheusrule.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/v0.34.0/example/prometheus-operator-crd/servicemonitor.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/v0.34.0/example/prometheus-operator-crd/podmonitor.crd.yaml
  2. Configure the operator to not create CRDs, either when installing, using

--set prometheusOperator.createCustomResource=false

or in values.yaml:

prometheusOperator:
  createCustomResource: false

@vsliouniaev

@gramozkrasniqi
What if you don't create CRDs manually? That's one of the workarounds for the issue

@zomarg

zomarg commented Dec 12, 2019

@vsliouniaev if you don't create them you will get the error.
But in the original issue, under Additional Info, @rnkhouse stated that he was creating the CRDs manually.

@alfonzso

We use prometheus-operator in our deployment. In a nutshell, we upgraded prom-op from 6.9.3 to 8.3.3 and it always failed with "Error: context canceled".
We also always install the CRDs before installing/upgrading prometheus-operator, and of course we didn't change or update these CRDs.

I tried to refresh the CRDs that 'github.com/helm/charts/tree/master/stable/prometheus-operator' mentions (like kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml), but these don't exist anymore.
After that I tried the ones from here: https://github.com/helm/charts/tree/master/stable/prometheus-operator/crds
But it failed again.

I almost gave up, but with these CRDs the helm deploy succeeded! yeyyyy
https://github.com/coreos/kube-prometheus/tree/master/manifests/setup

My setup:

Kubernetes version: v1.14.3
Kubectl version: v1.14.2
Helm version: 2.14.3
Prometheus-operator version: 8.3.3

Purge prometheus-operator from k8s!

Then:

kubectl apply -f https://raw.githubusercontent.com/coreos/kube-prometheus/master/manifests/setup/prometheus-operator-0alertmanagerCustomResourceDefinition.yaml   
kubectl apply -f https://raw.githubusercontent.com/coreos/kube-prometheus/master/manifests/setup/prometheus-operator-0podmonitorCustomResourceDefinition.yaml     
kubectl apply -f https://raw.githubusercontent.com/coreos/kube-prometheus/master/manifests/setup/prometheus-operator-0prometheusCustomResourceDefinition.yaml     
kubectl apply -f https://raw.githubusercontent.com/coreos/kube-prometheus/master/manifests/setup/prometheus-operator-0prometheusruleCustomResourceDefinition.yaml 
kubectl apply -f https://raw.githubusercontent.com/coreos/kube-prometheus/master/manifests/setup/prometheus-operator-0servicemonitorCustomResourceDefinition.yaml 
helm upgrade -i prom-op                               \
  --version 8.3.3                                     \
  --set prometheusOperator.createCustomResource=false \
  stable/prometheus-operator

That's all!

@pandvan

pandvan commented Dec 19, 2019

Does this mean that it's necessary to do a clean install and lose historical metrics data?

@truealex81

After upgrading AKS k8s to 1.15.5, helm to 3.0.1 and the prometheus-operator chart to 8.3.3, the problem is gone.

@infa-ddeore

infa-ddeore commented Jan 14, 2020

Our workaround is to keep the prometheus-operator image on v0.31.1.

This worked for me as well on AKS v1.14.8 with helm+tiller v2.16.1, changing the operator image to v0.31.1.

@cocuba

cocuba commented Jan 28, 2020

Manual CRD creation helps, at least on Azure.
First create the CRDs from this link: https://github.com/coreos/prometheus-operator/tree/release-0.34/example/prometheus-operator-crd
("kubectl create -f alertmanager.crd.yaml" and so on for all files.)
Then:
helm install prometheus-operator stable/prometheus-operator --namespace monitoring --version 8.2.4 --set prometheusOperator.createCustomResource=false

This works in Azure Kubernetes, thanks.

@Superset1986

I was able to get around this issue by following the 'Helm fails to create CRDs' section in readme.md. I'm not sure how they're related, but it worked.

Step 1: Manually create the CRDS

kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/alertmanager.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheus.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/prometheusrule.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/servicemonitor.crd.yaml
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/master/example/prometheus-operator-crd/podmonitor.crd.yaml

Step 2:
Wait for CRDs to be created, which should only take a few seconds

Step 3:
Install the chart, but disable the CRD provisioning by setting prometheusOperator.createCustomResource=false

$ helm install --name my-release stable/prometheus-operator --set prometheusOperator.createCustomResource=false

Thanks, this worked for me on an AKS cluster. I had to change the URLs for the CRDs:

kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_alertmanagers.yaml --validate=false
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_podmonitors.yaml --validate=false
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_prometheuses.yaml --validate=false
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_prometheusrules.yaml --validate=false
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml --validate=false
kubectl apply -f https://raw.githubusercontent.com/coreos/prometheus-operator/release-0.37/example/prometheus-operator-crd/monitoring.coreos.com_thanosrulers.yaml --validate=false

helm install stable/prometheus-operator --name prometheus-operator --namespace monitoring --set prometheusOperator.createCustomResource=false

@bacongobbler
Member

Closing. Looks like this has since been resolved, according to the last three commenters. Thanks!
