Fresh install of tobs fails with promscale db password #562

Closed
lenaxia opened this issue Aug 24, 2022 · 3 comments · Fixed by #568
Comments

lenaxia commented Aug 24, 2022

What did you do?
This is a fresh install of tobs into a namespace using helm and fluxcd.

https://github.com/lenaxia/k3s-ops-dev/blob/main/components/apps/base/monitoring/tobs/helm-release.yaml
https://github.com/lenaxia/k3s-ops-dev/blob/main/components/apps/dev/tobs-values.yaml

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: tobs
  namespace: monitoring
spec:
  chart:
    spec:
      version: "12.0.1"
  install:
    createNamespace: true
    remediation:
      retries: 10
  upgrade:
    remediation:
      retries: 10
  values:
    opentelemetry-operator:
      enabled: false

    promscale:
      enabled: true
      #image: timescale/promscale:0.8.0
      service:
        type: LoadBalancer

    timescaledb-single:
      enabled: true
      replicaCount: 1
      loadBalancer:
        enabled: true
      persistentVolumes:
        data:
          size: 11Gi
        wal:
          size: 5Gi
      backup:
        enabled: false
      #env:
      #  PGBACKREST_REPO1_S3_BUCKET
      #  PGBACKREST_REPO1_S3_ENDPOINT
      #  PGBACKREST_REPO1_S3_REGION
      #  PGBACKREST_REPO1_S3_KEY
      #  PGBACKREST_REPO1_S3_KEY_SECRET

    kube-prometheus-stack:
      enabled: true

      alertManager:
        enabled: true
        alertmanagerSpec:
          replicas: 1

      grafana:
        enabled: true

        prometheus:
          datasource:
            enabled: true
        timescale:
          datasource:
            enabled: true

        adminPassword: SOME_PASSWORD_HERE

        ingress:
          enabled: true
          ingressClassName: "traefik"
          annotations:
            hajimari.io/enable: "true"
            hajimari.io/icon: "mdiPlayNetwork"
            #cert-manager.io/cluster-issuer: "letsencrypt-staging"
            cert-manager.io/cluster-issuer: "ca-issuer"
            traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
          hosts:
            - &hostGrafana "grafana.${SECRET_DEV_DOMAIN}"
          tls:
            - hosts:
                - *hostGrafana
              secretName: *hostGrafana

      prometheus:
        prometheusSpec:
          replicas: 1
          scrapeInterval: 1m
          scrapeTimeout: 10s
          evaluationInterval: 1m
          retention: 1d
          storageSpec:
            volumeClaimTemplate:
              spec:
                resources:
                  requests:
                    storage: 3Gi

        ingress:
          enabled: true
          ingressClassName: "traefik"
          annotations:
            #cert-manager.io/cluster-issuer: "letsencrypt-staging"
            cert-manager.io/cluster-issuer: "ca-issuer"
            traefik.ingress.kubernetes.io/router.entrypoints: "websecure"
          hosts:
            - &hostProm "prometheus.${SECRET_DEV_DOMAIN}"
          tls:
            - hosts:
                - *hostProm
              secretName: *hostProm

pod/tobs-promscale ends up in a crash loop, unable to connect to the TimescaleDB:

level=error ts=2022-08-24T01:36:19.275Z caller=runner.go:116 msg="aborting startup due to error" err="failed to connect to `host=tobs.monitoring.svc user=postgres database=postgres`: server error (FATAL: password authentication failed for user \"postgres\" (SQLSTATE 28P01))"
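
A rough way to check whether the database actually accepts the password promscale is using is to decode the secret and replay the connection by hand. This is only a sketch: the secret name and keys are the ones shown later in this report, but the tobs-timescaledb-0 pod name is an assumption (timescaledb-single defaults), and psql must be available in that image.

# decode the password promscale is given (secret name/key as shown further down)
PGPASSWORD=$(kubectl get secret tobs-promscale -n monitoring \
  -o jsonpath='{.data.PROMSCALE_DB_PASSWORD}' | base64 -d)

# try the same credentials by hand against the service promscale connects to
kubectl exec -it tobs-timescaledb-0 -n monitoring -- \
  env PGPASSWORD="$PGPASSWORD" psql -U postgres -h tobs.monitoring.svc -d postgres -c 'SELECT 1;'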

Did you expect to see something different?

tobs should've installed without issue

Environment

  • tobs version:
spec:
  chart:
    spec:
      version: "12.0.1"
  • Kubernetes version information:

    kubectl version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.7+k3s1", GitCommit:"ac70570999c566ac3507d2cc17369bb0629c1cc0", GitTreeState:"clean", BuildDate:"2021-11-29T16:40:13Z", GoVersion:"go1.16.10", Compiler:"gc", Platform:"linux/amd64"}
WARNING: version difference between client (1.23) and server (1.21) exceeds the supported minor version skew of +/-1
  • Kubernetes cluster kind:

K3s installed via Flux:

flux version

helm-controller: v0.18.1
kustomize-controller: v0.22.1
notification-controller: v0.23.1
source-controller: v0.22.2

flux check

► checking prerequisites
✗ flux 0.28.2 <0.32.0 (new version is available, please upgrade)
✔ Kubernetes 1.21.7+k3s1 >=1.20.6-0
► checking controllers
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v0.22.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v0.23.1
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.18.1
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v0.22.2
✔ all checks passed
  • tobs Logs:

kcl tobs-promscale-788c855fc5-59rqv -n monitoring

level=error ts=2022-08-24T01:36:19.275Z caller=runner.go:116 msg="aborting startup due to error" err="failed to connect to `host=tobs.monitoring.svc user=postgres database=postgres`: server error (FATAL: password authentication failed for user \"postgres\" (SQLSTATE 28P01))"

kc get secret tobs-credentials -n monitoring -o yaml

apiVersion: v1
data:
  PATRONI_REPLICATION_PASSWORD: bFZVRDJkRE9uY05UZm4wVA==
  PATRONI_SUPERUSER_PASSWORD: RXB5VlByYzc2NE15MVQyRg==
  PATRONI_admin_PASSWORD: TGVBSDlKS2lyOUhOVTFqNA==
kind: Secret
metadata:
  annotations:
    helm.sh/hook: pre-install,post-delete
    helm.sh/hook-weight: "0"
    helm.sh/resource-policy: keep
  creationTimestamp: "2022-08-24T01:32:09Z"
  labels:
    app: tobs-timescaledb
    cluster-name: tobs
  name: tobs-credentials
  namespace: monitoring
  resourceVersion: "22929956"
  uid: 2cbf2b64-c369-405f-8472-f85ac5ef289d
type: Opaque

echo RXB5VlByYzc2NE15MVQyRg== | base64 -d

EpyVPrc764My1T2F

kubectl describe deploy tobs-promscale -n monitoring

Name:               tobs-promscale
Namespace:          monitoring
CreationTimestamp:  Wed, 24 Aug 2022 01:32:22 +0000
Labels:             app=tobs-promscale
                    app.kubernetes.io/component=connector
                    app.kubernetes.io/managed-by=Helm
                    app.kubernetes.io/name=tobs-promscale
                    app.kubernetes.io/version=0.13.0
                    chart=promscale-0.13.0
                    helm.toolkit.fluxcd.io/name=tobs
                    helm.toolkit.fluxcd.io/namespace=monitoring
                    heritage=Helm
                    release=tobs
Annotations:        deployment.kubernetes.io/revision: 1
                    meta.helm.sh/release-name: tobs
                    meta.helm.sh/release-namespace: monitoring
Selector:           app=tobs-promscale,release=tobs
Replicas:           1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType:       Recreate
MinReadySeconds:    0
Pod Template:
  Labels:           app=tobs-promscale
                    app.kubernetes.io/component=connector
                    app.kubernetes.io/name=tobs-promscale
                    app.kubernetes.io/version=0.13.0
                    chart=promscale-0.13.0
                    heritage=Helm
                    release=tobs
  Annotations:      checksum/config: a1171a41877cc559fe699480d7c9bc731055fde6ccbe0b47e5c9a279cfe38962
                    checksum/connection: d610b61926215912316a5f9c07435dd69b06894ed8e640bbd7c2bc21c51a16fa
                    prometheus.io/path: /metrics
                    prometheus.io/port: 9201
                    prometheus.io/scrape: false
  Service Account:  tobs-promscale
  Containers:
   promscale:
    Image:       timescale/promscale:0.13.0
    Ports:       9201/TCP, 9202/TCP
    Host Ports:  0/TCP, 0/TCP
    Args:
      -config=/etc/promscale/config.yaml
      --metrics.high-availability=true
    Requests:
      cpu:      30m
      memory:   500Mi
    Readiness:  http-get http://:metrics-port/healthz delay=0s timeout=15s period=15s #success=1 #failure=3
    Environment Variables from:
      tobs-promscale  Secret  Optional: false
    Environment:
      TOBS_TELEMETRY_INSTALLED_BY:         promscale
      TOBS_TELEMETRY_VERSION:              0.13.0
      TOBS_TELEMETRY_INSTALLED_BY:         helm
      TOBS_TELEMETRY_VERSION:              0.13.0
      TOBS_TELEMETRY_TRACING_ENABLED:      true
      TOBS_TELEMETRY_TIMESCALEDB_ENABLED:  true
    Mounts:
      /etc/promscale/ from configs (rw)
  Volumes:
   configs:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      tobs-promscale
    Optional:  false
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      False   MinimumReplicasUnavailable
  Progressing    True    ReplicaSetUpdated
OldReplicaSets:  <none>
NewReplicaSet:   tobs-promscale-788c855fc5 (1/1 replicas created)
Events:
  Type    Reason             Age   From                   Message
  ----    ------             ----  ----                   -------
  Normal  ScalingReplicaSet  7m    deployment-controller  Scaled up replica set tobs-promscale-788c855fc5 to 1

kc get secret tobs-promscale -n monitoring -o yaml

apiVersion: v1
data:
  PROMSCALE_DB_HOST: dG9icy5tb25pdG9yaW5nLnN2Yw==
  PROMSCALE_DB_NAME: cG9zdGdyZXM=
  PROMSCALE_DB_PASSWORD: RXB5VlByYzc2NE15MVQyRg==
  PROMSCALE_DB_PORT: NTQzMg==
  PROMSCALE_DB_SSL_MODE: cmVxdWlyZQ==
  PROMSCALE_DB_USER: cG9zdGdyZXM=
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"PROMSCALE_DB_HOST":"dG9icy5tb25pdG9yaW5nLnN2Yw==","PROMSCALE_DB_NAME":"cG9zdGdyZXM=","PROMSCALE_DB_PASSWORD":"RXB5VlByYzc2NE15MVQyRg==","PROMSCALE_DB_PORT":"NTQzMg==","PROMSCALE_DB_SSL_MODE":"cmVxdWlyZQ==","PROMSCALE_DB_USER":"cG9zdGdyZXM="},"kind":"Secret","metadata":{"annotations":{"meta.helm.sh/release-name":"tobs","meta.helm.sh/release-namespace":"monitoring"},"creationTimestamp":"2022-08-24T01:32:18Z","labels":{"app":"tobs-promscale","app.kubernetes.io/component":"connector","app.kubernetes.io/managed-by":"Helm","app.kubernetes.io/name":"tobs-promscale","app.kubernetes.io/version":"0.13.0","chart":"promscale-0.13.0","helm.toolkit.fluxcd.io/name":"tobs","helm.toolkit.fluxcd.io/namespace":"monitoring","heritage":"Helm","release":"tobs"},"name":"tobs-promscale","namespace":"monitoring","resourceVersion":"22930073","uid":"eb651b96-5c5e-4c79-bfc0-64462bbd0b72"},"type":"Opaque"}
    meta.helm.sh/release-name: tobs
    meta.helm.sh/release-namespace: monitoring
  creationTimestamp: "2022-08-24T01:32:18Z"
  labels:
    app: tobs-promscale
    app.kubernetes.io/component: connector
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: tobs-promscale
    app.kubernetes.io/version: 0.13.0
    chart: promscale-0.13.0
    helm.toolkit.fluxcd.io/name: tobs
    helm.toolkit.fluxcd.io/namespace: monitoring
    heritage: Helm
    release: tobs
  name: tobs-promscale
  namespace: monitoring
  resourceVersion: "22930681"
  uid: eb651b96-5c5e-4c79-bfc0-64462bbd0b72
type: Opaque

echo RXB5VlByYzc2NE15MVQyRg== | base64 -d

EpyVPrc764My1T2F
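
In other words, both secrets decode to the same value, so the password handed to promscale is identical to the Patroni superuser password. A quick bash check using the two secrets above:

diff \
  <(kubectl get secret tobs-credentials -n monitoring -o jsonpath='{.data.PATRONI_SUPERUSER_PASSWORD}' | base64 -d) \
  <(kubectl get secret tobs-promscale -n monitoring -o jsonpath='{.data.PROMSCALE_DB_PASSWORD}' | base64 -d) \
  && echo "passwords match"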

Anything else we need to know?:

Installing tobs seems to be really unstable, especially with OpenTelemetry enabled. I've gotten it to install okay once or twice, but shortly thereafter it becomes unhealthy, and now it won't even install anymore.

lenaxia added the question (Further information is requested) label Aug 24, 2022
nhudson (Contributor) commented Aug 24, 2022

Thanks for the report. Can you try the same, but with the latest version of the Helm chart? It's 14.0.0. Thanks!

lenaxia (Author) commented Aug 24, 2022

I think what is happening (I'll verify again later) is that if a tobs install fails for any reason (context deadline, node networking, etc.), it leaves the namespace in a state that breaks any future install. The entire namespace has to be completely blown away before a new install can succeed. I saw this suggested here and there in response to other people's problems, but it was never called out as a standard step to take.

I think the instructions should state this clearly, or at least include a troubleshooting section that describes common recovery steps, especially because not everything gets removed during an uninstall (in the rare case an uninstall manages to succeed at all).

Specifically, from a Flux standpoint, the Flux HelmRelease needs to be deleted:

flux delete hr tobs -n monitoring

Then the Helm release needs to be deleted:

helm delete tobs -n monitoring

Then the entire namespace needs to be deleted, forcibly, because some of the components often get stuck during uninstall and deletion:

kubectl delete ns monitoring --force

However, often that isn't enough. You must list the pods in the namespace and delete them individually, and sometimes the secrets too:

kubectl get pods -n monitoring
kubectl delete pod <podname> -n monitoring --force
kubectl delete secret <secrets> -n monitoring --force

Then delete the namespace again:

kubectl delete ns monitoring --force
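
If the namespace then hangs in Terminating (which is usually caused by leftover finalizers rather than the pods themselves), a sketch of clearing the namespace finalizers, assuming jq is installed:

kubectl get ns monitoring -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw /api/v1/namespaces/monitoring/finalize -f -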

Because of this, I think tobs should also come with a recommendation or default to being installed in its own namespace.

Also, from a Helm and Flux standpoint, timeouts should be set to 15m or more, and there are some common errors that should be tolerated at least for the duration of the timeout, such as the tobs-promscale Postgres error:

kubectl logs tobs-promscale

level=error ts=2022-08-24T01:36:19.275Z caller=runner.go:116 msg="aborting startup due to error" err="failed to connect to `host=tobs.monitoring.svc user=postgres database=postgres`: server error (FATAL: password authentication failed for user \"postgres\" (SQLSTATE 28P01))"

This one seems either to resolve itself after a while or to succeed randomly during an install; I'm not quite sure yet.
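
For reference, a sketch of what that would look like in the Flux HelmRelease above (spec.timeout and the remediation retries are standard helm.toolkit.fluxcd.io/v2beta1 fields; the values are illustrative):

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: tobs
  namespace: monitoring
spec:
  timeout: 15m            # allow timescaledb to come up before promscale is judged failed
  install:
    remediation:
      retries: 10         # retry instead of marking the release failed on transient startup errors
  upgrade:
    remediation:
      retries: 10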

If I have time I'll do a pull request to add some documentation.

nhudson (Contributor) commented Aug 24, 2022

Ah yes, we are aware of some of these issues already, namely cleaning up the namespace when you helm delete the tobs installation. That is currently being addressed in #365.

While you can install tobs in any namespace you like, the artifacts left behind make it a bit difficult to uninstall. I agree better documentation is needed; there are several open issues that we are already working through:

#312
#232
#539
#476

Also, from a helm and flux standpoint, timeouts should be set to 15m+, and there are some common errors that should be ignored at least for the timeout period such as the tobs-promscale postgres error

Yes, it is recommended to set a timeout of 15m; we currently do this in our testing suite. Better documentation is needed and will be addressed. Thanks for the update!
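
For a plain Helm install, that corresponds to something along the lines of (assuming the chart is pulled from the timescale Helm repository):

helm install tobs timescale/tobs -n monitoring --create-namespace --wait --timeout 15m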

nhudson self-assigned this Aug 24, 2022
nhudson mentioned this issue Aug 29, 2022