Run operator and cluster under istio (maistra) mesh - either webhook or status is unavailable #1280
Replies: 7 comments 2 replies
-
Sorry for the late response; I don't know if this helps, but my understanding is that the operator cannot be reached from the clusters, right? I can see that because the webhooks can't be reached, which is an important thing, and the operator will also need to reach the pods, because it tries to reach the cluster over the network. With that said, you can always set the same annotations on the operator deployment - can you check whether that works for you? On the other hand, I'm converting this into a discussion, since it doesn't look like an issue with the operator.
Best Regards!
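For illustration, applying the same Istio annotations to the operator's pod template could be done with a merge patch along these lines. This is a minimal sketch, assuming the default cnpg-controller-manager deployment name in cnpg-system and reusing the annotation values quoted later in this thread; it is not a confirmed fix:

```yaml
# operator-istio-patch.yaml - merge patch for the operator Deployment, e.g.:
#   kubectl -n cnpg-system patch deployment cnpg-controller-manager \
#     --type merge --patch-file operator-istio-patch.yaml
spec:
  template:
    metadata:
      annotations:
        # Same annotations the reporter set on the cluster pods (see below)
        sidecar.istio.io/inject: "true"
        proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```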
-
Hi @sxd
-
This is more of an Istio question than a CNPG one - it's about configuring the mesh network so that pods can talk to each other across namespaces, and sadly we're not Istio experts at that level. I can tell that it's a network issue because I've seen the same thing with NetworkPolicies; other users fixed it, but they didn't document how. If you find a solution, please let us know and we can add the fix to the Q&A.
Best Regards!
-
I'm running into this issue as well when bootstrapping with initdb. When I have come across this in the past, it was because the pod was trying to communicate with the operator before the proxy sidecar was ready, which causes the "failed calling webhook" error that @alishchytovych reported. We would get around it by setting annotations on the pods so the application waits for the proxy (see the sketch below). Providing support for setting annotations/labels on the bootstrap pods would give the flexibility needed to get around issues like this.
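For context, the annotation involved is the one quoted in the original report; on a pod template it looks roughly like this. The values are taken from the report further down, but applying them to the bootstrap pods is exactly what is not supported today:

```yaml
# Hypothetical pod-template metadata: CNPG does not currently expose
# annotations/labels for the bootstrap pods, which is the ask here.
metadata:
  annotations:
    sidecar.istio.io/inject: "true"
    # Make the app container wait until the Envoy sidecar is ready,
    # so the instance can reach the operator before the mesh drops traffic.
    proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```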
-
Hi! While working on service configuration and enabling Istio with CNPG, I realized that I have a very similar issue to the one mentioned. My setup:
From what I found in the cloudnative-pg/cloudnative-pg code, requests to pods are made using the pod IP, which is problematic with Istio because Envoy controls the pod's ingress and tries to enforce mTLS connections (I'm using a STRICT policy in this case). After enabling access logs in Istio, I found that Envoy rejects the connection to the PG pod with this message (connection made from a curl pod in the app namespace):

```
~ $ curl http://100.96.6.63:8000/pg/status -v
* processing: http://100.96.6.63:8000/pg/status
*   Trying 100.96.6.63:8000...
* Connected to 100.96.6.63 (100.96.6.63) port 8000
> GET /pg/status HTTP/1.1
> Host: 100.96.6.63:8000
> User-Agent: curl/8.2.0
> Accept: */*
>
< HTTP/1.1 503 Service Unavailable
< content-length: 95
< content-type: text/plain
< date: Mon, 24 Jul 2023 14:42:11 GMT
< server: envoy
<
* Connection #0 to host 100.96.6.63 left intact
upstream connect error or disconnect/reset before headers. reset reason: connection termination
~ $
```

And in the logs, the message suggests that it's a problem to even create the connection to port 8000:

```json
{
  "authority": "100.96.6.63:8000",
  "requested_server_name": null,
  "connection_termination_details": null,
  "upstream_host": "100.96.6.63:8000",
  "upstream_cluster": "PassthroughCluster",
  "x_forwarded_for": null,
  "upstream_local_address": "100.96.5.131:52770",
  "protocol": "HTTP/1.1",
  "user_agent": "curl/8.2.0",
  "upstream_service_time": null,
  "upstream_transport_failure_reason": null,
  "route_name": "allow_any",
  "bytes_received": 0,
  "duration": 1,
  "downstream_local_address": "100.96.6.63:8000",
  "bytes_sent": 95,
  "path": "/pg/status",
  "response_flags": "UC",
  "start_time": "2023-07-24T14:42:11.344Z",
  "response_code_details": "upstream_reset_before_response_started{connection_termination}",
  "method": "GET",
  "response_code": 503,
  "downstream_remote_address": "100.96.5.131:52762",
  "request_id": "1be2486c-89f8-4d9b-a1ad-d0387d850ea6"
}
```
So the first issue for me while using Istio is IMO the use of the pod IP in the first place (which is the case in more than one place). IMO, to make CNPG work in the situation mentioned, the operator needs to change this or support connections made through some sort of Service (like a headless Service) pointing at single pods. As service meshes are focused mostly on Services and inter-service networking, they start to struggle when connections go to workloads without any sort of Service in between. So after setting up a headless Service like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: pg-cluster-1-hs
spec:
  selector:
    cnpg.io/instanceName: pg-cluster-1 # taken from the running pod's description
  clusterIP: None
  ports:
    - name: status
      port: 8000
      targetPort: 8000
    - name: postgres
      port: 5432
      targetPort: 5432
```

I could reach the /pg/status endpoint from inside the app namespace, and from cnpg-system as well:

```
PS C:\Users\krzyzt\GIT\helm\deployments\app> kubectl run -i --namespace cnpg-system --tty curl --image=curlimages/curl:latest -- sh
If you don't see a command prompt, try pressing enter.
~ $ curl pg-cluster-1-hs:8000/pg/status
curl: (6) Could not resolve host: pg-cluster-1-hs
~ $ curl pg-cluster-1-hs.app.svc.cluster.local:8000/pg/status
{"currentLsn":"0/AB000000","systemID":"7258248197917290525","isPrimary":true,"replayPaused":false,"pendingRestart":false,"pendingRestartForDecrease":false,"isWalReceiverActive":false,"node":"","pod":{"metadata":{"name":"pg-cluster-1","creationTimestamp":null},"spec":{"containers":null},"status":{}},"isPgRewindRunning":false,"totalInstanceSize":"32 MB","mightBeUnavailable":false,"lastArchivedWAL":"0000000100000000000000AA","lastArchivedWALTime":"2023-07-24T15:06:33.582965Z","lastFailedWALTime":"-infinity","isArchivingWAL":true,"currentWAL":"0000000100000000000000AA","timeLineID":1,"isPodReady":false,"executableHash":"76ceae159df15258568e3c48687e90c35aecda372e23e4ab1ef2a7eb6a2fe950","isInstanceManagerUpgrading":false,"instanceManagerVersion":"1.20.1","instanceArch":"amd64"}
~ $
```

Best regards! FYI @sxd

Edit: The headless Service actually makes the operator able to manage the cluster nodes without any extra operator code changes - with the headless Service in place, the connections to pod IPs work, as the headless Service changes the routing on the proxies, which eventually accept the connections. So to summarize:
-
You may also need to use e.g.
-
There are two requirements for CNPG operator to run on the mesh:
-
I'm trying to run the CNPG operator (1.18.1) and a cluster under the Istio (Maistra 2.2) service mesh, and I have a major issue.

If cnpg-system is a member of the Maistra mesh (has maistra.io/member-of: istio-system), no cluster can be created; it fails with the error:

```
failed calling webhook "vpooler.kb.io": failed to call webhook: Post "https://cnpg-webhook-service.cnpg-system.svc:443/validate-postgresql-cnpg-io-v1-pooler?timeout=10s": dial tcp 10.130.3.18:9443: i/o timeout
```

If cnpg-system is not a member of the service mesh, the operator can't get the cluster status, with errors like this:

```
"msg":"Cannot extract Pod status","controller":"cluster","controllerGroup":"postgresql.cnpg.io","controllerKind":"Cluster","Cluster":{"name":"pg","namespace":"workload"},"namespace":"workload","name":"pg","reconcileID":"ba85f1e4-b2c2-452e-a324-3472e7d33a9e","uuid":"e6e96622-84a6-11ed-96f7-0a580a8202ff","name":"pg-1","error":"Get \"http://10.128.2.214:8000/pg/status\": dial tcp 10.128.2.214:8000: i/o timeout"
```

In the cluster definition, the server instances and the pooler have annotations, and these annotations are properly inherited by the corresponding pods:

```yaml
annotations:
  sidecar.istio.io/inject: 'true'
  proxy.istio.io/config: '{ "holdApplicationUntilProxyStarts": true }'
```

I've tried using `traffic.sidecar.istio.io/excludeInboundPorts: "8000"` - it doesn't help.
I've tried creating a ServiceEntry, a VirtualService, and a DestinationRule for the webhook - it doesn't help.
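For reference, the DestinationRule variant of that attempt would look roughly like the sketch below, disabling mTLS origination toward the webhook service. The resource name is an assumption, the exact manifests tried are not shown in the thread, and per the above this reportedly did not help:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: cnpg-webhook-service   # assumed name
  namespace: cnpg-system
spec:
  host: cnpg-webhook-service.cnpg-system.svc.cluster.local
  trafficPolicy:
    tls:
      # Don't originate Istio mTLS toward the webhook service;
      # the webhook terminates its own TLS on port 9443.
      mode: DISABLE
```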
Any ideas on how to make this fully work?