compute default service account does not have access to the global-game-images artifact registry repo #161

Open
AlexBulankou opened this issue Mar 26, 2023 · 8 comments


@AlexBulankou

After following the demo steps, I noticed that many workloads were initially left uninitialized because the compute default service account (project_number-compute@developer.gserviceaccount.com) could not pull the images: it does not have permission to read from this registry. I fixed it manually, but the IAM binding might be worth including in the Terraform configuration.

@abmarcum
Collaborator

I've created a PR that grants the Compute default service account the Artifact Registry Reader role.
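
For reference, the binding under discussion can be expressed as an entry in an IAM policy's `bindings` list. This is a minimal Python sketch, not the code from the PR; it assumes the role ID is `roles/artifactregistry.reader`, and the project number used below is a placeholder.

```python
import json


def compute_default_sa(project_number: str) -> str:
    """Return the Compute Engine default service account email for a project number."""
    return f"{project_number}-compute@developer.gserviceaccount.com"


def reader_binding(project_number: str) -> dict:
    """Build an IAM policy binding that grants the default compute SA read access.

    Assumption: roles/artifactregistry.reader is the role being granted.
    """
    return {
        "role": "roles/artifactregistry.reader",
        "members": [f"serviceAccount:{compute_default_sa(project_number)}"],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder, not a project number from this issue.
    print(json.dumps(reader_binding("123456789012"), indent=2))
```

The email format comes straight from the issue description; only the role ID and the placeholder number are assumptions.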

@zmerlynn
Collaborator

#162 to cross-reference ^

@markmandel
Member

Curious about something: is this a role that the compute instance would get by default when you enable GKE? I've never had to grant this manually on any project 🤔, so why did this happen here?

I'm wondering if #162 is actually just hiding a race condition on the GKE cluster, or am I off base?

@markmandel
Member

Actually, let me rephrase: should the GKE cluster have a depends_on on the K8s API being fully enabled?

@AlexBulankou, can you please share the exact input and output you were getting? Was it an error in the Terraform, a specific image, or something else?

@AlexBulankou
Author

I did not get any deployment errors, but the container could not pull the image until I added access explicitly. I'm not an expert, but intuitively I would be surprised if a newly created registry granted the compute default service account access by default, because that would mean any cluster in the project has this access by default. I'm not sure that is desired behavior for many organizations (vs. granting a dedicated service account access to a given registry).

@markmandel
Member

I think this is fixed now, but to confirm:

> I did not get any deployment errors, but the container could not pull the image

Sorry, I'm not sure I'm following: containers don't pull images. Were you seeing ImagePullBackOff errors in your GKE clusters? If so, which clusters? All of them? Some of them?

Which workloads, which Deployments, which clusters? Did some work and others not? Screenshots and details here would be very useful.

@AlexBulankou
Author

> Were you seeing ImagePullBackOff errors in your GKE clusters? If so, which clusters? All of them? Some of them?

Yes, I was seeing it on game server workloads. I did not check whether it affected all of them or only some. Here's an example:

{
  "insertId": "wlovxtp97bip59w8",
  "jsonPayload": {
    "_GID": "0",
    "PRIORITY": "6",
    "_PID": "1790",
    "SYSLOG_IDENTIFIER": "kubelet",
    "_SYSTEMD_UNIT": "kubelet.service",
    "_MACHINE_ID": "c6aa1e71abcbcf4326b3fdcbf82684e1",
    "_SYSTEMD_INVOCATION_ID": "12fccd8e939940818873f98ba85e7ae0",
    "_CAP_EFFECTIVE": "1ffffffffff",
    "_BOOT_ID": "0a3608b3b8544bf7b2f9fb860e66d631",
    "_UID": "0",
    "_SYSTEMD_CGROUP": "/system.slice/kubelet.service",
    "_SYSTEMD_SLICE": "system.slice",
    "_TRANSPORT": "stdout",
    "_COMM": "kubelet",
    "MESSAGE": "E0325 18:59:44.857798    1790 pod_workers.go:951] \"Error syncing pod, skipping\" err=\"failed to \\\"StartContainer\\\" for \\\"droidshooter\\\" with ImagePullBackOff: \\\"Back-off pulling image \\\\\\\"us-docker.pkg.dev/alexbu-gke-dev/global-game-images/droidshooter-server:b40b146a-8390-4569-abd7-abd5c509b1ec\\\\\\\"\\\"\" pod=\"default/droidshooter-bzlbw-qpjqv\" podUID=8d5da6d4-68d1-4c84-85d2-8407a9581739",
    "_HOSTNAME": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
    "_CMDLINE": "/home/kubernetes/bin/kubelet --v=2 --cloud-provider=gce --experimental-mounter-path=/home/kubernetes/containerized_mounter/mounter --cert-dir=/var/lib/kubelet/pki/ --kubeconfig=/var/lib/kubelet/kubeconfig --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_256_GCM_SHA384,TLS_RSA_WITH_AES_128_GCM_SHA256 --max-pods=32 --volume-plugin-dir=/home/kubernetes/flexvolume --node-status-max-images=25 --container-runtime=remote --container-runtime-endpoint=unix:///run/containerd/containerd.sock --runtime-cgroups=/system.slice/containerd.service --registry-qps=10 --registry-burst=20 --config /home/kubernetes/kubelet-config.yaml \"--pod-sysctls=net.core.somaxconn=1024,net.ipv4.conf.all.accept_redirects=0,net.ipv4.conf.all.forwarding=1,net.ipv4.conf.all.route_localnet=1,net.ipv4.conf.default.forwarding=1,net.ipv4.ip_forward=1,net.ipv4.tcp_fin_timeout=60,net.ipv4.tcp_keepalive_intvl=60,net.ipv4.tcp_keepalive_probes=5,net.ipv4.tcp_keepalive_time=300,net.ipv4.tcp_rmem=4096 87380 6291456,net.ipv4.tcp_syn_retries=6,net.ipv4.tcp_tw_reuse=0,net.ipv4.tcp_wmem=4096 16384 4194304,net.ipv4.udp_rmem_min=4096,net.ipv4.udp_wmem_min=4096,net.ipv6.conf.all.disable_ipv6=1,net.ipv6.conf.default.accept_ra=0,net.ipv6.conf.default.disable_ipv6=1,net.netfilter.nf_conntrack_generic_timeout=600,net.netfilter.nf_conntrack_tcp_be_liberal=1,net.netfilter.nf_conntrack_tcp_timeout_close_wait=3600,net.netfilter.nf_conntrack_tcp_timeout_established=86400\" --pod-infra-container-image=gke.gcr.io/pause:3.6@sha256:10008c36b4611b44db1229451675d5d7d86c7ddf4ef00f883d806a01547203f6",
    "_STREAM_ID": "1423d9289b624b53b7196a781694f575",
    "_EXE": "/home/kubernetes/bin/kubelet",
    "SYSLOG_FACILITY": "3"
  },
  "resource": {
    "type": "k8s_node",
    "labels": {
      "node_name": "gk3-global-game-us-centr-nap-10413t6d-18671094-nxpq",
      "cluster_name": "global-game-us-central1-02",
      "location": "us-central1",
      "project_id": "alexbu-gke-dev"
    }
  },
  "timestamp": "2023-03-25T18:59:44.857881Z",
  "logName": "projects/alexbu-gke-dev/logs/kubelet",
  "receiveTimestamp": "2023-03-25T18:59:49.792357873Z"
}
{
  "insertId": "ezoa0uf99z2sz",
  "jsonPayload": {
    "kind": "Event",
    "apiVersion": "v1",
    "reportingInstance": "",
    "eventTime": null,
    "message": "Error: ImagePullBackOff",
    "reason": "Failed",
    "type": "Warning",
    "source": {
      "host": "gke-global-game-eu-west1-01-default-edbb1dd5-bdf8",
      "component": "kubelet"
    },
    "involvedObject": {
      "fieldPath": "spec.containers{droidshooter}",
      "uid": "8c645eaf-4f8f-4a9c-a467-e60a152aeb69",
      "name": "droidshooter-nmlfb-j9xwn",
      "kind": "Pod",
      "resourceVersion": "1774080",
      "apiVersion": "v1",
      "namespace": "default"
    },
    "lastTimestamp": "2023-03-25T18:59:45Z",
    "metadata": {
      "name": "droidshooter-nmlfb-j9xwn.174fbea1220c6344",
      "creationTimestamp": "2023-03-25T18:59:45Z",
      "namespace": "default",
      "resourceVersion": "38876",
      "managedFields": [
        {
          "fieldsV1": {
            "f:involvedObject": {},
            "f:type": {},
            "f:source": {
              "f:component": {},
              "f:host": {}
            },
            "f:lastTimestamp": {},
            "f:count": {},
            "f:reason": {},
            "f:firstTimestamp": {},
            "f:message": {}
          },
          "manager": "kubelet",
          "fieldsType": "FieldsV1",
          "operation": "Update",
          "apiVersion": "v1",
          "time": "2023-03-25T18:59:45Z"
        }
      ],
      "uid": "5d3d7ffd-d898-47da-b451-a57097419750"
    },
    "reportingComponent": ""
  },
  "resource": {
    "type": "k8s_pod",
    "labels": {
      "project_id": "alexbu-gke-dev",
      "location": "europe-west1",
      "namespace_name": "default",
      "cluster_name": "global-game-eu-west1-01",
      "pod_name": "droidshooter-nmlfb-j9xwn"
    }
  },
  "timestamp": "2023-03-25T18:59:45Z",
  "severity": "WARNING",
  "logName": "projects/alexbu-gke-dev/logs/events",
  "receiveTimestamp": "2023-03-25T18:59:45.747550125Z"
}
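
To tie a failing pull back to a specific repository, the image reference in the kubelet `MESSAGE` above can be split into its parts. This is a minimal Python sketch under stated assumptions: it only handles the `HOST/PROJECT/REPOSITORY/IMAGE:TAG` shape seen in this log (single-segment image name, explicit tag, no digest) and rejects anything else rather than guessing.

```python
from typing import NamedTuple


class ArtifactImage(NamedTuple):
    host: str
    project: str
    repository: str
    image: str
    tag: str


def parse_artifact_image(ref: str) -> ArtifactImage:
    """Split an Artifact Registry Docker image reference into its parts.

    Assumes HOST/PROJECT/REPOSITORY/IMAGE:TAG; other forms raise ValueError.
    """
    path, sep, tag = ref.rpartition(":")
    # No colon at all, or the last colon belongs to a host:port, means no tag.
    if not sep or "/" in tag:
        raise ValueError(f"no tag in {ref!r}")
    parts = path.split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected image reference {ref!r}")
    host, project, repository, image = parts
    return ArtifactImage(host, project, repository, image, tag)


if __name__ == "__main__":
    ref = ("us-docker.pkg.dev/alexbu-gke-dev/global-game-images/"
           "droidshooter-server:b40b146a-8390-4569-abd7-abd5c509b1ec")
    img = parse_artifact_image(ref)
    print(img.project, img.repository)  # alexbu-gke-dev global-game-images
```

For the log above, this points at the `global-game-images` repository in the `alexbu-gke-dev` project, which is the repository whose IAM policy the node's service account needs to be able to read.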

@markmandel
Member

So if I look at my global-game-images registry, I see this permission on it:
[screenshot: permissions on the global-game-images registry]

Looking at your project, I see the same permissions set on that registry, so the compute service account should be able to read from it.
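
The comparison described here (does the registry's IAM policy already let the compute service account read?) can be sketched as a check over the policy in its JSON form. This is a minimal Python sketch, not a tool used in this thread; the set of roles treated as granting read access is an assumption, and the check ignores IAM conditions, project-level grants, and anything outside the policy document, so a passing result does not prove a pull will succeed.

```python
from typing import Iterable

# Assumption: these repository-level roles all include permission to pull images.
READ_ROLES = frozenset({
    "roles/artifactregistry.reader",
    "roles/artifactregistry.writer",
    "roles/artifactregistry.repoAdmin",
    "roles/artifactregistry.admin",
})


def can_read(policy: dict, member: str, read_roles: Iterable[str] = READ_ROLES) -> bool:
    """Return True if any binding in an IAM policy dict grants `member` a read role.

    `policy` is assumed to have the JSON shape of an IAM policy:
    {"bindings": [{"role": "...", "members": ["..."]}, ...]}.
    Conditional bindings are treated as unconditional.
    """
    roles = set(read_roles)
    return any(
        binding.get("role") in roles and member in binding.get("members", [])
        for binding in policy.get("bindings", [])
    )


if __name__ == "__main__":
    # Hypothetical policy with a placeholder project number.
    sa = "serviceAccount:123456789012-compute@developer.gserviceaccount.com"
    policy = {"bindings": [{"role": "roles/artifactregistry.reader", "members": [sa]}]}
    print(can_read(policy, sa))  # True
```

If this kind of check passes on the repository but pulls still fail, the difference is more likely in the service account's project-level roles, which is where the comparison below goes next.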

Looking at the permissions on the compute service account in my project, I see:

[screenshot: roles granted to the compute service account]

Weirdly, when I look at your compute service account, it doesn't match this: it's missing the role highlighted here.

Since we've merged #162, is that fixed now?

I'm also wondering what extra org policies you may have in effect that differ from a "standard" GCP project.


4 participants