[Infrastructure] Add Cloud Monitoring #139

Open
abmarcum opened this issue Mar 10, 2023 · 4 comments

@abmarcum
Collaborator

Add GCP Cloud Monitoring to the project to alert on service availability.

Use Terraform to create the following:

Dashboard
Uptime checks for Endpoints
Service Availability - GKE, Redis, Spanner, Endpoints
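
As a rough illustration of the uptime-check item, the sketch below uses the Google provider's `google_monitoring_uptime_check_config` resource. The variable names (`project_id`, `endpoint_host`), the display name, and the path, port, and timing values are assumptions for illustration only, not settings taken from this project.

```hcl
# Hypothetical inputs; names and defaults are assumptions, not project settings.
variable "project_id" {
  type        = string
  description = "GCP project that hosts the monitored endpoint."
}

variable "endpoint_host" {
  type        = string
  description = "Hostname of the endpoint to probe (placeholder)."
}

# HTTPS uptime check against a single endpoint.
resource "google_monitoring_uptime_check_config" "endpoint" {
  project      = var.project_id
  display_name = "endpoint-uptime-check"
  timeout      = "10s" # how long to wait for a response
  period       = "60s" # how often the check runs

  http_check {
    path    = "/"
    port    = 443
    use_ssl = true
  }

  monitored_resource {
    type = "uptime_url"
    labels = {
      project_id = var.project_id
      host       = var.endpoint_host
    }
  }
}
```

One check per endpoint keeps failures easy to attribute; a `for_each` over a list of hosts would be a natural extension if there are several endpoints.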

Make monitoring optional: add an enable true/false flag.
Add a variable for the alert notification email address and place it in terraform.tfvars.sample.
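
One possible shape for the optional flag and the notification email is sketched below. The variable names `enable_monitoring` and `alert_notification_email` are hypothetical, and `count` is only one of several ways to make a resource conditional in Terraform.

```hcl
# Hypothetical variable names; adjust to the project's conventions.
variable "enable_monitoring" {
  type        = bool
  default     = false
  description = "Set to true to create Cloud Monitoring resources."
}

variable "alert_notification_email" {
  type        = string
  default     = ""
  description = "Email address that receives alert notifications."
}

# Created only when monitoring is enabled (count is 0 otherwise).
resource "google_monitoring_notification_channel" "email" {
  count        = var.enable_monitoring ? 1 : 0
  display_name = "Alert email"
  type         = "email"

  labels = {
    email_address = var.alert_notification_email
  }
}

# Matching entries for terraform.tfvars.sample (placeholder values):
#   enable_monitoring        = true
#   alert_notification_email = "alerts@example.com"
```

Other monitoring resources could reuse the same `count` expression so that a single flag turns the whole feature on or off.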

Add additional monitoring checks as they are suggested.

@abmarcum abmarcum self-assigned this Mar 10, 2023
@markmandel
Member

Fun question: Cloud Monitoring or managed Prometheus, or both???

@abmarcum
Collaborator Author

Cloud Monitoring can handle all GCP resources and most of the standard GKE metrics are available in Cloud Monitoring.

But for game-specific/GKE workloads, Prometheus might be a better choice.

The question then becomes: do you want to manage both?

I would suggest a two-phase approach. First, get critical systems into Cloud Monitoring so that core systems alert on any issues. This is straightforward, and all we need to determine is what we alert on. Then, as game monitoring requirements arise, we look at whether they can work in Cloud Monitoring or whether Prometheus is a better approach.

My 2 cents.

@markmandel
Member

> The question then becomes: do you want to manage both?

My thought was more - some would like Cloud Monitoring, some would like managed Prometheus. I've seen both in the wild.

@bbhuston
Collaborator

bbhuston commented Apr 14, 2023

@abmarcum

I'm not sure how helpful this is for you, but the gcloud command below is a 'fully loaded' one that I often use. It has all the bells and whistles turned on for GKE in the monitoring, logging, resource monitoring (aka 'cost monitoring'), and notifications areas, including monitoring for the Google-managed k8s control plane components.

The names of the gcloud feature flags (and their corresponding values) are essentially a one-to-one mapping to the key/value pairs that the GKE Terraform module uses, so hopefully this helps save some time stubbing out something here.

gcloud beta container --project ${PROJECT_ID} clusters create ${CLUSTER_NAME} \
  --region ${REGION} \
  --no-enable-basic-auth \
  --release-channel "rapid" \
  --machine-type "e2-highcpu-4" \
  --image-type "COS_CONTAINERD" \
  --disk-type "pd-standard" \
  --disk-size "100" \
  --metadata disable-legacy-endpoints=true \
  --num-nodes "2" \
  --enable-autoscaling \
  --min-nodes "0" \
  --max-nodes "3" \
  --scopes "https://www.googleapis.com/auth/devstorage.read_only","https://www.googleapis.com/auth/logging.write","https://www.googleapis.com/auth/monitoring","https://www.googleapis.com/auth/servicecontrol","https://www.googleapis.com/auth/service.management.readonly","https://www.googleapis.com/auth/trace.append" \
  --max-pods-per-node "110" \
  --enable-private-nodes \
  --master-ipv4-cidr "172.16.0.0/28" \
  --enable-ip-alias \
  --network "projects/${PROJECT_ID}/global/networks/config-admin-vpc" \
  --subnetwork "projects/${PROJECT_ID}/regions/${REGION}/subnetworks/config-admin-vpc" \
  --cluster-ipv4-cidr "192.168.0.0/16" \
  --services-ipv4-cidr "192.169.0.0/16" \
  --no-enable-intra-node-visibility \
  --default-max-pods-per-node "110" \
  --enable-dataplane-v2 \
  --enable-master-authorized-networks \
  --master-authorized-networks 0.0.0.0/0 \
  --addons HorizontalPodAutoscaling,HttpLoadBalancing,GcePersistentDiskCsiDriver,ConfigConnector,BackupRestore \
  --enable-autoupgrade \
  --enable-autorepair \
  --max-surge-upgrade 1 \
  --max-unavailable-upgrade 0 \
  --labels mesh_id=proj-${PROJECT_NUMBER} \
  --resource-usage-bigquery-dataset ${BQ_DATASET_NAME} \
  --enable-resource-consumption-metering \
  --workload-pool "${PROJECT_ID}.svc.id.goog" \
  --enable-shielded-nodes \
  --security-group "gke-security-groups@${GOOGLE_ADMIN_DOMAIN}" \
  --notification-config=pubsub=ENABLED,pubsub-topic=projects/${PROJECT_ID}/topics/${ARGOCD_PUBSUB_TOPIC} \
  --enable-image-streaming \
  --logging=SYSTEM,WORKLOAD \
  --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER \
  --enable-managed-prometheus \
  --enable-workload-config-audit
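
For comparison, here is a minimal sketch of how the logging and monitoring flags from that command might look in Terraform. It uses the provider's `google_container_cluster` resource directly rather than the GKE module mentioned above, so the argument names differ from the module's inputs, and the component values are spelled differently from the gcloud ones (for example `SYSTEM_COMPONENTS` instead of `SYSTEM`). The variable names and the node count are assumptions.

```hcl
# Hypothetical inputs for illustration.
variable "cluster_name" {
  type = string
}

variable "region" {
  type = string
}

resource "google_container_cluster" "example" {
  name               = var.cluster_name
  location           = var.region
  initial_node_count = 1 # placeholder; real node pool sizing is out of scope here

  # Roughly corresponds to --logging=SYSTEM,WORKLOAD
  logging_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
  }

  # Roughly corresponds to
  # --monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER
  monitoring_config {
    enable_components = [
      "SYSTEM_COMPONENTS",
      "APISERVER",
      "SCHEDULER",
      "CONTROLLER_MANAGER",
    ]

    # Roughly corresponds to --enable-managed-prometheus
    managed_prometheus {
      enabled = true
    }
  }
}
```

The remaining flags (networking, autoscaling, metering, notifications) are left out to keep the sketch focused on the monitoring-related settings.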

@googleforgames googleforgames deleted a comment from bbhuston Apr 14, 2023