
Set auto_provisioning_defaults.service_account #639

Merged


@dpetersen (Contributor)

This sets the Service Account that should be used by node VMs created by
node auto-provisioning. This should cause the auto-provisioned nodes to
have the same permissions as the nodes that are manually provisioned.

This is to address #560. I will be testing this in my cluster that has NAP enabled. As far as I can tell, it's safe to set this value whether auto-provisioning is actually enabled or disabled, and local.service_account is the right default: it's what any manual node_pool gets when the service account isn't specifically overridden.
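
Roughly, this is a sketch of how the default is wired into the cluster resource (not the exact diff; local.service_account is the module's existing local for the node service account):

cluster_autoscaling {
  enabled = var.cluster_autoscaling.enabled

  auto_provisioning_defaults {
    # Same service account that manually provisioned node pools get by default.
    service_account = local.service_account
  }
}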

@comment-bot-dev commented Aug 21, 2020

Thanks for the PR! 🚀
✅ Lint checks have passed.

@dpetersen (Contributor, Author)

I have verified that this fixes NAP on my cluster. I was also able to stand up a cluster with the default settings and saw that NAP was not enabled.

As for the CI failure above, I'm not sure if I caused that or how. I get some similar errors when I run make docker_lint_test locally. I'm not 100% sure if I should be running the make build command and committing the output here, or if that's part of the release process? Let me know if there's anything else I should do, and if you think I broke CI.

@bharathkkb (Member) commented Aug 22, 2020

@dpetersen Could you rebase on master? We had some issues with the tf0.13 release that have since been fixed on master.
Update: Looks like I was able to; the update branch button popped up :)

@dpetersen (Contributor, Author)

Doh, @bharathkkb, I saw your email this weekend and didn't notice your update until I'd already rebased and force-pushed. Looks like one job is passing, but linting is not. I am unable to see either of the jobs in Cloud Build to see what failed. I did manually try to replicate what terraform fmt would have done with the indentation of my change.

@dpetersen (Contributor, Author)

I have rebased this onto master and also added one additional overridable attribute for node auto-provisioning that came up in my testing. I noticed that my NAP (node auto-provisioning) nodes could not pull images from my private registry, despite running as a service account that had access. I realized that those nodes had different OAuth scopes from my manual nodes:

https://www.googleapis.com/auth/logging.write
https://www.googleapis.com/auth/monitoring

my other pool, set up with this module, had the module's default:

https://www.googleapis.com/auth/cloud-platform

The logging.write + monitoring scopes came from GKE defaults, as far as I can tell. I have left those as the default in my additions to this module, but allowed them to be overridden. I'm not sure whether the stance of this module is to provide its own suggested defaults or to just stick with GKE's. Personally, I would be less confused if the default for every node pool created were identical, since the other nodes use the full cloud-platform scope. So I have the GKE defaults in right now, but I'm certainly happy to change it.

@morgante (Contributor)

@dpetersen If scopes are not provided, we should probably pick up whatever scopes are specified on the existing node pools (maybe the first node pool), which also default to cloud-platform. Does that make sense?

@dpetersen (Contributor, Author)

That makes sense; I will try to figure it out. Unfortunately, I'm running into a few unrelated issues here. First, I think #676 is breaking my local terraform apply. I will make a separate issue for that and un-rebase this branch for now. I also cannot see the lint output from this failing CI check, so I don't know what to fix. I can see the lint output from my other PR, #682, and it is complaining about code I didn't change.

@dpetersen (Contributor, Author)

Well, I learned something new today. The cluster_autoscaling object I have added the oauth_scopes to cannot have any optional fields.

variable "cluster_autoscaling" {
type = object({
enabled = bool
autoscaling_profile = string
min_cpu_cores = number
max_cpu_cores = number
min_memory_gb = number
max_memory_gb = number
})
default = {
enabled = false
autoscaling_profile = "BALANCED"
max_cpu_cores = 0
min_cpu_cores = 0
max_memory_gb = 0
min_memory_gb = 0
}
description = "Cluster autoscaling configuration. See [more details](https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#clusterautoscaling)"
}

There is a long-running, open GitHub issue on the topic, but if I add oauth_scopes to this object, it becomes a required field. If you want to override any of these fields, you have to override them all.

It probably doesn't make sense for OAuth scopes to be required just to enable NAP. And this isn't the last NAP-related override that should probably be optional. At some point, surge upgrade settings for NAP will be supported in the provider, and those should certainly be optional. I think you could probably make the argument that autoscaling_profile should be optional, too.

So the scenario you described above is not possible with this configuration: you can't enable NAP without specifying your OAuth scopes. I suppose you could specify an empty list, but that would be confusing to use. I think I should probably make a new, separate variable, like cluster_autoscaling_oauth_scopes or something. Does that sound reasonable? I'm not a particularly experienced Terraform module author, so I'm not sure what other tricks may exist to keep this ergonomic.
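
For what it's worth, here's a rough sketch of that separate variable (the name is just my placeholder from above, and the defaults mirror the GKE scopes I observed):

variable "cluster_autoscaling_oauth_scopes" {
  type        = list(string)
  description = "OAuth scopes for node VMs created by node auto-provisioning."
  # Placeholder defaults: the scopes GKE appeared to apply when none were set.
  default = [
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
  ]
}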

@morgante (Contributor) commented Sep 22, 2020

Thanks for investigating; I had forgotten about the issue with defaults for objects.

In this case, I think we actually have a nice solution: simply reference local.node_pools_oauth_scopes["all"], which we have already defined. This gives a sensible default, and it makes sense that auto-provisioned node pools should have scopes matching "all" your node pools.
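
As a sketch, something like this inside the auto_provisioning_defaults block (local.node_pools_oauth_scopes is the module's existing local keyed by node pool name, including an "all" entry):

auto_provisioning_defaults {
  service_account = local.service_account
  # Fall back to whatever scopes apply to "all" node pools in this module.
  oauth_scopes    = local.node_pools_oauth_scopes["all"]
}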

@dpetersen (Contributor, Author)

That makes sense and works great for me. I have updated the PR and removed that variable.

@dpetersen (Contributor, Author)

Ugh, I found one additional issue. I set up a minimal cluster to test one of my other PRs, and that cluster did not have NAP enabled. But the plan would continually say it wanted to change my cluster with:

~ cluster_autoscaling {
      autoscaling_profile = "BALANCED"
      enabled             = false

    + auto_provisioning_defaults {
        + oauth_scopes    = [
            + "https://www.googleapis.com/auth/cloud-platform",
          ]
        + service_account = "tf-gke-prtest-sde5@content-don-5ea5.iam.gserviceaccount.com"
      }
  }

It would apply cleanly, but then continue to show the same desired changes over and over. So I think maybe the API doesn't save your auto-provisioning defaults when autoscaling is disabled. The code for a conditional block (which I stole from a Medium article) feels like a hack, but it does seem to work.
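
For reference, the conditional-block pattern is essentially a dynamic block gated on the enabled flag; a sketch, assuming the defaults should only be sent when autoscaling is enabled:

cluster_autoscaling {
  enabled = var.cluster_autoscaling.enabled

  # Only emit auto_provisioning_defaults when autoscaling is enabled, since the
  # API does not appear to persist them otherwise and the plan never converges.
  dynamic "auto_provisioning_defaults" {
    for_each = var.cluster_autoscaling.enabled ? [1] : []

    content {
      service_account = local.service_account
      oauth_scopes    = local.node_pools_oauth_scopes["all"]
    }
  }
}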

This sets the Service Account that should be used by node VMs created by
node auto-provisioning. This should cause the auto-provisioned nodes to
have the same permissions as the nodes that are manually provisioned.
The defaults I included come from the scopes I observed in a cluster I
stood up when no scopes were specified. I am assuming these are GKE
defaults. This does not match the default scopes for normal node pools
in this Terraform module, so it may not be the correct choice.
The API will not reject the defaults, but it also does not save them
when auto-provisioning is disabled. This means your plan will
continually attempt to update the attributes, succeed, and then still
think they need to be updated.
@dpetersen (Contributor, Author)

Hey everybody, I'm back to pester you about this pull request. We have been running this for a couple of months, but we're upgrading to Terraform 0.14, so I have rebased this pull request on master (which was the same as v12.3.0 at the moment I pulled and rebased). I am testing the updated version of this pull request on our staging clusters (it was previously based on v12.0.0, I believe). Rebasing doesn't appear to have changed anything, which is great.

As far as I know, I have implemented the requested changes from your feedback. Ever since I first created this pull request, CI has been failing. Unfortunately, I don't have rights to view the failure to see what the issue is.

Please let me know if there's anything I can do to get this in a release version of the module. Thanks so much, this module saves us a ton of time!

@morgante (Contributor) left a comment

Sorry for the delayed response.

The failing test is:

     expected: {"autoprovisioningNodePoolDefaults"=>{"oauthScopes"=>["https://www.googleapis.com/auth/logging.write"...inimum"=>"5", "resourceType"=>"cpu"}, {"maximum"=>"30", "minimum"=>"10", "resourceType"=>"memory"}]}
          got: {"autoprovisioningNodePoolDefaults"=>{"oauthScopes"=>["https://www.googleapis.com/auth/cloud-platform...inimum"=>"5", "resourceType"=>"cpu"}, {"maximum"=>"30", "minimum"=>"10", "resourceType"=>"memory"}]}

It is configured here with this fixture.

I don't expect this to fix the tests; there is still the issue of the
serviceAccount.
@dpetersen (Contributor, Author)

Oh, thank you. I wish it weren't truncating the hash like that, but the scopes being different does make sense. I have just pushed up a commit that should fix that, but I don't expect the tests to pass. I'm trying to understand how these specs are run. I'm fairly certain that CI will still fail, but now because of the node pool service account.

The specs expect "default", but that's no longer the case. What it will be now is sort of a mystery to me; I think it will be dynamic for each run. It comes from a var in the fixture. Can I somehow make that available in my tests, like how the Project ID is available? I'm fairly proficient with Ruby and RSpec from my past, but how these tests are executed is a bit of a mystery. I can't find any existing tests that check the service account elsewhere. I can test everything else and ignore the service account name if you'd rather.

@morgante (Contributor)

@dpetersen Looks like it succeeded after all. :)

FYI for the future you can run tests locally to verify: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/CONTRIBUTING.md#integration-testing

@morgante merged commit 4a61f76 into terraform-google-modules:master on Dec 29, 2020
@dpetersen (Contributor, Author)

Doh! I went looking for that in the README; I didn't notice the other file. Thanks again, really glad this is in master, and I look forward to the next release. I appreciate all your help with this and your work on the modules in this ecosystem; they're working great for us.

@dpetersen deleted the nap-service-account branch December 29, 2020 21:54
@dpetersen restored the nap-service-account branch December 30, 2020 00:32