
Set auto_provisioning_defaults.service_account #639

Merged


@dpetersen (Contributor)

This sets the Service Account that should be used by node VMs created by
node auto-provisioning. This should cause the auto-provisioned nodes to
have the same permissions as the nodes that are manually provisioned.

This is to address #560. I will be testing this in my cluster that has NAP enabled. As far as I can tell, it's safe to set this value whether auto-provisioning is actually enabled or disabled, and local.service_account is the right default: it's what any manual node_pool gets when the service account isn't specifically overridden.
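
Roughly, this is a sketch of how the default is wired into the cluster resource (not the exact diff; local.service_account is the module's existing local for the node service account):

cluster_autoscaling {
  enabled = var.cluster_autoscaling.enabled

  auto_provisioning_defaults {
    # Same service account that manually provisioned node pools get by default.
    service_account = local.service_account
  }
}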

@comment-bot-dev commented Aug 21, 2020

Thanks for the PR! 🚀
✅ Lint checks have passed.

@dpetersen (Contributor, Author)

I have verified that this fixes NAP on my cluster. I was also able to stand up a cluster with the default settings and saw that NAP was not enabled.

As for the CI failure above, I'm not sure if I caused that or how. I get some similar errors when I run make docker_lint_test locally. I'm not 100% sure if I should be running the make build command and committing the output here, or if that's part of the release process? Let me know if there's anything else I should do, and if you think I broke CI.

@bharathkkb (Member) commented Aug 22, 2020

@dpetersen Could you rebase on master? We had some issues with the tf0.13 release that have since been fixed on master.
Update: Looks like I was able to; the update branch button popped up :)

@dpetersen (Contributor, Author)

Doh, @bharathkkb, I saw your email this weekend and didn't notice your update until I'd already rebased and force-pushed. Looks like one job is passing, but linting is not. I am unable to see either of the jobs in Cloud Build to see what failed. I did manually try to replicate what terraform fmt would have done with the indentation of my change.

@dpetersen (Contributor, Author)

I have rebased this onto master and also added one additional overridable attribute for node auto-provisioning that came up in my testing. I noticed that my NAP (node auto-provisioning) nodes could not pull images from my private registry, despite running as a service account that had access. I realized that those nodes had different OAuth scopes from my manual nodes:

https://www.googleapis.com/auth/logging.write
https://www.googleapis.com/auth/monitoring

my other pool, set up with this module, had the module's default:

https://www.googleapis.com/auth/cloud-platform

The logging.write + monitoring scopes came from GKE defaults, as far as I can tell. I have left those as the default in my additions to this module, but allowed them to be overridden. I'm not sure whether the stance of this module is to provide its own suggested defaults or to just stick with GKE's. Personally, I would be less confused if the default for every node pool created were identical, since the other nodes use the full cloud-platform scope. So I have the GKE defaults in right now, but I'm certainly happy to change it.

@morgante (Contributor)

@dpetersen If scopes are not provided, we should probably pick up whatever scopes are specified on the existing node pools (maybe the first node pool), which also default to cloud-platform. Does that make sense?

@dpetersen (Contributor, Author)

That makes sense; I will try to figure it out. Unfortunately, I'm running into a few unrelated issues here. First, I think #676 is breaking my local terraform apply. I will make a separate issue for that and un-rebase this branch for now. I also cannot see the lint output from this failing CI check, so I don't know what to fix. I can see the lint output from my other PR, #682, and it is complaining about code I didn't change.

@dpetersen (Contributor, Author)

Well, I learned something new today. The cluster_autoscaling object I have added the oauth_scopes to cannot have any optional fields.

variable "cluster_autoscaling" {
type = object({
enabled = bool
autoscaling_profile = string
min_cpu_cores = number
max_cpu_cores = number
min_memory_gb = number
max_memory_gb = number
})
default = {
enabled = false
autoscaling_profile = "BALANCED"
max_cpu_cores = 0
min_cpu_cores = 0
max_memory_gb = 0
min_memory_gb = 0
}
description = "Cluster autoscaling configuration. See [more details](https://cloud.google.com/kubernetes-engine/docs/reference/rest/v1beta1/projects.locations.clusters#clusterautoscaling)"
}

There is a long-running, open GitHub issue on the topic, but if I add oauth_scopes to this object, it becomes a required field. If you want to override any of these fields, you have to override them all.

It probably doesn't make sense for OAuth scopes to be required just to enable NAP. And this isn't the last NAP-related override that should probably be optional. At some point, surge upgrade settings for NAP will be supported in the provider, and those should certainly be optional. I think you could probably make the argument that autoscaling_profile should be optional, too.

So the scenario you described above is not possible with this configuration: you can't enable NAP without specifying your OAuth scopes. I suppose you could specify an empty list, but that would be confusing to use. I think I should probably make a new, separate variable, like cluster_autoscaling_oauth_scopes or something. Does that sound reasonable? I'm not a particularly experienced Terraform module author, so I'm not sure what other tricks may exist to keep this ergonomic.
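
For what it's worth, here's a rough sketch of that separate variable (the name is just my placeholder from above, and the defaults mirror the GKE scopes I observed):

variable "cluster_autoscaling_oauth_scopes" {
  type        = list(string)
  description = "OAuth scopes for node VMs created by node auto-provisioning."
  # Placeholder defaults: the scopes GKE appeared to apply when none were set.
  default = [
    "https://www.googleapis.com/auth/logging.write",
    "https://www.googleapis.com/auth/monitoring",
  ]
}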

@morgante (Contributor) commented Sep 22, 2020

Thanks for investigating; I had forgotten about the issue with defaults for objects.

In this case, I think we actually have a nice solution: simply reference local.node_pools_oauth_scopes["all"], which we have already defined. This gives a sensible default, and it makes sense that auto-provisioned node pools should have scopes matching "all" your node pools.
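
As a sketch, something like this inside the auto_provisioning_defaults block (local.node_pools_oauth_scopes is the module's existing local keyed by node pool name, including an "all" entry):

auto_provisioning_defaults {
  service_account = local.service_account
  # Fall back to whatever scopes apply to "all" node pools in this module.
  oauth_scopes    = local.node_pools_oauth_scopes["all"]
}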

@dpetersen (Contributor, Author)

That makes sense and works great for me. I have updated the PR and removed that variable.

@dpetersen (Contributor, Author)

Ugh, I found one additional issue. I set up a minimal cluster to test one of my other PRs, and that cluster did not have NAP enabled. But the plan would continually say it wanted to change my cluster with:

~ cluster_autoscaling {
      autoscaling_profile = "BALANCED"
      enabled             = false

    + auto_provisioning_defaults {
        + oauth_scopes    = [
            + "https://www.googleapis.com/auth/cloud-platform",
          ]
        + service_account = "tf-gke-prtest-sde5@content-don-5ea5.iam.gserviceaccount.com"
      }
  }

It would apply cleanly, but then continue to show the same desired changes over and over. So I think maybe the API doesn't save your auto-provisioning defaults when autoscaling is disabled. The code for a conditional block (which I stole from a Medium article) feels like a hack, but it does seem to work.
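
For reference, the conditional-block pattern is essentially a dynamic block gated on the enabled flag; a sketch, assuming the defaults should only be sent when autoscaling is enabled:

cluster_autoscaling {
  enabled = var.cluster_autoscaling.enabled

  # Only emit auto_provisioning_defaults when autoscaling is enabled, since the
  # API does not appear to persist them otherwise and the plan never converges.
  dynamic "auto_provisioning_defaults" {
    for_each = var.cluster_autoscaling.enabled ? [1] : []

    content {
      service_account = local.service_account
      oauth_scopes    = local.node_pools_oauth_scopes["all"]
    }
  }
}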

This sets the Service Account that should be used by node VMs created by
node auto-provisioning. This should cause the auto-provisioned nodes to
have the same permissions as the nodes that are manually provisioned.
The defaults I included come from the scopes I observed in a cluster I
stood up when no scopes were specified. I am assuming these are GKE
defaults. This does not match the default scopes for normal node pools
in this Terraform module, so it may not be the correct choice.
The API will not reject the defaults, but it also does not save them
when auto-provisioning is disabled. This means your plan will
continually attempt to update the attributes, succeed, and then still
think they need to be updated.
@dpetersen (Contributor, Author)

Hey everybody, I'm back to pester you about this pull request. We have been running this for a couple of months, but we're upgrading to Terraform 0.14, so I have rebased this pull request on master (which was the same as v12.3.0 at the moment I pulled and rebased). I am testing the updated version of this pull request on our staging clusters (it was previously based on v12.0.0, I believe). Rebasing doesn't appear to have changed anything, which is great.

As far as I know, I have implemented the requested changes from your feedback. Ever since I first created this pull request, CI has been failing. Unfortunately, I don't have rights to view the failure to see what the issue is.

Please let me know if there's anything I can do to get this in a release version of the module. Thanks so much, this module saves us a ton of time!

@morgante (Contributor) left a comment

Sorry for the delayed response.

The failing test is:

     expected: {"autoprovisioningNodePoolDefaults"=>{"oauthScopes"=>["https://www.googleapis.com/auth/logging.write"...inimum"=>"5", "resourceType"=>"cpu"}, {"maximum"=>"30", "minimum"=>"10", "resourceType"=>"memory"}]}
          got: {"autoprovisioningNodePoolDefaults"=>{"oauthScopes"=>["https://www.googleapis.com/auth/cloud-platform...inimum"=>"5", "resourceType"=>"cpu"}, {"maximum"=>"30", "minimum"=>"10", "resourceType"=>"memory"}]}

It is configured here with this fixture.

I don't expect this to fix the tests; there is still the issue of the
serviceAccount.
@dpetersen (Contributor, Author)

Oh, thank you. I wish it weren't truncating the hash like that, but the scopes being different does make sense. I have just pushed up a commit that should fix that, but I don't expect the tests to pass. I'm trying to understand how these specs are run. I'm fairly certain that CI will still fail, but now because of the node pool service account.

The specs expect "default", but that's no longer the case. What it will be now is sort of a mystery to me; I think it will be dynamic for each run. It comes from a var in the fixture. Can I somehow make that available in my tests, like how the Project ID is available? I'm fairly proficient with Ruby and RSpec from my past, but how these tests are executed is a bit of a mystery. I can't find any existing tests that check the service account elsewhere. I can test everything else and ignore the service account name if you'd rather.

@morgante (Contributor)

@dpetersen Looks like it succeeded after all. :)

FYI for the future you can run tests locally to verify: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/blob/master/CONTRIBUTING.md#integration-testing

@morgante merged commit 4a61f76 into terraform-google-modules:master on Dec 29, 2020
@dpetersen (Contributor, Author)

Doh! I went looking for that in the README; I didn't notice the other file. Thanks again, really glad this is in master, and I look forward to the next release. I appreciate all your help with this and your work on the modules in this ecosystem; they're working great for us.

@dpetersen deleted the nap-service-account branch December 29, 2020 21:54
@dpetersen restored the nap-service-account branch December 30, 2020 00:32