Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(helm chart): Add metrics-server hardening options #1288

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

mkilchhofer
Copy link

@mkilchhofer mkilchhofer commented Jul 14, 2023

What this PR does / why we need it:

It adds options on howto secure the connection between metrics-server and the kube-apiserver.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

related to:

Resolves #1230

Additional info:
The accepted FR #1230 is implemented also as I need this to test the hardening feature.

/cc: @serathius

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jul 14, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: mkilchhofer / name: Marco Kilchhofer (0f4d76f)

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 14, 2023
@k8s-ci-robot
Copy link
Contributor

Welcome @mkilchhofer!

It looks like this is your first PR to kubernetes-sigs/metrics-server 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/metrics-server has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jul 14, 2023
@k8s-ci-robot
Copy link
Contributor

Hi @mkilchhofer. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 14, 2023
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 14, 2023
@serathius
Copy link
Contributor

/assign @stevehipwell

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: mkilchhofer
Once this PR has been reviewed and has the lgtm label, please assign serathius for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copy link
Contributor

@stevehipwell stevehipwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mkilchhofer, I've got a couple of high level review point which it would be good if you could address before I review this in more details

  • There is no need to run functions for the APIService if it's not enabled
  • The helm mode would be a better default (see below)
  • You don't need to recreate the cert on each Helm install, you can look up the existing secret
  • I don't think the chart should support creating a secret, the secret option should only be existing
  • Please remove the extraObjects value, this isn't a "good" pattern

@mkilchhofer
Copy link
Author

  • There is no need to run functions for the APIService if it's not enabled

How do you mean that? The genSelfSignedCert function?

  • The helm mode would be a better default (see below)

Sure, will do

  • You don't need to recreate the cert on each Helm install, you can look up the existing secret

I think we should not rely on the lookup function. There are GitOps solutions in the wild which only use helm for rendering (helm template subcommand). The lookup function of helm returns nothing during template as it cannot talk to the kubeapi by design. Only upgrade or install talks to the apiserver.

  • I don't think the chart should support creating a secret, the secret option should only be existing

Hmm why not? It is pretty common that helm charts support this.

  • Please remove the extraObjects value, this isn't a "good" pattern

How can I test the existing secret method then? I think it is an approved feature request, so contributors can file PRs for this, ref:

@stevehipwell
Copy link
Contributor

How do you mean that? The genSelfSignedCert function?

You moved the test for if the APIService was enabled below the new code.

I think we should not rely on the lookup function. There are GitOps solutions in the wild which only use helm for rendering (helm template subcommand). The lookup function of helm returns nothing during template as it cannot talk to the kubeapi by design. Only upgrade or install talks to the apiserver.

In that case they can use one of the other patterns, GitOps shouldn't define architecture it should support best practice.

Hmm why not? It is pretty common that helm charts support this.

Because it's unnecessary and leaks data into the chart values.

How can I test the existing secret method then? I think it is an approved feature request, so contributors can file PRs for this, ref:

We could add a test secret as part or the CI initialisation or it could be generated through the chart as a testing resource.

@mkilchhofer
Copy link
Author

mkilchhofer commented Jul 17, 2023

Because it's unnecessary and leaks data into the chart values.

Well, why is that an issue? This is how helm is designed for. These values are stored as a K8s secret.

@stevehipwell
Copy link
Contributor

Well, why is that an issue? This is how helm is designed for. These values are stored as a K8s secret.

It's the unnecessary part which is the primary concern here; but the Helm implementation and increased surface area for secret information shouldn't be discounted as unimportant.

But as I've said in my other replies you can easily extend this chart by creating a wrapper chart which depends on it and at which point you can make your own architectural decisions.

@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 18, 2023
@mkilchhofer mkilchhofer force-pushed the feature/helm-chart_hardening branch 2 times, most recently from cddec52 to 81f71cd Compare July 18, 2023 06:28
Copy link
Contributor

@stevehipwell stevehipwell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes @mkilchhofer, I've added a couple of structural change requests.

charts/metrics-server/templates/apiservice.yaml Outdated Show resolved Hide resolved
{{- $altNames = append $altNames (printf "%s.%s" $commonName $ns) }}
{{- $altNames = append $altNames (printf "%s.%s.svc" $commonName $ns) }}
{{- $altNames = append $altNames (printf "%s.%s.svc.%s" $commonName $ns .Values.tls.clusterDomain) }}
{{- $certs = genSelfSignedCert $commonName nil $altNames 36500 }}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look up the existing secret instead of generating a new certificate each time the chart is deployed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can easily recreate the Secrets every helm upgrade. It's even more secure to change it on a regular basis IMHO.
Using the lookup function isn't safe against helm tooling as already mentioned somewhere in this PR.

People who are using

  • Terraform template datasource
  • Argo CD
  • Kustomize with helm support
  • probably Helmfile (cannot verify my statement here as I don't use this)
  • probably Tanka (cannot verify my statement here as I don't use this)

uses the helm chart only for templating. And template cannot use lookup

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The second point is moot as this is a Helm chart and optimised for idiomatic Helm use, alternative usage patterns can use this if it works but if it doesn't there are alternatives. Also this should behave exactly the same as if lookup wasn't used if called by helm template.

For the first point have you tested the behaviour of changing the certificate during standard usage of MS for HPAs?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not test the upgrade procedure. But it should be fine:

Do you need some test results here? :-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need to test that MS is reloading the cert when it changes and then what the behaviour during the cert change is. If there is no interruption then I think we could not lookup the cert but add a comment in the code to say it could be done.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay I now learned that Secrets/ConfigMaps are not immediately updated via kubelet. When there are no events happening on the Pod, it can take up to 1 minute.

Screen.Recording.2023-07-20.at.17.17.25_compressed.mov

I found the technical background inside a Kyverno policy 😄 :

  Although ConfigMaps and Secrets mounted as volumes to a Pod, when the contents change,
  will eventually propagate to the Pods mounting them, this process may take between 60-90 seconds.
  In order to reduce that time, a modification made to downstream Pods will cause the changes
  to take effect almost instantly. This policy watches for changes to ConfigMaps which have been
  marked for this quick reloading process which contain the label `kyverno.io/watch=true` and
  will write an annotation to any Pods which mount them as volumes causing a fast refresh in their
  contents. See the related policy entitled "Refresh Environment Variables in Pods" for a similar
  reloading process when ConfigMaps and Secrets are consumed as environment variables instead.
  Use of this policy may require providing the Kyverno ServiceAccount with permission
  to update Pods.  

Source: https://kyverno.io/policies/other/refresh-volumes-in-pods/refresh-volumes-in-pods/

Will investigate what I can do here..

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been my experience with optimising Fluent Bit for hot-reload.

If the Helm pattern used lookup it could be documented as interruption safe which run directly from Helm, if interruptions are an issue and you're using an alternative apply method then use Cert Manager (or an external secret).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option would be to use an init container or a helm job hook and something like ingress-nginx from the kubernetes org:
https://github.com/kubernetes/ingress-nginx/tree/main/images/kube-webhook-certgen
Seems to be a fork of https://github.com/jet/kube-webhook-certgen

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like overengineering to me. For the Helm implementation we should use lookup and document why you might not want to do so.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. Lookup is already implemented 👍

charts/metrics-server/templates/certificate.yaml Outdated Show resolved Hide resolved
@mkilchhofer mkilchhofer force-pushed the feature/helm-chart_hardening branch 3 times, most recently from 2e903c2 to 93f1c41 Compare July 24, 2023 21:36
@dashpole
Copy link

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jul 27, 2023
@stevehipwell
Copy link
Contributor

@mkilchhofer could you please rebase and then I'll review again?

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2023
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 28, 2023
@mkilchhofer
Copy link
Author

@stevehipwell rebase done

# Use helm lookup function to reuse Secret created in previous helm install
lookup: true
# Cert validity duration in days
certDurationDays: 90
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this expiry should probably be longer by default and we should provide a cert replacement mechanism. This could be automated based on an expiry field in the secret or based on a generation chart input to rotate the certificate. Either way the Helm TLS pattern should probably be called out as not suitable for production without additional engineering overhead.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 28, 2023
@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stevehipwell
Copy link
Contributor

@mkilchhofer could you rebase this PR?

@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2024
@mkilchhofer
Copy link
Author

Will try to rebase the PR as soon as I have time

@yangjunmyfm192085
Copy link
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 29, 2024
@k8s-triage-robot
Copy link

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FR] Allow the Chart to create extra manifest
7 participants