Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow default configuration override with annotations #44

Open
jdharmon opened this issue Feb 13, 2020 · 14 comments · May be fixed by #64
Open

Allow default configuration override with annotations #44

jdharmon opened this issue Feb 13, 2020 · 14 comments · May be fixed by #64

Comments

@jdharmon
Copy link
Contributor

It would be nice to be able to override default pod-reaper settings with annotations.

For example if MAX_DURATION=1d, but a pod had the annotation pod-reaper/maxduration: 12h, then that pod would be reaped in 12 hours.

@jdharmon
Copy link
Contributor Author

@JordanSussman Would you propose that the pod defined label(s) should always override the pod-reaper defined behavior? Feels like a minimum the default behavior should be to ignore pod defined labels unless a configuration option has been defined on pod-reaper. Suppose you could also go as granular as to define certain "whitelist" of options that pods are allowed to override.

For simplicity, I think annotations should always override the default configuration if present.

@brianberzins
Copy link
Collaborator

A few thoughts I had on this before I dive into implementation.

There are two kinds of environment variables used to configure pod-reaper. One set is specific to the reaper itself (how often it looks at pods, what namespaces it looks at, etc). The other set defines the rules -- these are the things where we can have pods specify their own overrides.

Overriding rules on a per pod basis means that those rules wouldn't know anything about the first set of rules (this is particularly important for things like random chance -- where checking more/less often matters)

There's another kind of interesting case. Suppose we have a reaper checking every 1 hour for anything that's been running in a specific namespace for > 3 days -- but a pod in that namespace using an annotation to specify an unrelated rule -- like container status is in ImagePullBackoff. That's not really an override, but a declaration of another rule. Would we want this reaper to honor that "new" rule?

Right now, I'm thinking of the following:

  1. add a configurable environment variable/rule to honor rules specified in pod annotations
  2. make sure that rules can load values from pod annotations

Now for some fun stuff!

What do we want the behaviors to be in the following situations?

  • A rule is defined on the reaper, and the same rule on a specific pods
    Assuming that we're honoring pod annotations, this feels like like the pod should override
  • A rule is defined on the reaper that is not defined on the pod
    Default behavior of the reaper
  • a rule is defined on the pod that is not defined in the reaper
    This is where things get dicey. Should a reaper honor a rule specification on a pod when it doesn't have a configuration for that rule? To potentially complicate it even more -- if it should honor them, then do we and the pod specific rules with the other rules defined by the reaper?

So let's say we have pod-reaper/maxduration: 12h annotation on a pod, but the reaper itself isn't configured with any value for MAX_DURATION -- should the reaper honor the annotation? If the answer is "no", then I think this much more simple, but also less powerful.

Additional thought: this really brings to a light a problem with the chaos chance rule because if it's specified as a pod annotation, then there's a separation between where the chance is defined and how often that chance is applied (the reaper's scheduling) which is just begging for trouble. @andrewwipf and I actually did some white-boarding a couple solutions to that problem (that I think I'd like to implement independently of this feature).

@jdharmon
Copy link
Contributor Author

jdharmon commented Feb 14, 2020

Should a reaper honor a rule specification on a pod when it doesn't have a configuration for that rule?

My initial idea was “no” to keep things simple. However that will probably lead to confusion.

I think the following pseudo code will handle all three cases:

rules := loadAllRules()
foreach rule in rules {
    pods := getAllPodsInNs()
    foreach pod in pods{
        setting := getEnv(“DEFAULT_SETTING”)
        annotation = getAnnotation(“key”)
        if annotation != nil
            setting = annotation.value
        if setting = nil
            continue 
         
        shouldDelete := evaluateRule(setting)
        if shouldDelete
            delete(pod)
    }
}

@brianberzins
Copy link
Collaborator

I agree with that overall approach!
Weirdly enough, it's more refactoring than I would have originally thought -- but I'm game!
I'll work on a PR for this (hopefully over the weekend here)!

@brianberzins
Copy link
Collaborator

Worked on this a bit. Doing this in a few steps (in part because I'm a lunatic that likes refactoring). I'll be updating a rule-refactor branch as I go.

Originally, with rules only configured at the reaper level, only the rules that were configured were "loaded". This separated the configuration of the reaper (which happened exactly once at startup) from the actual checking of the rules. If we want to enable overrides from things configured outside the reaper itself, rule configuration needs to be more dynamic. To accomplish that, I want to remove the loading phase of this altogether, and instead always be checking each rule -- letting the rule determine if it should be applied or not.

This leads to a ternary situation. There are now three cases for a rule evaluation:

  1. reap the pod
  2. spare the pod
  3. ignore this rule (it's not configured)

We avoided case 3 originally because rules that weren't configured were never loaded.
Here's the priority chain to keep everything consistent:

  1. if any rule saying "spare" --> the pod will not be reaped
  2. else if there are any number rules saying "reap" --> the pod will be reaped
  3. if NO rules say "reap" the pod is spared

So far, I think this is going well. Just taking me a longer than I wanted. I really convinced myself that this could be a fun feature when I thought the use case where rather than overriding settings, instead there was a generic reaper running with no-environment variable defined rules. Instead it's an entirely "opt-in with your own per pod rules".

One more minor side note: chaos chance does NOT work well with a pod level override. It really needs a time factor to be configured WITH the chance in order to make sense. I have a couple ideas about how to do that, but it'll be separate from this work!

@brianberzins
Copy link
Collaborator

update: have a working reaper where rules are evaluated on the ternary logic of {reap, spare, ignore} as well as having implemented one rule in this way. From here the next step is really simple: tweaking the rules to match the changed interface.

After all this, adding pod-level overrides will be straightforward!

@brianberzins
Copy link
Collaborator

Okay. Feeling pretty good about refactoring how rules work in order to help accommodate this request. For right now, the rework should have functional feature parity (I have changed log messages a bit).

There are still a few things I want to test out with this, but it should be a total drop-in-place of an older version.
Docker container: brianberzins/pod-reaper:alpha-02212020

From here, the request to add pod-level overrides should be MUCH easier than before.
I'm also a pretty big fan of how the rework simplified what a "rule" is away from being a structure and instead being type rule func(v1.Pod) (result, string)

@brianberzins
Copy link
Collaborator

#45 has a work-in-progress of a rule refactor, but I feel like I'm hitting a point where "it doesn't feel right" again.

For a rules to have multiple sources of values leaves me with a lot of questions about what pod-reaper looks like it should be doing from the multiple perspectives. So I'm going to try to write out a super specific situation and see what people think. Using the example above!


Suppose we have the following:

Pod reaper is configured with:

  • MAX_DURATION=1d

A specific pod has the following annotations:

  • pod-reaper/max_duration: 12h
  • pod-reaper/pod_status: Evicted

What do we think it should do when it encounters a pod that has been, and is still running (not Evicted) at 12h1m?

If I only saw the pod-reaper configuration, I would describe the reaper as "kills pods that have been running for more than 1 day" so I would say "don't kill the pod".

If I only saw the pod annotations, I would describe the kill conditions as "the pod is at least 12 hours old AND has been evicted" so I would say "don't kill the pod"

If we honor overrides, but only for rules that have been configured (or if we opted to do something like ENABLE_RULE=max_duration and something like HONOR_POD_ANNOTATION_OVERRIDES=true then I would say "only max duration is enabled, ignore the pod_status annotation, and kill the pod". You would really need to see both sides (reaper and pod configuration) for this to make sense.


Maybe that "AND" there is what's really causing my confusion here.
If you want to "OR" two rules, you can just make two reapers (one with each rule) and then if either one says reap, the pod is killed. Pod annotations don't really make sense in this context, as they'd apply to any/all reapers that are looking for them.

Maybe the solution here is to have two specific configuration options here:

  1. explicitly enabling rules, and treating pod annotations as overrides to those specific values
  2. ignore all local rules, and honor the set of annotations on a pod as a collection of rules

At this point it sounds like I'm overthinking things... but what I've done for #45 doesn't really make sense for the first of these (which I think is more in line with the topic of this issue). It also doesn't really make sense of the second of these -- but for different reasons. As much as I like the idea of the second case, it really feels like a super specific use case that I shouldn't consider right now.


Going to keep thinking about this.

@jdharmon
Copy link
Contributor Author

jdharmon commented Mar 2, 2020

In general, I like the direction of the refactor. The rules returning a named result instead of a boolean, makes things much clearer what the expected behavior is without having to jump around in the code.

I liked

  1. if any rule saying "spare" --> the pod will not be reaped
  2. else if there are any number rules saying "reap" --> the pod will be reaped
  3. if NO rules say "reap" the pod is spared

I would add
4. Pod annotations will enable that rule on a pod and override global configuration (if any).


For the Duration example above, since there is an annotation on the pod, I would expect it to be evicted at 12h1m

Consider the reverse scenario:

  • MAX_DURATION=12h

Pod annotations:

  • pod-reaper/max_duration: 1d

In this case, I would expect the pod to be killed in 1 day, even though the default is 12h.


I do think explicitly enabling rules would probably lead to the least amount of confusion, but would break backward compatibility with the current configuration. I would release it as v3.0 to indicate that there are breaking changes, and clearly document it.

@brianberzins
Copy link
Collaborator

Alright. Way more thought than I'd have liked later, here's my plan.

  1. move forward with Rule refactor #45 (after any cleanup people would like to see done)
  2. add in the ability to explicitly define which rules are used (if unset, "enabling all" wouldn't break backwards comparability -- all for discussion there)
  3. implement pod level overrides as simply as I can think of.

For implementing pod level overrides.

  • add a flag to honor pod level annotations
  • annotations act as overrides (or initial configuration) for any/all enabled rule
  • add in some level of abstraction for getting the pod-reaper configuration key (env variable), override annotation, and name (for use when explicitly enabling rules)

The goal of the last piece is to reduce code duplication and make it easier to work. I'm not exactly sure how that'll look right now, but they idea would be start with a list of all rules, filter out anything not enabled, then provide common methods for getting their configuration value.

Something like func(r *rule) getConfigValue(p v1.Pod) (configured cool, value string)
which can be generalized for all rules.

@jdharmon
Copy link
Contributor Author

@brianberzins Have you made any more progress on annotations? I just ran into a situation where they would be very useful. I'm willing to lend a hand getting this implemented.

@brianberzins
Copy link
Collaborator

I need dive back into this!
It's looking like I was moving towards a very different setup with #45 that would make this a lot easier, but there have been a couple of things since that I'd need to make sure still work as expected.

@jdharmon jdharmon linked a pull request Jan 20, 2021 that will close this issue
@jdharmon
Copy link
Contributor Author

I had an immediate need, so I went ahead and implemented this in #64. I'm open to changes.

@brianberzins
Copy link
Collaborator

brianberzins commented Jan 20, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants