
checkPodCount ends preemptively when 0 pods remain after pod killing #331

Open
paigerube14 opened this issue Jun 25, 2021 · 3 comments · May be fixed by #332

@paigerube14
Contributor

After killing the only pod running in a given namespace, checkPodCount ends incorrectly before a replacement pod comes back and is running again.
I would expect checkPodCount to keep retrying for the entire duration of the timeout before passing or failing.

Sample scenario YAML:

config:
  runStrategy:
    runs: 1
    maxSecondsBetweenRuns: 30
    minSecondsBetweenRuns: 1
scenarios:
  - name: "delete etcd pods"
    steps:
    - podAction:
        matches:
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        filters:
          - randomSample:
              size: 1
        actions:
          - kill:
              probability: 1
              force: true
    - podAction:
        matches:
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        retries:
          retriesTimeout:
            timeout: 180
        actions:
          - checkPodCount:
              count: 1

Output:

2021-06-25 19:16:09 INFO __main__ No cloud driver - some functionality disabled
2021-06-25 19:16:09 INFO __main__ Using stdout metrics collector
2021-06-25 19:16:09 INFO __main__ NOT starting the UI server
2021-06-25 19:16:09 INFO __main__ STARTING AUTONOMOUS MODE
2021-06-25 19:16:12 INFO scenario.delete etcd pod Starting scenario 'delete etcd pods' (2 steps)
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matching 'labels' {'labels': {'namespace': 'etcd', 'selector': 'k8s-app=etcd'}}
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matched 1 pods for selector k8s-app=etcd in namespace etcd
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Initial set length: 1
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Filtered set length: 1
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Pod killed: [pod #0 name=etcd-master-00.qe-pr-sno2.qe.devcluster.openshift.com namespace=etcd containers=4 state=Running labels:app=etcd,etcd=true,k8s-app=etcd,revision=2 annotations:kubernetes.io/config.hash=*,kubernetes.io/config.seen=2021-06-25T14:30:12.819685290Z,kubernetes.io/config.source=file,target.workload.openshift.io/management={"effect": "PreferredDuringScheduling"}]
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matching 'labels' {'labels': {'namespace': 'etcd', 'selector': 'k8s-app=etcd'}}
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Matched 0 pods for selector k8s-app=etcd in namespace etcd
2021-06-25 19:16:12 INFO action_nodes_pods.delete etcd pod Initial set length: 0
2021-06-25 19:16:12 INFO scenario.delete etcd pod Scenario finished
2021-06-25 19:16:12 INFO policy_runner All done here!
@chaitanyaenr
Contributor

@seeker89 PTAL when you get time. Thanks.

@jcstanaway
Contributor

Per the documentation, retries specifies "An object of retry criteria to rerun set actions". Because the actions are only performed on matched pods that pass the filter criteria, and there were zero such pods at the moment matches was evaluated, the actions never run.

I'd suggest inserting a waitAction prior to the second podAction.
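
For illustration, a minimal sketch of that suggestion: a waitAction step between the two podActions. The seconds field name and the 60-second value are assumptions made for the sketch, not taken from this thread.

    - waitAction:
        seconds: 60  # assumed field name and value: give the replacement pod time to be scheduled
    - podAction:     # the existing second podAction is unchanged
        matches:
          - labels:
              namespace: "etcd"
              selector: "k8s-app=etcd"
        retries:
          retriesTimeout:
            timeout: 180
        actions:
          - checkPodCount:
              count: 1

The wait gives the replacement pod a head start before matches is re-evaluated, though it requires guessing how long recovery takes, as noted in the reply below.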

@paigerube14 linked a pull request Jun 30, 2021 that will close this issue
@paigerube14
Contributor Author

In this case the waitAction is not very helpful, because I would have to guess when the pod comes back, which is exactly what the retries in the podAction are meant to handle. The retries in the podAction should be used to verify the number of pods that exist; if 0 pods exist at the current time, it should still keep retrying until the time limit or retry count is reached before failing.
