Canary is restarted to "Starting canary analysis for podinfo.test" on Confirm Traffic Increase fail #1248

Open
andylibrian opened this issue Aug 3, 2022 · 3 comments · May be fixed by #1252
Assignees
Labels
kind/bug Something isn't working

Comments

@andylibrian
Contributor

Describe the bug

Given

    webhooks:
      - name: confirm-traffic
        url: http://flagger-loadtester.test/notfound  # notice /notfound
        timeout: 5s
        type: confirm-traffic-increase

The canary is always restarted back to "Starting canary analysis":

  Normal   Synced  7m59s (x4 over 8m29s)  flagger  Starting canary analysis for podinfo.test
  Warning  Synced  7m59s (x4 over 8m29s)  flagger  Halt podinfo.test advancement waiting for traffic increase approval

Thus the pre-rollout webhook is executed repeatedly.

To Reproduce

  1. Follow the installation steps in https://docs.flagger.app/tutorials/gloo-progressive-delivery
  2. Use this canary.yaml:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: test
spec:
  provider: gloo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898
    targetPort: 9898
  analysis:
    interval: 10s
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    webhooks:
      - name: automata
        url: http://flagger-loadtester.test/notfound
        timeout: 5s
        type: confirm-traffic-increase

Expected behavior

As the docs say:

confirm-traffic-increase hooks are executed right before the weight on the canary is increased. The canary advancement is paused until this hook returns HTTP 200.

I would expect the canary to be paused without being restarted to pre-rollout again.
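
For reference, the gating contract described in the docs is simple: before each weight increase Flagger calls the webhook and only proceeds on a 2xx response; anything else should pause advancement. A minimal sketch of such a gate endpoint (hypothetical handler and path, not the flagger-loadtester implementation) could look like this:

package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// approved is a hypothetical switch an operator could flip to allow the next
// traffic increase; any non-2xx response is supposed to keep the canary paused.
var approved atomic.Bool

func confirmTrafficIncrease(w http.ResponseWriter, _ *http.Request) {
	if approved.Load() {
		// 2xx: Flagger proceeds with the next weight increase.
		w.WriteHeader(http.StatusOK)
		return
	}
	// Non-2xx: advancement should halt, not restart the analysis.
	w.WriteHeader(http.StatusForbidden)
}

func main() {
	http.HandleFunc("/gate/confirm-traffic", confirmTrafficIncrease)
	log.Fatal(http.ListenAndServe(":8080", nil))
}

With /notfound in the config above, the hook always gets a 404, so the gate never approves; the bug is that this sends the canary back to the start instead of keeping it paused.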

Additional context

  • Flagger version: 1.22.1
  • Kubernetes version: 1.24.0
  • Service Mesh provider:
  • Ingress provider: Gloo

Notes

If this is a bug, I would be happy to contribute a fix. I think the problem here is the following.

On failure, runConfirmTrafficIncreaseHooks returns while the phase remains Progressing:

if promote := c.runConfirmTrafficIncreaseHooks(cd); !promote {
	return
}

Then on the next run, this condition is satisfied (canaryWeight: 0, status.Iterations: 0):

if canaryWeight == 0 && cd.Status.Iterations == 0 &&
	!(cd.GetAnalysis().Mirror && mirrored) {
	c.recordEventInfof(cd, "Starting canary analysis for %s.%s", cd.Spec.TargetRef.Name, cd.Namespace)

A possible solution could be to introduce a dedicated phase, WaitingTrafficIncrease, so that this phase can be excluded from the restart condition above, similar to CanaryPhaseWaitingPromotion:

	CanaryPhaseWaitingTrafficIncrease CanaryPhase = "WaitingTrafficIncrease"
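
To illustrate the idea, here is a small, self-contained sketch (simplified names, not the actual Flagger scheduler code): a failed confirm-traffic-increase hook would move the canary into the dedicated phase, and the restart branch would skip that phase on the next tick even though the canary weight and iteration count are still zero.

package main

import "fmt"

type CanaryPhase string

const (
	CanaryPhaseProgressing            CanaryPhase = "Progressing"
	CanaryPhaseWaitingTrafficIncrease CanaryPhase = "WaitingTrafficIncrease"
)

type canaryStatus struct {
	Phase      CanaryPhase
	Iterations int
}

// advance is a simplified stand-in for one scheduler tick.
func advance(status *canaryStatus, canaryWeight int, hookApproved bool) {
	// With the dedicated phase excluded here, a halted canary no longer
	// looks like a fresh run (weight 0, iterations 0) on the next tick.
	if status.Phase != CanaryPhaseWaitingTrafficIncrease &&
		canaryWeight == 0 && status.Iterations == 0 {
		fmt.Println("Starting canary analysis")
	}
	if !hookApproved {
		status.Phase = CanaryPhaseWaitingTrafficIncrease
		fmt.Println("Halt advancement waiting for traffic increase approval")
		return
	}
	status.Phase = CanaryPhaseProgressing
	fmt.Println("Advancing canary weight")
}

func main() {
	s := &canaryStatus{Phase: CanaryPhaseProgressing}
	advance(s, 0, false) // hook fails: halt and enter WaitingTrafficIncrease
	advance(s, 0, false) // next tick: the analysis is NOT restarted
	advance(s, 0, true)  // hook passes: advancement resumes
}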
@aryan9600 added the kind/bug (Something isn't working) label on Aug 3, 2022
@aryan9600
Member

@andylibrian thanks for opening this issue and showing interest in contributing a fix; I'm assigning it to you. Feel free to ask for any help/suggestions if needed, looking forward to your PR :)

@andylibrian
Contributor Author

@aryan9600 Sure. As a first step, I would like to get feedback from the maintainers on whether what I described is a bug. Is my description under "Expected behavior" correct/agreed?

If so, is there any comment on my proposed solution? The one in the "Notes" section.

@aryan9600
Member

Yes, the expected behavior is for the analysis to halt until the hooks return an HTTP 2xx response.
Yes, it seems like having a dedicated phase for this is the way to go.

andylibrian added a commit to andylibrian/flagger that referenced this issue Aug 5, 2022
This is to fix fluxcd#1248 where canary is restarted to
"Starting canary analysis" on confirm traffic increase failure.

Signed-off-by: Andy Librian <andy.librian@gmail.com>
@andylibrian andylibrian linked a pull request Aug 5, 2022 that will close this issue
andylibrian added further commits to andylibrian/flagger referencing this issue on Aug 5, Aug 8, and Aug 12, 2022, with the same message as above.