Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add dedicated phase waiting traffic increase #1252

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

andylibrian
Copy link
Contributor

@andylibrian andylibrian commented Aug 5, 2022

This is to fix #1248 where canary is restarted to "Starting canary analysis" on confirm traffic increase failure.

Signed-off-by: Andy Librian andylibrian@gmail.com

@andylibrian andylibrian force-pushed the dedicated-phase-waiting-traffic-increase branch from 7a0d9a5 to 5ec4f87 Compare August 5, 2022 04:27
@codecov-commenter
Copy link

codecov-commenter commented Aug 5, 2022

Codecov Report

Merging #1252 (eb276db) into main (e65dfbb) will decrease coverage by 0.04%.
The diff coverage is 32.00%.

@@            Coverage Diff             @@
##             main    #1252      +/-   ##
==========================================
- Coverage   54.74%   54.69%   -0.05%     
==========================================
  Files          81       81              
  Lines        6949     6966      +17     
==========================================
+ Hits         3804     3810       +6     
- Misses       2550     2561      +11     
  Partials      595      595              
Impacted Files Coverage Δ
pkg/canary/status.go 46.83% <0.00%> (-0.91%) ⬇️
pkg/controller/scheduler_hooks.go 7.31% <0.00%> (-0.80%) ⬇️
pkg/controller/scheduler.go 46.48% <88.88%> (+0.61%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@andylibrian andylibrian force-pushed the dedicated-phase-waiting-traffic-increase branch from 5ec4f87 to 7a4edef Compare August 8, 2022 03:57
@andylibrian andylibrian changed the title [Draft] Add dedicated phase waiting traffic increase Add dedicated phase waiting traffic increase Aug 8, 2022
@andylibrian andylibrian marked this pull request as ready for review August 8, 2022 04:18
Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for opening this PR! :)
Please sign off your commit: https://stackoverflow.com/questions/1962094/what-is-the-sign-off-feature-in-git-for

pkg/apis/flagger/v1beta1/status.go Outdated Show resolved Hide resolved
pkg/controller/scheduler.go Show resolved Hide resolved
pkg/controller/scheduler_hooks.go Outdated Show resolved Hide resolved
@andylibrian
Copy link
Contributor Author

Thanks for the review so far, I signed off my first commit. Then in the review I committed the suggested change through GitHub, and it turned out it's missing the signed-off-by. I will add the missing signed-off-by message and rebase this after the review is completed because it requires force push and it would restart the review. If you think otherwise, please let me know.

@aryan9600
Copy link
Member

That's not an issue, please rebase and force push.

This is to fix fluxcd#1248 where canary is restarted to
"Starting canary analysis" on confirm traffic increase failure.

Signed-off-by: Andy Librian <andylibrian@gmail.com>
@andylibrian andylibrian force-pushed the dedicated-phase-waiting-traffic-increase branch from b2238ef to eb276db Compare August 12, 2022 03:33
@andylibrian
Copy link
Contributor Author

Hi @aryan9600, I have rebased and force push this branch as suggested. Do you have any more comment / feedback?

@aryan9600
Copy link
Member

I'm currently testing this with various Canary objects, to make sure that adding this new phase doesn't break any existing functionality. If you want, you can do some testing as well, to make sure no bugs creep in inadvertently, thanks.

Copy link
Member

@aryan9600 aryan9600 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to work okay when the confirm-traffic-increase hook returns successfully at the very start of the analysis and then returns an error later during the analysis. But it doesn't work as expected if the confirm-traffic-increase hook returns an error at the very first iteration.
In this case, we never progress the canary forward (since the hook returned an error) and set the Canary's phase to WaitingTrafficIncrease while the canary weight is still 0, so in the next iterations, we just keep running the metric checks, which return an error since no traffic has reached the canary yet.

if ok := c.runAnalysis(cd); !ok {

Now, if after a few iterations, the confirm-traffic-increase hook returns successfully, the analysis doesn't proceed as expected. This is due to the fact that instead of progressing the canary forward and consequently increasing the canary weight up from zero, we keep running the metric checks, which again return an error because no traffic reached the canary ever, leading to:

{"level":"info","ts":"2022-08-24T23:20:27.962+0530","caller":"controller/events.go:45","msg":"Halt podinfo.test advancement waiting for traffic increase approval automata","canary":"podinfo.test"}
{"level":"info","ts":"2022-08-24T23:20:42.948+0530","caller":"controller/events.go:45","msg":"Halt advancement no values found for contour metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found","canary":"podinfo.test"}
{"level":"info","ts":"2022-08-24T23:20:57.948+0530","caller":"controller/events.go:45","msg":"Halt advancement no values found for contour metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found","canary":"podinfo.test"}

Steps to repro:

  • Create this Canary:
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
spec:
  analysis:
    interval: 15s
    maxWeight: 50
    metrics:
    - interval: 1m
      name: request-success-rate
      threshold: 99
    - interval: 1m
      name: request-duration
      threshold: 500
    stepWeight: 10
    threshold: 15
    webhooks:
    - name: automata
      timeout: 5s
      type: confirm-traffic-increase
      url: http://<flagger-loadtester-url>/gate/check
    - metadata:
        cmd: hey -z 1m -q 5 -c 2 -host app.example.com http://envoy.projectcontour
        logCmdOutput: "true"
        type: cmd
      name: load-test
      timeout: 5s
      url: http://<flagger-loadtster-url>
  progressDeadlineSeconds: 60
  provider: contour
  service:
    port: 80
    targetPort: 9898
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  • Close the gate:
curl -v -X POST --data '{"type": "confirm-traffic-increase", "name": "podinfo", "namespace": "test", "phase": "Initialized"}' http://<flagger-loadtester-url>/gate/close
  • Kick off the analysis: kubectl -n test set image deploy/podinfo podinfod==ghcr.io/stefanprodan/podinfo:6.0.1
  • Wait for the first couple of iterations and then open the gate:
 curl -v -X POST --data '{"type": "confirm-traffic-increase", "name": "podinfo", "namespace": "test", "phase": "WaitingTrafficIncrease"}' http://<flagger-loadtester-url>/gate/open

Please let me know if I'm unclear and you want me to elaborate further :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Canary is restarted to "Starting canary analysis for podinfo.test" on Confirm Traffic Increase fail
3 participants