Fixed has_release? when called in multiple windows #168

cvanderschuere · 2022-03-23T21:06:51Z

Currently has_release? does not work with long-running workflows. Once has_release? has been evaluated, its value never changes for the lifetime of the workflow.

Example

Original

def execute
  counter = 0
  loop do
    workflow.sleep(2.minutes.to_i)
    HelloActivity.execute("loop_#{counter}")
    counter += 1
  end
end

Modification

def execute
  counter = 0
  loop do
    workflow.sleep(2.minutes.to_i)
    if workflow.has_release?('testing')
      HelloActivity.execute("TESTING_#{counter}")
    else
      HelloActivity.execute("loop_#{counter}")
    end
    counter += 1
  end
end

The desired outcome of releasing this modification is that all future loop iterations call the new code, while iterations that have already run keep replaying the original version. Prior to this PR, the original code is always executed.

With this change, here are the possible scenarios:

has_release? seen for the first time when not replaying: mark release
has_release? seen for the first time while replaying: set to false locally
has_release? seen for the first time while replaying and then again while not replaying: mark release
has_release? seen for the first time when not replaying and then again while not replaying: use previous value / don't mark release
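The four scenarios above can be sketched as a small state machine. This is a simplified illustration, not the code from this PR: `ReleaseTracker`, its `markers` list, and the `replaying:` keyword are hypothetical names, and reading markers back from the event history during replay is left out.

```ruby
# Hypothetical model of the scenario list above; not the gem's implementation.
class ReleaseTracker
  attr_reader :markers

  def initialize
    @values = {}   # release name => value decided so far in this run
    @markers = []  # release names that were "marked" (stand-in for history markers)
  end

  def has_release?(name, replaying:)
    if replaying
      # First seen while replaying: false locally, never write a marker.
      @values[name] = false unless @values.key?(name)
    elsif !@values[name]
      # Unseen, or false from an earlier replayed call: mark the release now.
      @markers << name
      @values[name] = true
    end
    # Already marked while not replaying: reuse the value, no second marker.
    @values[name]
  end
end
```

In this model a release that was false during replay flips to true on the first non-replayed call, which is the behaviour the loop example relies on.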

antstorm

@cvanderschuere thanks a lot for the PR 👍

Can you please help me understand your thought process here?

Originally I've implemented this in a way where you can expect a has_release?('version') to return the same value within a single workflow instance no matter at what point the new code got deployed. The reasoning here is simple — you can add a change with the same condition to different parts of the workflow code and expect it to either always be true or false, but never change the value along the way.

I can definitely see how your example is also valid, where you want the new release to affect the loops that haven't happened yet. But this breaks the previous assumption.

cvanderschuere · 2022-03-24T20:04:20Z

The main use case that is not well supported right now is long-running workflows (specifically those with loops). If you have a workflow that runs for days or even weeks, there is currently no good way to make safe / deterministic updates, and that's a critical feature for us.

As far as tools to make breaking changes, there are currently two approaches: version the entire workflow (i.e. WorkflowV1, WorkflowV2, etc) or use has_release? to be more specific with the versioning.

For workflows that are short (i.e. minutes), the current approach works just fine:

1. (assume multiple workflow executions running)
2. Wrap new codepath in has_release? and deploy to worker(s)
   a. Currently running workflows complete as they would have before
   b. New workflows use the new code
3. After the max workflow duration (minutes in this example), all workflows are running the new code

The problem arises when the max workflow duration is longer than the time we can tolerate waiting for a change to take effect. In our case, workflows can last for weeks, which is longer than we can tolerate.

I see two paths forward:
A. We change the assumption for has_release? to be consistent based on time in event history rather than for the entire workflow execution -- model has_release? like a SideEffect rather than like a global variable
B. Add a new way to give more granular control over workflow execution, potentially copying from the GetVersion example in the Go SDK

I would favor option A, but would be curious to hear use-cases where that would be a significant breaking change.

antstorm · 2022-04-10T22:53:12Z

@cvanderschuere thanks for explaining this 🙌

What you're describing does make sense and it might be a better default behaviour. However it is a breaking change if there are people right now who depend on the old behaviour.

There might be a 3rd option which is a compromise — either add a new method or add a flag (e.g. has_release?('my_release', sticky: false)) that alters the behaviour of the has_release? for your use-case. At some point we'll have to go for the GetVersion-like API to give people more flexibility, but we can probably delay for a bit longer.
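To make the compromise concrete, here is a minimal sketch of how a `sticky` flag could select between the two behaviours. Everything in it is an assumption for illustration: `ReleaseChecker`, the `replaying:` keyword, and the `markers` list are hypothetical, and only the `sticky: false` flag name comes from the suggestion above.

```ruby
# Hypothetical sketch of a sticky/non-sticky switch; not an existing API.
class ReleaseChecker
  attr_reader :markers

  def initialize
    @values = {}   # release name => last decided value
    @markers = []  # release names that were marked
  end

  def has_release?(name, replaying:, sticky: true)
    # Once a release is true it stays true in both modes.
    return true if @values[name]
    # Sticky (current behaviour): the first answer wins, even if it was false.
    return @values[name] if sticky && @values.key?(name)

    # Otherwise decide now: true (and marked) only when not replaying.
    @markers << name unless replaying
    @values[name] = !replaying
  end
end
```

Keeping `sticky: true` as the default would leave existing callers unaffected, while a long-running loop could opt in with `sticky: false`.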

What do you think?

cvanderschuere added 2 commits March 23, 2022 13:55

Fixed has_release? to work in long loops

1d2a6bc

typo

ff9f8c8

antstorm reviewed Mar 24, 2022
