This issue was moved to a discussion.

Workflow backoff #178

Closed
rommeA opened this issue May 13, 2024 · 6 comments

Comments

@rommeA

rommeA commented May 13, 2024

Thank you for this package, I've been using it for a while now.
In one of my Workflows I constantly get this error in jobs (don't know why yet):

[image: screenshot of the error]

I saw that you implemented backoff for workflows. Did you decide to roll back this feature? Thanks!

@rmcdaniel
Contributor

rmcdaniel commented May 13, 2024

@rommeA I did not roll back that feature. The problem is with the database migration that creates the jobs table.

php artisan queue:table

It seems that this table uses a signed smallint for the attempts column, which limits the maximum value to 32767, so 32768 is too big.

First I would ask that you check if 32768 attempts seems like too many. If you think that an activity/workflow should retry more than 32767 times then you can increase the size of that column.
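If widening the column does turn out to be the right fix, a migration along these lines is one way to do it. This is a minimal sketch, not something taken from the package: it assumes the default `jobs` table with its `attempts` column, and that your Laravel version supports `->change()` on this column type (older versions may need the `doctrine/dbal` package for that).

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        Schema::table('jobs', function (Blueprint $table) {
            // Assumption: the column is named "attempts", as in the default jobs table.
            // A 32-bit unsigned integer leaves far more headroom than a smallint.
            $table->unsignedInteger('attempts')->change();
        });
    }

    public function down(): void
    {
        Schema::table('jobs', function (Blueprint $table) {
            // Restores the type used by Laravel's stock jobs migration;
            // adjust this if your original migration declared something else.
            $table->unsignedTinyInteger('attempts')->change();
        });
    }
};
```

Note that rolling back will fail if any row already holds a value too large for the narrower type.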

I hope this has answered your question. If not please feel free to ask again.

@rommeA
Author

rommeA commented May 13, 2024

@rmcdaniel yeah, I understand that the problem is in the database. But 32768 attempts really is too many for a workflow (not an activity). I really cannot find the reason why only one specific workflow retries so many times and I struggle to reproduce this bug (if it really is a bug).

The workflow looks like this:
[image: screenshot of the workflow code]

When this "stuck" job occurs in the jobs table, the workflow is waiting for acceptance (line 86):

yield WorkflowStub::await(fn() => $this->accepted);

To fix a stuck job I manually change max_retries value for the job in the database, then the job fails, I run

php artisan cache:clear

and the job completes successfully.

I'll give my feedback if I manage to reproduce it locally.

P.S.: maybe it has something to do with the cache. My project implements custom replication: I replicate workflows from another instance of my app, and the models are created silently. I have many other workflows in my project, and this problem with too many attempts occurs only for the replicated (exchanged) workflow class.
Are there any observers, events, or something like that on workflow creation / start?

@rmcdaniel
Contributor

Yes, there are events that get created: https://laravel-workflow.com/docs/features/events

@rmcdaniel
Contributor

@rommeA After some investigating, it seems that workflows don't have a backoff and never did. Only activities do, so the documentation is wrong. It seems that in some cases a workflow can spinwait when there are idle workers, and this can trigger the maximum count, but this is intentional. Perhaps we should suggest that the migrations be modified. But first we should confirm that the spinwait issue is what's happening.

@rmcdaniel
Contributor

Here is an example of what I mean.

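use Workflow\Activity;
use Workflow\ActivityStub;
use Workflow\Workflow;

class SimpleWorkflow extends Workflow
{
    public function execute()
    {
        // Start both activities and wait until both have finished.
        return yield ActivityStub::all([
            ActivityStub::make(ActivityOne::class),
            ActivityStub::make(ActivityTwo::class),
        ]);
    }
}

class ActivityOne extends Activity
{
    public function execute()
    {
        // Keeps one worker busy for 5 seconds.
        sleep(5);
        return 'one';
    }
}

class ActivityTwo extends Activity
{
    public function execute()
    {
        // Returns immediately, freeing its worker.
        return 'two';
    }
}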

Assume there are two workers running php artisan queue:work.

One worker will be busy with ActivityOne for 5 seconds because of the sleep(5). However, the second worker will complete ActivityTwo and then the SimpleWorkflow will spinwait on the second worker for the 5 seconds while the workflow is waiting for ActivityOne to finish. It keeps running, getting released because it's not ready and then running again in a loop.

[image: Screen Shot 2024-05-14 at 6 54 56 PM]

This is normal. It has to work like this to ensure that the workflow will eventually be notified and run to completion. However, each run increments the number of attempts, and this can eventually overflow that database column.

@lucabartoli

I would also point out that if an activity fails to save its state when returning, it keeps retrying regardless of the maximum tries, basically looping the workflow forever without any apparent failure.
This happens when the returned payload is too long for the workflow_logs column.
You can reproduce this by returning a big JSON document or a big serialized class.
Our solution is to limit the data exchanged between workflows and activities to the bare minimum. When this is not enough, we change the columns for workflows, workflow_logs and workflow_exceptions from text to long text.

This issue could be the root cause of workflows trying too many times for no apparent reason.
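The column change described above could be sketched as a migration like the one below. It is only an illustration under stated assumptions: the column names (`arguments`, `output`, `result`, `exception`) and their nullability are guesses at the package's schema and should be checked against the migrations actually published in your project. It also assumes your Laravel version supports `->change()` for these columns.

```php
<?php

use Illuminate\Database\Migrations\Migration;
use Illuminate\Database\Schema\Blueprint;
use Illuminate\Support\Facades\Schema;

return new class extends Migration
{
    public function up(): void
    {
        // Assumed column names; verify them against your own workflow migrations.
        Schema::table('workflows', function (Blueprint $table) {
            $table->longText('arguments')->nullable()->change();
            $table->longText('output')->nullable()->change();
        });

        Schema::table('workflow_logs', function (Blueprint $table) {
            // Holds the serialized activity result, which is where large payloads land.
            $table->longText('result')->nullable()->change();
        });

        Schema::table('workflow_exceptions', function (Blueprint $table) {
            $table->longText('exception')->change();
        });
    }

    public function down(): void
    {
        // Reverting to text would truncate or reject rows that already hold
        // larger payloads, so this sketch deliberately leaves down() empty.
    }
};
```

Widening the columns only raises the ceiling. Keeping the payloads small remains the safer habit.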

Hope this helps!

@laravel-workflow laravel-workflow locked and limited conversation to collaborators May 19, 2024
@rmcdaniel rmcdaniel converted this issue into discussion #180 May 19, 2024
