Shell Jobs randomly getting 'stuck' #1483

Open
giggsey opened this issue Feb 22, 2024 · 3 comments
Comments

giggsey (Contributor) commented Feb 22, 2024

Describe the bug

Some shell jobs randomly get stuck in the 'busy' state and don't clear until we restart the dkron agent.

The job processes launched by the dkron agent finish as expected and exit cleanly, but dkron doesn't notice: the job stays in the 'busy' state, and further executions don't happen (we set concurrency to forbid).

We recently upgraded from 3.2.0 to 3.2.7, and since then this problem has been happening randomly for some of our jobs.
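With concurrency set to forbid, a new run is skipped whenever the previous execution is still reported as running, so once the 'busy' state fails to clear the job never runs again until the agent is restarted. A rough sketch of that guard behaviour (illustrative only, not dkron's actual code):

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// busy mimics a per-job "running" flag. With "concurrency": "forbid",
// a tick is skipped while the flag is still set; if the executor never
// reports the finished run, the flag never clears and every later tick
// is skipped. Illustrative only, not dkron's implementation.
var busy int32

func tick(run func()) {
	if !atomic.CompareAndSwapInt32(&busy, 0, 1) {
		fmt.Println("previous execution still busy, skipping this run")
		return
	}
	go func() {
		defer atomic.StoreInt32(&busy, 0) // never reached if the run result is lost
		run()
	}()
}

func main() {
	tick(func() { time.Sleep(time.Second) }) // first run starts
	tick(func() {})                          // skipped: first run still busy
	time.Sleep(2 * time.Second)
}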

Job Description

{
    "id": "cloud_sync_user-data_private_app-server",
    "name": "cloud_sync_user-data_private_app-server",
    "displayname": "",
    "timezone": "",
    "schedule": "@every 5m",
    "owner": "cloud_sync",
    "owner_email": "",
    "success_count": 134712,
    "error_count": 506,
    "last_success": "2024-02-22T09:50:48.739347143Z",
    "last_error": "2024-02-22T09:40:03.273059208Z",
    "disabled": false,
    "tags": {
        "my_tag": "private_app-server:1"
    },
    "metadata": null,
    "retries": 0,
    "dependent_jobs": null,
    "parent_job": "",
    "processors": {},
    "concurrency": "forbid",
    "executor": "shell",
    "executor_config": {
        "command": "/usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync_user-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync user-data -vvv\""
    },
    "status": "success",
    "next": "2024-02-22T09:58:52Z",
    "ephemeral": false,
    "expires_at": null
}

Logs

The job took around 15 seconds to execute (I can see this from the logs that my dkron_run script generates). Agent logs for the stuck run:

2024-02-22T01:48:52.020093+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=info msg="grpc_agent: Starting job" job=cloud_sync_user-data_private_app-server node=HR-P-APP-2.production.my.network
2024-02-22T01:48:52.020272+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=debug msg="grpc_agent: calling executor plugin" node=HR-P-APP-2.production.my.network plugin=shell
2024-02-22T01:48:52.020993+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:52Z" level=debug msg="2024/02/22 01:48:52 shell: going to run /usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync_user-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync user-data -vvv\"" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell

2024-02-22T01:49:07.300497+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:49:07Z" level=debug msg="2024/02/22 01:49:07 NewProcess err: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:49:07.300664+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:49:07Z" level=debug msg="2024/02/22 01:49:07 Error getting pid statistics: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell

Logs for a good job that ran at around the same time. This other job took less than a second to execute.

2024-02-22T01:48:53.019439+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=info msg="grpc_agent: Starting job" job=cloud_sync-alert-data_private_app-server node=HR-P-APP-2.production.my.network
2024-02-22T01:48:53.020427+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=debug msg="grpc_agent: calling executor plugin" node=HR-P-APP-2.production.my.network plugin=shell
2024-02-22T01:48:53.021941+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:53Z" level=debug msg="2024/02/22 01:48:53 shell: going to run /usr/local/bin/dkron_run 0002 \"my-daemons\" \"/var/log/dkron-jobs/cloud_sync-alert-data_private_app-server.log\" \"PATH=/sbin:/bin:/usr/sbin:/usr/bin /path/cloud_sync/bin/sync alert-data -vvv\"" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043127+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 NewProcess err: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043387+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 Error getting pid statistics: process does not exist" node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:56.043548+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:56Z" level=debug msg="2024/02/22 01:48:56 shell: Command output " node=HR-P-APP-2.production.my.network subsystem_name=plugins.dkron-executor-shell
2024-02-22T01:48:57.591651+00:00 HR-P-APP-2 dkron[1119]: time="2024-02-22T01:48:57Z" level=info msg="2024/02/22 01:48:57 [DEBUG] memberlist: Initiating push/pull sync with: HR-P-SQL-3.production.my.network 192.168.181.23:8946"
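The "NewProcess err: process does not exist" / "Error getting pid statistics" lines appear in both the stuck run and the healthy run, so they look like a benign race between the executor's stats polling and the child exiting rather than the failure itself; the real difference is that the stuck run never logs a "shell: Command output" line afterwards. Assuming the shell executor polls process stats with gopsutil (whose ErrorProcessNotRunning is exactly "process does not exist"), that message is trivially reproduced once the PID is gone. A minimal sketch, not dkron's code:

package main

import (
	"log"
	"os/exec"

	"github.com/shirou/gopsutil/v3/process"
)

func main() {
	// Start a short-lived child and let it exit before we inspect it,
	// mimicking a job that finishes between two stats polls.
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}
	pid := int32(cmd.Process.Pid)
	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}

	// Once the PID has been reaped, gopsutil reports "process does not exist",
	// the same message seen in both log excerpts above.
	if _, err := process.NewProcess(pid); err != nil {
		log.Printf("NewProcess err: %v", err)
	}
}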

Specifications

  • OS: AlmaLinux 9.3
  • Version: 3.2.7
@morgan-atproperties

We are seeing the same thing. We were running 3.2.6 previously, and we're going to roll back to that version.


mladoi commented Mar 13, 2024

Having the same issue with 3.2.7.
The job merely calls redis-cli publish check 1 every 5 minutes.
Roughly 1 in 1000 job executions gets stuck in the busy state forever.
Even setting a timeout in executor_config doesn't clear the busy state.
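For reference, the executor_config looks roughly like this (the timeout value here is just an example):

{
    "executor": "shell",
    "executor_config": {
        "command": "redis-cli publish check 1",
        "timeout": "60s"
    }
}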

@morgan-atproperties

Rolling back to 3.2.6 fixed it for us.
