example does not work finished_jobs_log missing #87

bjartek · 2022-09-02T11:45:42Z

CREATE TABLE IF NOT EXISTS finished_jobs_log
(
    job_id      BIGSERIAL   NOT NULL PRIMARY KEY,
    run_at      TIMESTAMPTZ NOT NULL,
    type        TEXT        NOT NULL,
    queue       TEXT        NOT NULL
);

The text was updated successfully, but these errors were encountered:

bjartek · 2022-09-02T11:56:40Z

Also when I ran the default example the finisher panics but it just makes the job run endlessly, should the hookFunc return an error?

vgarvardt · 2022-09-02T18:13:58Z

Thx for reporting, let me try to reproduce this with the test and if I'll be able - I'll fix this issue.

should the hookFunc return an error?

it should not by design, pls check hook type description - https://pkg.go.dev/github.com/vgarvardt/gue/v4#HookFunc

vgarvardt · 2022-09-02T18:33:57Z

Tried to reproduce an issue with a test and I can not - I see a record in the finished jobs log table and panic is not interrupting the job flow.

Pls check #90, maybe you have an idea how to reproduce the problem.

bjartek · 2022-09-03T06:15:47Z

One issue is that in the example code or in the ddl there is no sql for the finished logs table.

vgarvardt · 2022-09-03T09:51:34Z

Ah, this is just an example of hook usage, not the production-ready code.

bjartek · 2022-09-03T10:37:54Z

Ok, well if you add the finished job table to the ddl it will work better imho.

BillBuilt · 2023-10-02T16:56:00Z

I just experienced a similar problem. I set up a finished_jobs_log table and tweaked it a bit. However, my INSERT query in finishedJobsLog introduced an SQL error, so the j.tx.Commit() in j.Done() never occurs and the error causes a rollback. Therefore, the job record remains in-tact, thus causing the job to re-run infinitely. As there is no way that I can find to capture this error, I was not aware it was occurring until I stepped through many lines of code using the debugger.

To reproduce:

Create finished_jobs_log table:

CREATE TABLE IF NOT EXISTS gue_finished_jobs
(
   job_id      TEXT        NOT NULL PRIMARY KEY,
   job_type    TEXT        NOT NULL,
   queue       TEXT        NOT NULL,
   run_at      TIMESTAMPTZ NOT NULL,
   finished_at TIMESTAMPTZ NOT NULL
);

Modify the INSERT query in finishedJobsLog like so:

_, err = j.Tx().Exec(
  ctx,
  "INSERT INTO gue_finished_jobs (job_id,job_type,queue,run_at,finished_at) VALUES ($1, $2, $3, $4, now())",
  j.ID,
  j.Type,
  j.Queue,
  j.RunAt,
)

Run the example

Using j.ID for job_id will cause the error - to fix use j.ID.String()

vgarvardt · 2023-10-02T19:26:50Z

@BillBuilt added new hook type that should help spotting an issue like this #224. Although failure already produces an error message, so having alert on error messages in log collector is good practice in general. Another good practice that may help avoid issues like this is tests, in your case covering happy path could help avoiding the issue.

TechGeorgii · 2023-10-20T05:13:52Z

@vgarvardt Hi, thank you for this new hook. Could you suggest how to introduce backoff in this situation?

In job execution something goes wrong in SQL query.
Transaction becomes "failed" thus not committed.
Job falls into "failed" state and executed indefinitely, without any backoff.

That's what I want to avoid as it is definitely not good when job keeps executing with errors without any delay.

I see one of the solutions is to use some counter in my local memory to keep track if job "failed" and delay next execution. How would you recommend to do this?

vgarvardt · 2023-10-24T06:40:52Z

@TechGeorgii in-memory counter will work for sure, but it means that you expect the logic to fail in a non-recoverable way.

I would go towards more general approach - make an SLI from produced error messages count and include it into Error Budget calculation - it should be drained very fast, given that failed job will get into the infinite loop. If you have smart CD process then error logs count metric should be used as a decision point for automatic rollback. Or sudden spike in the error logs count must trigger the critical alert.

The main idea is not to workaround the issue, but to prevent the issue from affecting the prod, or spot it and mitigate/hotfix as soon as possible.

vgarvardt mentioned this issue Sep 2, 2022

Trying to reproduce and fix hook job done panic #90

Closed

vgarvardt mentioned this issue Oct 2, 2023

feat: hooksJobUndone to let user handle critical job done failure #224

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

example does not work finished_jobs_log missing #87

example does not work finished_jobs_log missing #87

bjartek commented Sep 2, 2022

bjartek commented Sep 2, 2022

vgarvardt commented Sep 2, 2022

vgarvardt commented Sep 2, 2022

bjartek commented Sep 3, 2022

vgarvardt commented Sep 3, 2022

bjartek commented Sep 3, 2022

BillBuilt commented Oct 2, 2023

vgarvardt commented Oct 2, 2023

TechGeorgii commented Oct 20, 2023 •

edited

vgarvardt commented Oct 24, 2023

example does not work finished_jobs_log missing #87

example does not work finished_jobs_log missing #87

Comments

bjartek commented Sep 2, 2022

bjartek commented Sep 2, 2022

vgarvardt commented Sep 2, 2022

vgarvardt commented Sep 2, 2022

bjartek commented Sep 3, 2022

vgarvardt commented Sep 3, 2022

bjartek commented Sep 3, 2022

BillBuilt commented Oct 2, 2023

vgarvardt commented Oct 2, 2023

TechGeorgii commented Oct 20, 2023 • edited

vgarvardt commented Oct 24, 2023

TechGeorgii commented Oct 20, 2023 •

edited