Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

example does not work finished_jobs_log missing #87

Open
bjartek opened this issue Sep 2, 2022 · 10 comments
Open

example does not work finished_jobs_log missing #87

bjartek opened this issue Sep 2, 2022 · 10 comments

Comments

@bjartek
Copy link

bjartek commented Sep 2, 2022

CREATE TABLE IF NOT EXISTS finished_jobs_log
(
    job_id      BIGSERIAL   NOT NULL PRIMARY KEY,
    run_at      TIMESTAMPTZ NOT NULL,
    type        TEXT        NOT NULL,
    queue       TEXT        NOT NULL
);

@bjartek
Copy link
Author

bjartek commented Sep 2, 2022

Also when I ran the default example the finisher panics but it just makes the job run endlessly, should the hookFunc return an error?

@vgarvardt
Copy link
Owner

Thx for reporting, let me try to reproduce this with the test and if I'll be able - I'll fix this issue.

should the hookFunc return an error?

it should not by design, pls check hook type description - https://pkg.go.dev/github.com/vgarvardt/gue/v4#HookFunc

@vgarvardt
Copy link
Owner

Tried to reproduce an issue with a test and I can not - I see a record in the finished jobs log table and panic is not interrupting the job flow.

Pls check #90, maybe you have an idea how to reproduce the problem.

@bjartek
Copy link
Author

bjartek commented Sep 3, 2022

One issue is that in the example code or in the ddl there is no sql for the finished logs table.

@vgarvardt
Copy link
Owner

Ah, this is just an example of hook usage, not the production-ready code.

@bjartek
Copy link
Author

bjartek commented Sep 3, 2022

Ok, well if you add the finished job table to the ddl it will work better imho.

@BillBuilt
Copy link

I just experienced a similar problem. I set up a finished_jobs_log table and tweaked it a bit. However, my INSERT query in finishedJobsLog introduced an SQL error, so the j.tx.Commit() in j.Done() never occurs and the error causes a rollback. Therefore, the job record remains in-tact, thus causing the job to re-run infinitely. As there is no way that I can find to capture this error, I was not aware it was occurring until I stepped through many lines of code using the debugger.

To reproduce:

  1. Create finished_jobs_log table:
CREATE TABLE IF NOT EXISTS gue_finished_jobs
(
   job_id      TEXT        NOT NULL PRIMARY KEY,
   job_type    TEXT        NOT NULL,
   queue       TEXT        NOT NULL,
   run_at      TIMESTAMPTZ NOT NULL,
   finished_at TIMESTAMPTZ NOT NULL
);
  1. Modify the INSERT query in finishedJobsLog like so:
_, err = j.Tx().Exec(
  ctx,
  "INSERT INTO gue_finished_jobs (job_id,job_type,queue,run_at,finished_at) VALUES ($1, $2, $3, $4, now())",
  j.ID,
  j.Type,
  j.Queue,
  j.RunAt,
)
  1. Run the example

Using j.ID for job_id will cause the error - to fix use j.ID.String()

@vgarvardt
Copy link
Owner

@BillBuilt added new hook type that should help spotting an issue like this #224. Although failure already produces an error message, so having alert on error messages in log collector is good practice in general. Another good practice that may help avoid issues like this is tests, in your case covering happy path could help avoiding the issue.

@TechGeorgii
Copy link

TechGeorgii commented Oct 20, 2023

@vgarvardt Hi, thank you for this new hook. Could you suggest how to introduce backoff in this situation?

  1. In job execution something goes wrong in SQL query.
  2. Transaction becomes "failed" thus not committed.
  3. Job falls into "failed" state and executed indefinitely, without any backoff.

That's what I want to avoid as it is definitely not good when job keeps executing with errors without any delay.

I see one of the solutions is to use some counter in my local memory to keep track if job "failed" and delay next execution. How would you recommend to do this?

@vgarvardt
Copy link
Owner

@TechGeorgii in-memory counter will work for sure, but it means that you expect the logic to fail in a non-recoverable way.

I would go towards more general approach - make an SLI from produced error messages count and include it into Error Budget calculation - it should be drained very fast, given that failed job will get into the infinite loop. If you have smart CD process then error logs count metric should be used as a decision point for automatic rollback. Or sudden spike in the error logs count must trigger the critical alert.

The main idea is not to workaround the issue, but to prevent the issue from affecting the prod, or spot it and mitigate/hotfix as soon as possible.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants