Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle read-only filesystem gracefully #355

Open
jawnsy opened this issue Dec 30, 2022 · 1 comment
Open

Handle read-only filesystem gracefully #355

jawnsy opened this issue Dec 30, 2022 · 1 comment

Comments

@jawnsy
Copy link
Contributor

jawnsy commented Dec 30, 2022

This may not be worth fixing or this may be "as designed" behavior, but it seems that, if the filesystem is read-only or otherwise non-writable (I have not verified, but I suspect similar things would happen for permission issues or if the disk is full), logs will be collected (e.g. via Pub/Sub) and then simply dropped on the floor.

Thankfully, the collector does log errors:

E [default] Could not allocate tempfile for logs: open /tmp/532025703: read-only file system
I [default] Submitted compact activity snapshot successfully
E [default] Could not allocate tempfile for logs: open /tmp/1803624709: read-only file system
I [default] Submitted compact activity snapshot successfully

There are a few possible approaches here:

  1. The collector could sanitize files in-memory, rather than writing a temporary file
  2. The collector could avoid acknowledging the message until it is saved successfully (this is probably a good idea, anyways, since if the collector crashes, messages may not have been processed even though they were acknowledged)

It's also reasonable that the collector may be designed to be best-effort, so dropping logs (or acknowledging but not processing messages in exceptional cases like a crash) might be acceptable

@lfittl
Copy link
Member

lfittl commented Jan 20, 2023

Thanks for bringing this up - overall I would say we do not aim to make the collector work with a read-only file system at this point. For context, separate from log files being written out temporarily, there is a state file that is used for restarts/reloads - whilst we could refactor how that works for reloads specifically (i.e. keep it in memory somehow), it'd cause some extra complexity.

Regarding not acknowledging Pub/Sub messages in error cases - we could avoid doing so if the log processing fails in the early stages, but there are failure cases later in the process too (mostly related to out of order log messages), and as you note crashes can occur too.

I think re-architecting this to not acknowledge the messages until they are saved (i.e. much later in the process than currently, at least 3 seconds later) will cause a different problem, as the Pub/Sub queue may back up noticeably when there is a high volume of log messages, and I recall that can cause issues with GCP's APIs - in my understanding they want readers to acknowledge messages quickly.

All that said, I'm open to a proposal how to handle things differently (preferably with a working PR), but its not a priority to do major rework here for us at this point. We may revise this in a future when we add more auditing (as in pg_audit) focused features to pganalyze.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants