Handle read-only filesystem gracefully #355

jawnsy · 2022-12-30T01:19:27Z

This may not be worth fixing or this may be "as designed" behavior, but it seems that, if the filesystem is read-only or otherwise non-writable (I have not verified, but I suspect similar things would happen for permission issues or if the disk is full), logs will be collected (e.g. via Pub/Sub) and then simply dropped on the floor.

Thankfully, the collector does log errors:

E [default] Could not allocate tempfile for logs: open /tmp/532025703: read-only file system
I [default] Submitted compact activity snapshot successfully
E [default] Could not allocate tempfile for logs: open /tmp/1803624709: read-only file system
I [default] Submitted compact activity snapshot successfully

There are a few possible approaches here:

The collector could sanitize files in-memory, rather than writing a temporary file
The collector could avoid acknowledging the message until it is saved successfully (this is probably a good idea, anyways, since if the collector crashes, messages may not have been processed even though they were acknowledged)

collector/input/system/google_cloudsql/logs.go

Line 72 in 878802b

pubsubMsg.Ack()

It's also reasonable that the collector may be designed to be best-effort, so dropping logs (or acknowledging but not processing messages in exceptional cases like a crash) might be acceptable

The text was updated successfully, but these errors were encountered:

lfittl · 2023-01-20T07:28:26Z

Thanks for bringing this up - overall I would say we do not aim to make the collector work with a read-only file system at this point. For context, separate from log files being written out temporarily, there is a state file that is used for restarts/reloads - whilst we could refactor how that works for reloads specifically (i.e. keep it in memory somehow), it'd cause some extra complexity.

Regarding not acknowledging Pub/Sub messages in error cases - we could avoid doing so if the log processing fails in the early stages, but there are failure cases later in the process too (mostly related to out of order log messages), and as you note crashes can occur too.

I think re-architecting this to not acknowledge the messages until they are saved (i.e. much later in the process than currently, at least 3 seconds later) will cause a different problem, as the Pub/Sub queue may back up noticeably when there is a high volume of log messages, and I recall that can cause issues with GCP's APIs - in my understanding they want readers to acknowledge messages quickly.

All that said, I'm open to a proposal how to handle things differently (preferably with a working PR), but its not a priority to do major rework here for us at this point. We may revise this in a future when we add more auditing (as in pg_audit) focused features to pganalyze.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle read-only filesystem gracefully #355

Handle read-only filesystem gracefully #355

jawnsy commented Dec 30, 2022

lfittl commented Jan 20, 2023

Handle read-only filesystem gracefully #355

Handle read-only filesystem gracefully #355

Comments

jawnsy commented Dec 30, 2022

lfittl commented Jan 20, 2023