Plagued by segmentation faults #2286
Agreed.
I don't think this will work since the process having the issue is two fork/execs away from
Confirmed that catchsegv doesn't yield anything... Will try to get a coredump.
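Whether a usable core file appears at all depends on how the kernel is configured to deliver core dumps. A quick Linux-only check (this is generic procfs/rlimit inspection, nothing pgBackRest-specific):

```python
# Check where the Linux kernel delivers core dumps (Linux-only; these are the
# kernel's standard interfaces, not anything pgBackRest-specific).
import resource
from pathlib import Path

pattern = Path("/proc/sys/kernel/core_pattern").read_text().strip()
if pattern.startswith("|"):
    # Cores are piped to a handler such as systemd-coredump or apport;
    # use that tool (e.g. coredumpctl) rather than hunting for a file.
    print(f"cores piped to handler: {pattern[1:].split()[0]}")
else:
    print(f"cores written using pattern: {pattern}")

# The soft RLIMIT_CORE must also be non-zero for a dump to be produced.
soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
print("soft core limit:",
      "unlimited" if soft == resource.RLIM_INFINITY else soft)
```

If the pattern shows systemd-coredump as the handler, `coredumpctl list` and `coredumpctl gdb` are the usual way to retrieve a backtrace from a crashed child process.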
Do these help?
Hmm, I can't attach the tarball apparently. So here we go.
Unfortunately these are not as useful as I had hoped. It looks like an error has been thrown and these traces show the attempted cleanup. My guess from past experience is we are getting some sort of unexpected output/error from the S3 clone and it is causing problems. Not sure why this would cause a segfault, though. We'll need to get some logs to get an idea of what is happening when the error occurs. Add this configuration to one of your systems that has frequent errors:
When you get an error, identify the subprocess that failed (e.g. local-3 process terminated unexpectedly) and find the log file associated with that subprocess, in this case. There probably won't be any errors in this log, so you'll need to look for log blocks (separated by ----) that do not end as expected. Please note that the logs might get rather large, depending on WAL volume.
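The configuration block itself did not survive the copy above. For reference, per-file log verbosity in pgBackRest is controlled by the real option log-level-file, so the requested setting was presumably along these lines (a sketch, not the maintainer's exact snippet):

```ini
# pgbackrest.conf -- sketch only; the exact snippet asked for above is not
# preserved in this thread. log-level-file controls the verbosity of the
# per-process log files under /var/log/pgbackrest.
[global]
log-level-file=debug
```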
I've set this up on our most affected server. And now we wait ;-)
Ok, so we caught some signal 11 and a signal 6. Note that I had to redact a bit more information ("redacted" w/o "<>").
** pgBackRest repo host (the local process log on the DB server didn't show anything from that time):
** repo host
Obviously, we'll be working on solving the root cause of
It looks like the remotes are working as expected, i.e. getting an error and throwing it to the local process. But then the local crashes and the remote errors on EOF. Can we get the relevant local logs?
This is a pretty awful place to throw a rate limiting error -- when the upload is complete and we are just trying to stitch it together. Worse, it is not returned with a proper HTTP code or a sensible error. None of this should cause a crash, of course...
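The "not returned with a proper HTTP code" part matches a documented S3 quirk: CompleteMultipartUpload can return HTTP 200 OK with an Error document in the response body, so a client must parse the body even on apparent success. A minimal Python sketch of that check (the function name and sample XML are illustrative, not pgBackRest code -- pgBackRest itself is written in C):

```python
import xml.etree.ElementTree as ET

def complete_mpu_result(status_code: int, body: str) -> ET.Element:
    """Hypothetical helper: S3 CompleteMultipartUpload may return HTTP 200
    with an <Error> body, so the body must be parsed even on "success"."""
    root = ET.fromstring(body)
    tag = root.tag.split("}")[-1]  # strip any XML namespace prefix
    if tag == "Error":
        code = root.findtext("Code") or "Unknown"
        raise RuntimeError(f"S3 error despite HTTP {status_code}: {code}")
    return root

# A clean completion parses normally...
ok = complete_mpu_result(
    200,
    "<CompleteMultipartUploadResult><ETag>x</ETag></CompleteMultipartUploadResult>")

# ...but a rate-limit error hidden inside a 200 must still be caught.
try:
    complete_mpu_result(200, "<Error><Code>SlowDown</Code></Error>")
except RuntimeError as e:
    print(e)  # S3 error despite HTTP 200: SlowDown
```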
As mentioned, the local logs have nothing in them, e.g.:
(while the crash of the local-003 process was at 2024-02-22 20:49:31)
I'm confused -- doesn't look like nothing to me -- in fact looks a lot like what I was expecting. Can we get more above the entry at |
D'uh! What I didn't realize was that this was actually the subprocess we're looking for...! Someone else had extracted the logs already, so I just copied them from the ticket system... mea culpa! So here's the beginning of the log on the repo host up to the point pasted earlier:
I'll try to attach the snippet of the -archive-push-async-local-003.log
No idea why I can't attach files... sorry!
@dwsteele, do these shed some light on the issue?
Somewhat, but I have not had a chance to reproduce. Even if I can reproduce it I doubt there is an easy fix. The real answer is to commit the protocol changes in #2108 but there are some performance problems there to address. I'm hoping to get to that in this release cycle.
I spent some time trying to reproduce this but with no luck. I always see the error properly reported and a normal abort. Given that, I think it makes sense to see if #2108 fixes this issue. |
#2108 certainly looks like something that will have some influence on this, yeah ;-)
I have made the updates to #2108. We can't be sure that this will fix your issue, but I'm hoping so. It should now be a lot harder for the protocol to get into a bad state.
Please provide the following information when submitting an issue (feature requests or general comments can skip this):
pgBackRest version:
2.49, and also earlier releases
PostgreSQL version:
13.x
Operating system/version - if you have more than one server (for example, a database server, a repository host server, one or more standbys), please specify each:
Ubuntu 20.04
Did you install pgBackRest from source or from a package?
Package (PGDG repo)
Please attach the following as applicable:
- pgbackrest.conf file(s)
- postgresql.conf settings applicable to pgBackRest (archive_command, archive_mode, listen_addresses, max_wal_senders, wal_level, port)
- logs from /var/log/pgbackrest for the commands run (e.g. /var/log/pgbackrest/mystanza_backup.log)

Not allowed to fully disclose, but the essence is:
Repo host uses an internal S3 service.
My client sees a lot of segfaults (between 0 and 30+ per day per cluster) on the DB servers:
with corresponding syslog entries:
(occasional signal 6 ones too)
Apart from it being annoying and spamming the monitoring, this is not really critical, but alas! it feels wrong... ;-)
I've changed the archive_command on one cluster to include catchsegv, which may hopefully shed some light on the root cause?
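For concreteness, wrapping the archiver in catchsegv looks roughly like this (the stanza name is a placeholder, not taken from the thread). Note that, as discussed above, catchsegv only covers the top-level pgbackrest process, not the local/remote children it fork/execs, which is why it yielded nothing:

```ini
# postgresql.conf -- sketch; 'mystanza' is a placeholder stanza name
archive_command = 'catchsegv pgbackrest --stanza=mystanza archive-push %p'
```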