
postgres process killed with kill -9 command (looks like it is a bug) #396

Open
jatindersandhu opened this issue Feb 4, 2021 · 12 comments

@jatindersandhu

We are running PostgreSQL 12.1 inside a container (we use this image: https://catalog.redhat.com/software/containers/rhel8/postgresql-12/5db133bd5a13461646df330b). We are facing a strange situation in which some postgres processes get killed with kill -9. I enabled debug logging in PostgreSQL. At first I thought it was an OOM kill, so I raised the container's resources, but it is not resource related: memory and CPU utilization stay far below what is allocated (8 CPUs and 100 GB of memory allocated, with only 1 CPU and 37.8 GB of memory used). The issue appears randomly (sometimes the DB is not doing anything at all). Another strange thing: the process ID that gets killed with kill -9 never appears earlier in the logs, as if the process had never been forked. Because Postgres runs inside a container (OpenShift environment) we cannot get the kernel dmesg output.

Here is a snippet of the log; the full log is also attached.

:28:06 EST,session-id=600ccd26.50f5cLOCATION: SharedInvalBackendInit, sinvaladt.c:324
2021-01-23 20:28:07.648 EST [46]=[33197-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:08 EST,session-id=6009272c.2eDEBUG: 00000: recycled write-ahead log file "000000010000014900000021"
2021-01-23 20:28:07.648 EST [46]=[33198-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:08 EST,session-id=6009272c.2eLOCATION: RemoveXlogFile, xlog.c:4106
2021-01-23 20:28:07.805 EST [331612]=[13-1]user=postgres,db=postgres,app=[unknown],client=[local],session-start=2021-01-23 20:28:06 EST,session-id=600ccd26.50f5cDEBUG: 00000: StartTransaction(1) name: unnamed; blockState: DEFAULT; state: INPROGRESS, xid/subid/cid: 0/1/0
2021-01-23 20:28:07.805 EST [331612]=[14-1]user=postgres,db=postgres,app=[unknown],client=[local],session-start=2021-01-23 20:28:06 EST,session-id=600ccd26.50f5cLOCATION: ShowTransactionStateRec, xact.c:5327
2021-01-23 20:28:08.142 EST [1]=[292190-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1DEBUG: 00000: reaping dead processes
2021-01-23 20:28:08.142 EST [1]=[292191-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: reaper, postmaster.c:2888
2021-01-23 20:28:08.142 EST [1]=[292192-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1DEBUG: 00000: server process (PID 331611) was terminated by signal 9: Killed
2021-01-23 20:28:08.142 EST [1]=[292193-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: LogChildExit, postmaster.c:3686
2021-01-23 20:28:08.142 EST [1]=[292194-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOG: 00000: server process (PID 331611) was terminated by signal 9: Killed
2021-01-23 20:28:08.142 EST [1]=[292195-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOCATION: LogChildExit, postmaster.c:3686
2021-01-23 20:28:08.142 EST [1]=[292196-1]user=,db=,app=,client=,session-start=2021-01-21 02:03:07 EST,session-id=6009272b.1LOG: 00000: terminating any other active server processes
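Since the node's dmesg isn't reachable from inside the pod, one way to at least confirm or rule out a kernel OOM kill is to read the pod's memory-cgroup counters from inside the container. A minimal sketch (the pod name is a placeholder; the first path assumes cgroup v1, with the cgroup v2 equivalent shown as well):

# cgroup v1: a non-zero oom_kill counter means the kernel OOM killer fired in this pod's memory cgroup
oc rsh <postgresql-pod> cat /sys/fs/cgroup/memory/memory.oom_control
# cgroup v2 equivalent:
oc rsh <postgresql-pod> cat /sys/fs/cgroup/memory.events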

@jatindersandhu
Author

This is a very urgent issue, as we are testing a new PostgreSQL solution.

@jpopelka
Contributor

Hello @jatindersandhu
Have you figured out anything since reporting this? We're seeing this too after moving to a new OpenShift cluster, and I'm quite clueless.

@jatindersandhu
Author

jatindersandhu commented Nov 24, 2021 via email

@phracek
Member

phracek commented Nov 26, 2021

@jatindersandhu Thanks for posting this issue/bug. Can you help me figure out how to debug and/or reproduce the error?
I am a bit familiar with OpenShift 4 :) Thank you for your help.

@jpopelka
Contributor

jpopelka commented Nov 26, 2021

It's difficult to say what's causing this.
We're seeing this on ROSA, i.e. OpenShift 4.8 on AWS,
with both postgresql-12-centos7 and postgresql-13-centos7.

[1] LOG:  server process (PID 24643) was terminated by signal 9: Killed
[1] LOG:  terminating any other active server processes
[24622] WARNING:  terminating connection because of crash of another server process
[24622] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
[24622] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
[1] LOG:  all server processes terminated; reinitializing
[24645] LOG:  database system was interrupted; last known up at 2021-11-26 16:51:16 UTC
[24645] LOG:  database system was not properly shut down; automatic recovery in progress
[24645] LOG:  invalid record length at 0/1DEE328: wanted 24, got 0
[24645] LOG:  redo is not required
[1] LOG:  database system is ready to accept connections

It doesn't matter whether we started with an empty database or imported pg_dumpall-ed data with psql.
Giving the pod a lot of resources (gigabytes of memory, several CPUs) seems to help initially, but the killer strikes back after a while.
One thing that helps with our staging database, where there isn't much data, is to run watch pg_isready in a terminal of the pod.
I know it's ridiculous and can't possibly be related, but that really scares the s*** out of the killer and it disappears.
In our production database, where there's more data, this didn't help when I tried it, however.
Then, when I created a new pod with postgres-13 (the previous one was postgres-12, so I wanted to try v13 as well) and imported the production data, it looked like it was working without any problems. But when I then tried postgres-13 on staging as well, the problem showed up again.
So as you can see, this is quite indeterminate and there's no reliable reproducer AFAIK.

@jatindersandhu
Author

jatindersandhu commented Nov 26, 2021

Sorry for the confusion regarding OpenShift and the Crunchy operator. It seems it is a resource issue. A kill -9 can only be issued by Kubernetes, because Postgres itself does not use kill -9 on its processes; that would restart the whole database. I think we need to check whether a liveness or readiness probe is configured. Sometimes OpenShift has a bug where the probe times out and Kubernetes kills Postgres. The solution for this is to increase the probe timeout, e.g. to 5 minutes.
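For reference, a minimal sketch of how the probes could be inspected and relaxed with the oc client. The deployment name postgresql is a placeholder, and the oc set probe flag names (which mirror the probe fields timeoutSeconds, periodSeconds, failureThreshold) should be double-checked against oc set probe --help:

# Show the probes currently configured on the database container:
oc describe deployment postgresql | grep -i -A 2 -E 'liveness|readiness'
# Relax the probes so that a slow response is not treated as a dead database:
oc set probe deployment/postgresql --readiness --liveness \
    --timeout-seconds=60 --period-seconds=60 --failure-threshold=5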

@jpopelka
Contributor

jpopelka commented Nov 26, 2021

Experimenting with the probes was the first thing I did, and from what I remember I was seeing the problem even without any probe, so the probe shouldn't be the problem per se.

@TomasTomecek
Contributor

I wonder if we should increase logging and analyze those logs; it could reveal how much memory Postgres tries to acquire and why.

Pity this one isn't merged yet: #366

@jpopelka
Contributor

I wonder if we should increase logging

The reporter of this issue did that; see the first comment.

@praiskup
Member

FTR, the service file gets this:

# Disable OOM kill on the postmaster
OOMScoreAdjust=-1000

Is the kernel OOM killer enabled in the OpenShift environment?
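Worth noting that OOMScoreAdjust= is a systemd directive and only applies when PostgreSQL runs as a systemd service; inside the container there is no systemd, so the effective values have to be checked directly. A minimal sketch, assuming the pod name placeholder below and that pgrep is available in the image (the postmaster runs as PID 1 here, as the [1] prefix in the logs above shows):

# OOM score adjustment of the postmaster (PID 1):
oc rsh <postgresql-pod> cat /proc/1/oom_score_adj
# Backends should show the value coming from PG_OOM_ADJUST_VALUE (0 by default):
oc rsh <postgresql-pod> bash -c 'for p in $(pgrep -f "postgres:"); do echo "$p $(cat /proc/$p/oom_score_adj)"; done'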

@jpopelka
Contributor

jpopelka commented Dec 1, 2021

Good point Pavel, thanks.

The Managing Kernel Resources chapter says:

Another approach is to set the process-specific OOM score adjustment value for the postmaster process to -1000,
thereby guaranteeing it will not be targeted by the OOM killer.
The simplest way to do this is to execute
echo -1000 > /proc/self/oom_score_adj
in the postmaster's startup script just before invoking the postmaster.
Note that this action must be done as root, or it will have no effect; so a root-owned startup script is the easiest place to do it.
If you do this, you should also set these environment variables in the startup script before invoking the postmaster:
export PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
export PG_OOM_ADJUST_VALUE=0
These settings will cause postmaster child processes to run with the normal OOM score adjustment of zero,
so that the OOM killer can still target them at need.

and yes, the postgresql.service follows that with

# Disable OOM kill on the postmaster
OOMScoreAdjust=-1000
# ... but allow it still to be effective for child processes
# (note that these settings are ignored by Postgres releases before 9.5)
Environment=PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj
Environment=PG_OOM_ADJUST_VALUE=0

In K8s/OpenShift the OOM score adjustment can't be turned off completely (i.e. set to -1000), but it can be kept to a minimum if the pod has Guaranteed QoS, i.e. when the memory limit and memory request are set to the same value.
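A minimal sketch of what that looks like in practice, assuming a Deployment and label named postgresql (both placeholders): giving every container identical requests and limits for CPU and memory puts the pod in the Guaranteed QoS class.

oc set resources deployment/postgresql \
    --requests=cpu=1,memory=2Gi --limits=cpu=1,memory=2Gi
# Verify the resulting QoS class of the pod:
oc get pod -l name=postgresql -o jsonpath='{.items[0].status.qosClass}{"\n"}'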

The PG_OOM_ADJUST_FILE=/proc/self/oom_score_adj is also very important, otherwise, the child processes will run with the same OOM score adjustment as the postmaster, which is unwise since the whole point is to ensure that the postmaster has a preferential setting.

I tried that, gave the pod 2Gi of memory as both request and limit, and even set PG_OOM_ADJUST_VALUE=900 (the default is 0), and still, sooner or later, I see
[1] LOG: server process (PID 114118) was terminated by signal 9: Killed
in the log. As I understand it, this says that it wasn't the main process (postmaster) that was killed but some other/child process; still, all connections were terminated.

Also, when I gave the pod a really small amount of memory, like 60Mi (requests & limits), because I explicitly wanted to see the OOM killer in action, it ran just fine for quite some time before the killer showed up.

I tend more and more to think that it's the memory overcommit, which Kubernetes actively sets to vm.overcommit_memory=1 (always overcommit), while the PostgreSQL community recommends setting it to vm.overcommit_memory=2 (don't overcommit).
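vm.overcommit_memory is a node-level sysctl and is not namespaced, so reading it from inside the pod reflects the node's setting; a quick check (pod name is a placeholder):

oc rsh <postgresql-pod> cat /proc/sys/vm/overcommit_memory
# 0 = heuristic overcommit, 1 = always overcommit (what kubelet sets), 2 = don't overcommit (what the PostgreSQL docs recommend)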

Also this article summarizes it very well.

The result is that I still don't have any clue how to make it work reliably :(

@jpopelka
Contributor

Experimenting with the probes was the first thing I did and from what I remember I was seeing the problem even without any probe, so the probe shouldn't be a problem per se.

Looks like it really was a readiness-probe (related) issue after all. After I tried, once more, to remove the probe, the problem disappeared for good. I have no idea why it hadn't worked the first time I tried it. Maybe it's the combination of all the changes I made, which is:
