Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket creation problem with Upgrading database (by switching to newer PostgreSQL image version) #362

Open
kevindafm opened this issue Mar 3, 2020 · 16 comments
Labels
documentation resolution of this issue requires fixing documentation easyfix

Comments

@kevindafm
Copy link

Hello. This process "Upgrading database (by switching to newer PostgreSQL image version)" looks awesome and is a really cool idea. See this readme -> https://github.com/sclorg/postgresql-container/tree/generated/9.6

Unfortunately it does not work fully for me. It would appear there is a problem creating the socket.

The logs files from the upgrade are below.

sh-4.2$ more pg_upgrade_internal.log


pg_upgrade run on Mon Mar 2 15:19:46 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:20:58 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:22:17 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:24:13 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:28:06 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:29:18 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:30:56 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:32:49 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:34:46 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:36:35 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art


pg_upgrade run on Mon Mar 2 15:39:46 2020

Performing Consistency Checks

Checking cluster versions ok

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

connection to database failed: could not connect to server: No such file or directory
Is the server running locally and accepting
connections on Unix domain socket "/var/lib/pgsql/data/.s.PGSQL.50432"?

could not connect to old postmaster started with the command:
"/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/data'" st
art

I wonder is this a permissions issue or is postgresql already started. It is worth noting that normally the socket gets created in /tmp but during the upgrade process it attempts to create the socket in /var/lib/pgsql/data/.s.PGSQL.50432. I am not sure which user attempts to create the socket during the upgrade process.

Is it the uid below
sh-4.2$ id
uid=1001440000 gid=0(root) groups=0(root),2044,1001440000
sh-4.2$

Or the postgres user?

sh-4.2$ more /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
default:x:1001:0:Default Application User:/opt/app-root/src:/sbin/nologin
postgres:x:26:26:PostgreSQL Server:/var/lib/pgsql:/bin/bash

sh-4.2$ ls -a /tmp
. .. .ICE-unix .Test-unix .X11-unix .XIM-unix .font-unix .s.PGSQL.5432 .s.PGSQL.5432.lock ks-script-OQQXsO yum.log
sh-4.2$

I have also raised a redhat support ticket for this issue (02599066). Any information/discussion much appreciated.

@pkubatrh
Copy link
Member

pkubatrh commented Mar 4, 2020

Hi @kevindafm , thanks for getting in touch!

I will look into this. I tried reproduce the issue you are having just via docker, but I am unable to do so - the upgrade always succeeds for me. Will try out on an Openshift instance as well, to see what might be different.

For your question about what user runs the upgrade - the upgrade is run by the user running the image, which should be in your case:
uid=1001440000 gid=0(root) groups=0(root),2044,1001440000

Note that this is also the postgres user, since we are using nss_wrapper internally to map the random uids Openshift generates for the runtime to postgres username (postgres datadir should hold a passwd file if you check it).

@kevindafm
Copy link
Author

kevindafm commented Mar 4, 2020

Hi @pkubatrh

Thanks for looking into this. You are correct about the uids. I missed the nss_wrapper piece but I can see the entry in /var/lib/pgsql/passwd for postgres maps the uid to 1001440000

sh-4.2$ more /var/lib/pgsql/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
mail:x:8:12:mail:/var/spool/mail:/sbin/nologin
operator:x:11:0:operator:/root:/sbin/nologin
games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
nobody:x:99:99:Nobody:/:/sbin/nologin
systemd-network:x:192:192:systemd Network Management:/:/sbin/nologin
dbus:x:81:81:System message bus:/:/sbin/nologin
default:x:1001:0:Default Application User:/opt/app-root/src:/sbin/nologin
postgres:x:1001440000:0:PostgreSQL Server:/var/lib/pgsql:/bin/bash

One difference between direct docker and OpenShift is that Openshift generally reduces the permissions a bit more. This may also happen in Docker but not always.

From my postgresql pod yaml I see the following.

securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
privileged: false
runAsUser: 1001440000

Also this piece

securityContext:
fsGroup: 1001440000
seLinuxOptions:
level: 's0:c38,c17'

I am not sure a direct run in docker is as restrictive?

Thanks again for looking into it.

Kevin.

@pkubatrh
Copy link
Member

pkubatrh commented Mar 9, 2020

Hi @kevindafm

Tried out the upgrade on Openshift as well and was not able to reproduce with a simple test deployment. Took the following steps:

  1. Created a postgresql depolyment via the persistent template from this repo:
$ oc process -f examples/postgresql-persistent-template.json -p POSTGRESQL_VERSION=9.5 | oc create -f -
  1. Waited until the pod is created and running, oc rsh-ed into it and filled the db with some testing data
postgres=# select * from testing;
 i 
---
 1
 2
 3
(3 rows)
  1. Patched the deploymentConfig to move to postgresql:9.6 and added the POSTGRESQL_UPGRADE=copy env definition.
$ oc set triggers dc/postgresql --from-image="openshift/postgresql:9.6" -c postgresql
$ oc set triggers dc/postgresql --from-image="openshift/postgresql:9.5" -c postgresql --remove=true
$ oc set env dc/postgresql POSTGRESQL_UPGRADE=copy
  1. Waited for the redeploy to finish and checked the data in the upgraded pod
$ oc exec postgresql-3-75fhs -- bash -c "psql --version"
psql (PostgreSQL) 9.6.10
$ oc exec postgresql-3-75fhs -- bash -c "psql -c 'select * from testing;'"
 i 
---
 1
 2
 3
(3 rows)

Any ideas how to reproduce are appreciated.

@pkubatrh
Copy link
Member

Remembered today that we had a similar issue on Guster-backed PVC with regards to sockets and upgrades, could be related: https://bugzilla.redhat.com/show_bug.cgi?id=1696732

@kevindafm
Copy link
Author

kevindafm commented Mar 10, 2020

Hi Petr,

Thanks for response and further procedures. I think in my original attempt I had some mistakes for example I did not have the remove of 9.5 image.

I tred your steps above but I get the errors below.

usr/share/container-scripts/postgresql/common.sh: line 121: [: 9.6: integer expression expected

========== $PGDATA upgrade: 9.5 -> 9.6 ==========

===> Initialize new data directory; we will migrate to that.

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.

The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".

Data page checksums are disabled.

creating directory /var/lib/pgsql/data/userdata-new ... ok
creating subdirectories ... ok
selecting default max_connections ... 100
selecting default shared_buffers ... 128MB
selecting dynamic shared memory implementation ... posix
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
syncing data to disk ... ok

Success. You can now start the database server using:

/opt/rh/rh-postgresql96/root/usr/bin/pg_ctl -D /var/lib/pgsql/data/userdata-new -l logfile start

WARNING: enabling "trust" authentication for local connections
You can change this by editing pg_hba.conf or using the option -A, or
--auth-local and --auth-host, the next time you run initdb.

===> Starting the pg_upgrade process.

failure
Consult the last few lines of "pg_upgrade_server.log" for
the probable cause of the failure.

There seems to be a postmaster servicing the old cluster.
Please shutdown that postmaster and try again.
Failure, exiting


pg_upgrade run on Tue Mar 10 14:46:57 2020

command: "/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -c autovacuum=off -c autovacuum_freeze_max_age=2000000000 -c listen_addresses='' -c unix_socket_permissions=0700" start >> "pg_upgrade_server.log" 2>&1
pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....FATAL: lock file "postmaster.pid" already exists
HINT: Is another postmaster (PID 1) running in data directory "/var/lib/pgsql/data/userdata"?
stopped waiting
pg_ctl: could not start server
Examine the log output.

Just to say I am following your procedure above when I get the errors above.

I wonder what else I can try?

Best Regards,

Kevin.

@kevindafm
Copy link
Author

Hi @pkubatrh,

I have proved this is a gluster file versus block problem. If I run the upgrade process on a postgress database with gluster block volumes it works no problem. If I run it on gluster file volumes it does not work.

Unfortunately the default storage class is gluster file in the postgress deployment config / template. Hence I have a number of postgres databases running on gluster file.

So I wonder can this be fixed or at least investigated here -> https://bugzilla.redhat.com/show_bug.cgi?id=1696732

Best Regards,

Kevin.

@pkubatrh
Copy link
Member

Hi @kevindafm,

thanks for providing more info! If this is tied to using gluster file I am afraid it might be the same case as bug 1696732. I do not think we ever figured out what is wrong with gluster itself but we managed to figure out a workaround for the socket permissions issue - you just need to set POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS environment variable to use a different socketdir (eg. /tmp) for the upgrade step.

Unfortunately, there is a bug in postgresql itself that causes the user-provided arguments to be overwritten by hard-coded ones (see upstream discussion). This is something that needs to be fixed in the rpms, so please get in touch with your support representative and tell them to create a bug against the Red Hat Software Collections product, postgresql component, rh-postgresql96 version.

@UncleAlbie Since it has been mostly you working on the bug above, is there anything else important I missed?

@hhorak
Copy link
Member

hhorak commented Mar 12, 2020

@pkubatrh Just to make sure I understand the potential solution -- would that problem be fixed if we backported the same patch as we're considering in bug 1696732?

@kevindafm
Copy link
Author

Hi @pkubatrh ,

Thanks for coming back. Can you provide an example of POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS?

I tried setting to just /tmp. I also tried setting to unix_socket_directories='/tmp'. I get the same error as before.

Do I need to set all options or just the /tmp?

I see this in the logs of the postgres upgrade process.

command: "/opt/rh/rh-postgresql95/root/usr/bin/pg_ctl" -w -l "pg_upgrade_server.log" -D "/var/lib/pgsql/data/userdata" -o "-p 50432 -b -c listen_addresses='' -c unix_socket_permissions=0700 -c unix_socket_directories='/var/lib/pgsql/dat

Does that help me know what to set the POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS to?

Any help much appreciated.

Best Regards,

Kevin.

@panovotn
Copy link

@pkubatrh I think you've summarised it pretty well. I don't think I have anything to add here.

@kevindafm As described by pkubatrh, there's a bug in postgresql. The unix_socket_directories option gets overridden by the value hardcoded into the pg_upgrade tool. See the references in pkubatrh's answer.

@kevindafm
Copy link
Author

@UncleAlbie for information and confirmation.

I have support call 02599066 raised about this issue also. Does combination of discussion here and the support call mean the hardcoded values in pg_upgrade tool will get fixed?

There are 2 options here.

  1. Fix pg_upgrade tool over riding hardcoded values.
  2. Fix the original gluster file / postgres writing socket problem.

I think the pg_upgrade tool fix will be easiest to fix but I don't know for sure. I also thin having something like POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS is useful for future upgrades etc.

Best Regards,

Kevin.

@hhorak
Copy link
Member

hhorak commented Mar 12, 2020

I have support call 02599066 raised about this issue also. Does combination of discussion here and the support call mean the hardcoded values in pg_upgrade tool will get fixed?

@kevindafm It definitely helped us to find out where is the problem. That said, this github repo is not a support tool, so I'd like to ask you to continue discussing this issue with the Red Hat Support, because they need to drive the update itself. In any case, thank you for spending time on debugging this and helping us to find a way forward. This is really appreciated.

@kevindafm
Copy link
Author

No problem @hhorak I am keeping the support call updated with all the discussions here and other things.

Ultimately I have a bunch of postgres 9.5 (now out of support) with several CVEs I need to to upgrade to more recent postgres so I need a solution. postgres 9.5 is no longer available on the red hat container registry so nothing to fix the CVE gap! This means I must upgrade.

Thanks all on this discussion here. It has helped inform the support call and at least we all know the problems now.

@hhorak
Copy link
Member

hhorak commented Mar 13, 2020

While this issue needs to be solved either in rh-postgresql96 packages or gluster/OCP itself, I think what we can do in case of this container is documenting POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS and when it can be useful. I'd just do it after the bz#1696732 is fixed. Something like:

`POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS` -- sets additional arguments to the `pg_upgrade`
In some cases it might be useful to have a way to pass some additional arguments
to the `pg_upgrade` binary when running the upgrade within the container.
One example of such use case is work-arounding the bug in Gluster support in OpenShift
as we saw in [BZ#1696732](https://bugzilla.redhat.com/show_bug.cgi?id=1696732).
In that case, setting this variable to `/tmp` makes the trick.

How does that sound?

@kevindafm
Copy link
Author

That sounds great.

@pkubatrh
Copy link
Member

All releases of postgresql should have the socketdir option now. Hopefully all that is left is documenting POSTGRESQL_UPGRADE_PGUPGRADE_OPTIONS environment variable as Honza pointed out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation resolution of this issue requires fixing documentation easyfix
Projects
None yet
Development

No branches or pull requests

4 participants