
polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable #503

Open
qwe123520 opened this issue Apr 26, 2024 · 26 comments
Labels
question Further information is requested


@qwe123520

Describe the problem

Starting a single-node polardb-pg in Docker fails after modifying the configuration file, with the error: polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable
The configuration file is as follows:

postgresql.txt

...

@qwe123520 qwe123520 added the question Further information is requested label Apr 26, 2024
@polardb-bot
Contributor

polardb-bot bot commented Apr 26, 2024

Hi @qwe123520 ~ Thanks for opening this issue! 🎉

Please make sure you have provided enough information for subsequent discussion.

We will get back to you as soon as possible. ❤️

@qwe123520
Author

The error log is as follows:
2024-04-26 15:31:50.066 CST [14] [14] LOG: forked new process, pid is 16, true pid is 16
2024-04-26 15:31:50.066 CST [14] [14] LOG: forked new process, pid is 17, true pid is 17
2024-04-26 15:31:50.078 CST [14] [14] LOG: polardb try start vfs process
2024-04-26 15:31:50.078 CST [14] [14] LOG: pfs in localfs mode
2024-04-26 15:31:50.081 CST [14] [14] FATAL: polardb shared storage file-dio:///var/polardb/shared_datadir is unavailable.
2024-04-26 15:31:50.081 CST [14] [14] BACKTRACE:
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(elog_finish+0x1fd) [0x555e31bde55d]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(+0x7db1ae) [0x555e31a4d1ae]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(PostmasterMain+0xf53) [0x555e319dbf63]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(main+0x830) [0x555e316bacf0]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7f6ace30cd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7f6ace30ce40]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/bin/postgres(_start+0x25) [0x555e316ca6d5]
2024-04-26 15:31:50.202 CST [14] [14] LOG: database system is shut down

@mrdrivingduck
Member

@qwe123520 What is your docker startup command?

@qwe123520
Author

I used the image "polardb/polardb_pg_local_instance" without any extra startup command.

@mrdrivingduck
Member

@qwe123520 It is not related to the image itself, but to how the container is started from the image, which is why I am asking what your container startup command is. Does starting the container with the commands below work?

docker pull polardb/polardb_pg_local_instance
docker run -it --rm polardb/polardb_pg_local_instance psql

@qwe123520
Author

I started it with this command: docker run -d --name polardb -v /data/polardb/:/var/polardb/ polardb/polardb_pg_local_instance

@qwe123520
Author

docker run -it --rm polardb/polardb_pg_local_instance psql works; it fails as soon as I use -v with a host directory.

@mrdrivingduck
Member

I started it with this command: docker run -d --name polardb -v /data/polardb/:/var/polardb/ polardb/polardb_pg_local_instance

Does the host directory /data/polardb/ exist, and is it non-empty?

@qwe123520
Author

Yes, it exists and is non-empty.

@mrdrivingduck
Member

Yes, it exists and is non-empty.

You need to start the container with a directory that exists but is empty. If the container's startup script finds the directory empty, it runs initdb there to create the data directory; if it finds the directory non-empty, it brings up the database from the data directory layout the script expects. If the directory already contains unrelated files, startup goes wrong.
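The decision described above can be sketched as a small shell function (a minimal sketch, not the actual entrypoint script; the function name `decide_startup` is hypothetical, and the only assumption is that the script keys off whether the mounted directory is empty):

```shell
# Sketch of the entrypoint's decision: an empty mounted directory triggers
# initdb, a non-empty one is assumed to hold an existing data directory.
decide_startup() {
    datadir="$1"
    if [ -z "$(ls -A "$datadir" 2>/dev/null)" ]; then
        echo "initdb"   # empty (or missing) directory: bootstrap a fresh cluster
    else
        echo "start"    # non-empty: start from the existing contents
    fi
}
```

Under this logic, mounting a directory whose contents are not a valid data directory leads to the startup failure reported above.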

@qwe123520
Author

This directory was created during a previous startup. Then I modified postgres.conf and the database would no longer come up.

@SamirWell

@mrdrivingduck Please come and answer the question!

@SamirWell

The problem only appears after modifying postgres.conf inside those directories. I don't see what it has to do with shared_datadir. Please come and look into it~

@SamirWell

Also, restoring the previous conf contents does not help either; the file just cannot be changed.

@mrdrivingduck
Member

mrdrivingduck commented May 22, 2024

@qwe123520 @SamirWell

  1. What exactly did you change? Could you provide a diff?
  2. Judging from the startup command, /data/polardb/ should contain several directories such as primary_dir/. In each directory, check current_logfiles for the name of the error log file, then look at the last entries in that error log.

@SamirWell

2024-05-22 17:56:18.943 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history.tmp
2024-05-22 17:56:18.944 CST [20] [20] LOG: vfs_rename from file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history.tmp to file-dio:///var/polardb/shared_datadir/polar_flog/flashback_log.history
2024-05-22 17:56:18.944 CST [20] [20] LOG: The flashback log will switch from 0/877E0 to 0/10000000
2024-05-22 17:56:18.944 CST [20] [20] LOG: The flashback log shared buffer is ready now, the current point(position) is 0/10000000(0/FF3FFF0), previous point(position) is 0/0(0/0), initalized upto point is 0/10000000
2024-05-22 17:56:18.945 CST [20] [20] LOG: enable persisted slot, read slot from polarstore.
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs open dir pg_replslot, num open dir 1
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs open dir file-dio:///var/polardb/shared_datadir/pg_replslot, num open dir 1
2024-05-22 17:56:18.945 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/pg_replslot/replica1/state.tmp
2024-05-22 17:56:18.946 CST [20] [20] LOG: restore slot replica1 with version 10002, replay_lsn is 0/1BA24B8, restart_lsn is 0/1752788
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs_unlink file-dio:///var/polardb/shared_datadir/pg_replslot/replica2/state.tmp
2024-05-22 17:56:18.946 CST [20] [20] LOG: restore slot replica2 with version 10002, replay_lsn is 0/1BA24B8, restart_lsn is 0/1752788
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs open dir pg_replslot, num open dir 1
2024-05-22 17:56:18.946 CST [20] [20] LOG: vfs open dir file-dio:///var/polardb/shared_datadir/pg_twophase, num open dir 1
2024-05-22 17:56:18.946 CST [20] [20] LOG: database system was not properly shut down; automatic recovery in progress
2024-05-22 17:56:18.946 CST [20] [20] LOG: state is 4
2024-05-22 17:56:18.965 CST [19] [19] LOG: polar_flog_index log index is insert from 28
2024-05-22 17:56:19.023 CST [19] [19] WARNING: The flashback log record at 0/895F0 will be ignore. and switch to 0/10000028
2024-05-22 17:56:19.023 CST [19] [19] LOG: Recover the flashback logindex to 0/10000000
2024-05-22 17:56:19.362 CST [21] [21] PANIC: polardb shared storage is unavailable.
2024-05-22 17:56:19.362 CST [21] [21] BACKTRACE:
postgres(5432): polar worker process (+0x3fdc5e) [0x560ccc2d4c5e]
/home/postgres/tmp_basedir_polardb_pg_1100_bld/lib/polar_worker.so(polar_worker_handler_main+0xd6) [0x7fdf24745ff6]
postgres(5432): polar worker process (StartBackgroundWorker+0x2d7) [0x560ccc629517]
postgres(5432): polar worker process (+0x76441c) [0x560ccc63b41c]
postgres(5432): polar worker process (+0x765dbe) [0x560ccc63cdbe]
postgres(5432): polar worker process (PostmasterMain+0xd4c) [0x560ccc640d5c]
postgres(5432): polar worker process (main+0x830) [0x560ccc31fcf0]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fdf231fed90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fdf231fee40]
postgres(5432): polar worker process (_start+0x25) [0x560ccc32f6d5]

@SamirWell

I set max_connections = 2000 in the conf inside every directory.

@SamirWell

What I actually did just now: first I let Docker initialize the database without starting it, then I changed the maximum number of connections to 2000 in all of the configuration files.

Starting the container after that was fine.

When I restarted the container once more, it failed. There must be another cause; it looks like a problem with remounting:

inline int
polar_mount(void)
{
	int ret = 0;
	if (polar_vfs[polar_vfs_switch].vfs_mount)
		ret = polar_vfs[polar_vfs_switch].vfs_mount();
	if (polar_enable_io_fencing && ret == 0)
	{
		/* POLAR: FATAL when shared storage is unavailable, or force to write RWID. */
		if (polar_shared_storage_is_available())
		{
			polar_hold_shared_storage(false);
			POLAR_IO_FENCING_SET_STATE(polar_io_fencing_get_instance(), POLAR_IO_FENCING_WAIT);
		}
		else
			elog(FATAL, "polardb shared storage %s is unavailable.", polar_datadir);
	}
	return ret;
}

inline int
polar_remount(void)
{
	int ret = 0;
	if (polar_vfs[polar_vfs_switch].vfs_remount)
		ret = polar_vfs[polar_vfs_switch].vfs_remount();
	if (polar_enable_io_fencing && ret == 0)
	{
		/* POLAR: FATAL when shared storage is unavailable, or force to write RWID. */
		if (polar_shared_storage_is_available())
		{
			polar_hold_shared_storage(true);
			POLAR_IO_FENCING_SET_STATE(polar_io_fencing_get_instance(), POLAR_IO_FENCING_WAIT);
		}
		else
			elog(FATAL, "polardb shared storage %s is unavailable.", polar_datadir);
	}
	return ret;
}

@SamirWell

@mrdrivingduck Could you test the scenario yourself?

@mrdrivingduck
Member

I tested the following scenario and found no problem:

$ mkdir polardb_pg
$ docker run -it --rm \
    --env POLARDB_PORT=5432 \
    --env POLARDB_USER=u1 \
    --env POLARDB_PASSWORD=your_password \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance \
    echo 'done'

## edit max_connections in three postgresql.conf files

$ docker run -d \
    -p 54320-54322:5432-5434 \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance

36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61

$ docker exec -it 36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61 bash
$ ps -ef
$ exit

$ docker stop 36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61            
36c196cd8cb3e7b3dfcd2b9268409377462ee42caf95289080ce20f17ab45f61

$ docker run -d \                                                                      
    -p 54320-54322:5432-5434 \
    -v ./polardb_pg:/var/polardb \
    polardb/polardb_pg_local_instance

cdbffcd6b3e6e2f55ac98ee61bfd48ac185db624f5142f3dfc7a0f920ac7a154

$ docker exec -it cdbffcd6b3e6e2f55ac98ee61bfd48ac185db624f5142f3dfc7a0f920ac7a154 bash
$ ps -ef

@SamirWell

Could it be because I am deploying on k3s?

@mrdrivingduck
Member

Could it be because I am deploying on k3s?

You need to check whether /var/polardb/shared_datadir can be accessed correctly from inside the container and whether its contents look as expected. Also make sure the volume is not mounted by more than one container.
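The suggested check can be sketched as a small shell function (the helper name is hypothetical; /var/polardb/shared_datadir is the path from this thread):

```shell
# Sketch of the suggested in-container check: the shared data directory must
# exist and be readable and writable by the database process.
check_shared_datadir() {
    dir="$1"
    if [ -d "$dir" ] && [ -r "$dir" ] && [ -w "$dir" ]; then
        echo "accessible"
    else
        echo "unavailable"
    fi
}
```

For example, run check_shared_datadir /var/polardb/shared_datadir inside the container as the database user.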

@SamirWell

SamirWell commented May 22, 2024

You need to check whether /var/polardb/shared_datadir can be accessed correctly from inside the container and whether its contents look as expected. Also make sure the volume is not mounted by more than one container.

So if there is a window during a k3s/k8s rolling upgrade in which two pods mount the volume simultaneously, it will crash, right?

I just retested the delayed-restart scenario and it still crashes o(╥﹏╥)o

@mrdrivingduck
Member

So if there is a window during a k3s/k8s rolling upgrade in which two pods mount the volume simultaneously, it will crash, right?

polardb_pg_local_instance is a demo image for running a shared-storage cluster on a single machine. It contains a simple entrypoint script that handles management, so that an instance can be brought up quickly for a trial. If you have external cluster management and storage management, they will conflict with the entrypoint script running inside the image. To integrate with cluster-management tooling, use the plain binary image polardb/polardb_pg_binary instead; it contains no management scripts.

@SamirWell

In the end, I tested that running the following before each restart fixes it:

rm -f $shared_datadir/DEATH

So this should make it suitable for single-node deployment on k8s/k3s, right?

@mrdrivingduck
Member

In the end, I tested that running the following before each restart fixes it:

rm -f $shared_datadir/DEATH

The presence of this file means that at least two database instances have been started on the same data directory, and that is a problem.
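For a deployment where you can guarantee that only one instance ever mounts the volume, the workaround above could be wrapped in a pre-start hook along these lines (a hedged sketch; the DEATH marker path comes from this thread, the function name is hypothetical, and note the warning above that the marker indicates two instances shared the data directory):

```shell
# Sketch of a pre-start cleanup for a strictly single-instance deployment.
# Removing the DEATH marker bypasses the IO-fencing protection, so this is
# only safe when no second instance can ever mount the same volume.
clear_fencing_marker() {
    shared_datadir="$1"
    if [ -f "$shared_datadir/DEATH" ]; then
        rm -f "$shared_datadir/DEATH"
        echo "removed stale fencing marker"
    else
        echo "no fencing marker present"
    fi
}
```

In a k8s/k3s setting this would run before the database process starts, e.g. from an init container or the pod's startup command.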
