Transport endpoint is not connected #4330

Open
Franco-Sparrow opened this issue Apr 3, 2024 · 5 comments

Comments

@Franco-Sparrow

Franco-Sparrow commented Apr 3, 2024

Description of problem

This is an intermittent problem with Gluster client disconnections. We cannot reproduce it reliably; it occurs at random, and we suspect it happens when the SDS is under heavy load. We upgraded from Gluster 8.4 through every 10.x release, and even with the latest 10.5 we still face the same problem. The mount point suffers a brief disconnection, which is fatal for an SDS serving VMs. This time the mount point recovered on its own, but that brief disconnection was enough to cause I/O errors in all VMs running on the node.

In this version of Gluster the impact is limited to the affected volume. Previously the whole node had to be rebooted, because every Gluster mount point on that node was affected. So the underlying problem is the same, but the behavior is different. I know that a distributed two-way replicated volume is not the best setup, and that with replica 3 I might not hit this problem in the same way, thanks to quorum and its protection against node disconnections. Still, is there any way to fix this Gluster client disconnection?

[image]

Expected results

The client should not get disconnected from the rest of the cluster.

Mandatory info:

The output of the gluster volume info command

Volume Name: vol2
Type: Distributed-Replicate
Volume ID: e1158040-4e60-4254-a281-e1125a27ba23
Status: Started
Snapshot Count: 0
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Brick2: SERVER-N2:/data/glusterfs/vol2/brick0/data
Brick3: SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Brick4: SERVER-N3:/data/glusterfs/vol2/brick0/data
Brick5: SERVER-N2:/data/glusterfs/vol2/brick1/data
Brick6: SERVER-N3:/data/glusterfs/vol2/brick1/data
Options Reconfigured:
performance.client-io-threads: off
nfs.disable: on
transport.address-family: inet
storage.fips-mode-rchecksum: on
features.shard: enable
features.shard-block-size: 5GB
cluster.favorite-child-policy: mtime
user.cifs: off
performance.read-ahead: off
performance.quick-read: off
performance.io-cache: off
cluster.eager-lock: enable
network.remote-dio: enable
storage.owner-gid: 9869
storage.owner-uid: 9869
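
For reading the client logs further down, it helps to map translator names such as `vol2-client-4` to bricks. The following is only a sketch, assuming the usual Gluster convention that client translators are numbered from zero in the brick order listed above and that consecutive bricks form a replica set (replica count 2 for this `3 x 2` volume). The function name `describe_client` is hypothetical.

```python
# Brick list copied from the `gluster volume info` output above.
BRICKS = [
    "SERVER-N1:/data/glusterfs/vol2/brick0-1/data",
    "SERVER-N2:/data/glusterfs/vol2/brick0/data",
    "SERVER-N1:/data/glusterfs/vol2/brick1-1/data",
    "SERVER-N3:/data/glusterfs/vol2/brick0/data",
    "SERVER-N2:/data/glusterfs/vol2/brick1/data",
    "SERVER-N3:/data/glusterfs/vol2/brick1/data",
]
REPLICA = 2  # "Number of Bricks: 3 x 2 = 6"


def describe_client(volume, index, bricks=BRICKS, replica=REPLICA):
    """Return (translator name, brick, replica-set number) for a client index.

    Assumption: client-N is the (N+1)-th brick in listing order, and bricks
    are grouped into replica sets in that same order.
    """
    return (f"{volume}-client-{index}", bricks[index], index // replica)


for i in range(len(BRICKS)):
    name, brick, replica_set = describe_client("vol2", i)
    print(f"{name} -> {brick} (replica set {replica_set})")
```

Under that assumption `vol2-client-4` is Brick5 on SERVER-N2, which is consistent with port 54583 being reported both for that brick in the volume status output and in the `call_bail` log messages.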

The output of the gluster volume status command

Status of volume: vol2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick SERVER-N1:/data/glusterfs/vol2/bric
k0-1/data                                   51756     0          Y       5341
Brick SERVER-N2:/data/glusterfs/vol2/bric
k0/data                                     56493     0          Y       7444
Brick SERVER-N1:/data/glusterfs/vol2/bric
k1-1/data                                   60464     0          Y       5373
Brick SERVER-N3:/data/glusterfs/vol2/bric
k0/data                                     54439     0          Y       7897
Brick SERVER-N2:/data/glusterfs/vol2/bric
k1/data                                     54583     0          Y       7476
Brick SERVER-N3:/data/glusterfs/vol2/bric
k1/data                                     49841     0          Y       7929
Self-heal Daemon on localhost               N/A       N/A        Y       5405
Self-heal Daemon on SERVER-N2             N/A       N/A        Y       7508
Self-heal Daemon on SERVER-N3             N/A       N/A        Y       7961

Task Status of Volume vol2
------------------------------------------------------------------------------
There are no active volume tasks

The output of the gluster volume heal command

gluster volume heal vol2 info
Brick SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N1:/data/glusterfs/vol2/brick1-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

Brick SERVER-N3:/data/glusterfs/vol2/brick1/data
Status: Connected
Number of entries: 0

At the time of writing there were no entries pending heal, but healing did take place, as reported by our monitoring system (Zabbix) and our custom checks:

[image]
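
The custom checks mentioned above are not included in this report. Purely as an illustration, a check of that kind could parse the text printed by `gluster volume heal vol2 info` and report the pending entries per brick. The function name `heal_entry_counts` is hypothetical, and the sample text is a shortened copy of the output shown earlier.

```python
def heal_entry_counts(output):
    """Map each brick to its "Number of entries" value from heal info text.

    A non-numeric value (assumed possible when a brick is unreachable)
    is stored as None instead of raising.
    """
    counts = {}
    brick = None
    for raw in output.splitlines():
        line = raw.strip()
        if line.startswith("Brick "):
            brick = line[len("Brick "):]
        elif line.startswith("Number of entries:") and brick is not None:
            value = line.split(":", 1)[1].strip()
            counts[brick] = int(value) if value.isdigit() else None
    return counts


SAMPLE = """\
Brick SERVER-N1:/data/glusterfs/vol2/brick0-1/data
Status: Connected
Number of entries: 0

Brick SERVER-N2:/data/glusterfs/vol2/brick0/data
Status: Connected
Number of entries: 0
"""

counts = heal_entry_counts(SAMPLE)
pending = sum(v for v in counts.values() if v is not None)
print(counts, "pending:", pending)
```

A monitoring item could then alert when `pending` is greater than zero or when any brick maps to `None`.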

Provide logs present on following locations of client and server nodes

No error on glusterd:

Is there any crash ? Provide the backtrace and coredump

My node4 is a gluster client and got disconnected from the cluster.

[2024-04-03 19:58:11.743421 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743441 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743457 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, sent = 2024-04-03 19:28:03.739203 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743589 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(FGETXATTR(35)), xid = 0x8ac758d, unique = 445798450, sent = 2024-04-03 19:28:03.739161 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743613 +0000] E [rpc-clnt.c:172:call_bail] 0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), op(WRITE(13)), xid = 0x8ac758c, unique = 445798448, sent = 2024-04-03 19:28:03.739078 +0000, timeout = 1800 for 192.168.21.22:54583
[2024-04-03 19:58:11.743664 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743684 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743858 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.743888 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:12.301934 +0000 (xid=0x8ac759a)
[2024-04-03 19:58:11.744034 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(FSTAT(25)) called at 2024-04-03 19:28:03.738876 +0000 (xid=0x392c45a)
[2024-04-03 19:58:11.744155 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.383425 +0000 (xid=0x8ac759b)
[2024-04-03 19:58:11.744264 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.388829 +0000 (xid=0x8ac759c)
[2024-04-03 19:58:11.744288 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(READ(12)) called at 2024-04-03 19:28:03.741119 +0000 (xid=0x392c45b)
[2024-04-03 19:58:11.744360 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744452 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(STATFS(14)) called at 2024-04-03 19:28:15.401962 +0000 (xid=0x8ac759d)
[2024-04-03 19:58:11.744600 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799048 +0000 (xid=0x392c45c)
[2024-04-03 19:58:11.744601 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:19.437836 +0000 (xid=0x8ac759e)
[2024-04-03 19:58:11.744630 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744665 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-2: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744697 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744730 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2024-04-03 19:28:21.116111 +0000 (xid=0x8ac759f)
[2024-04-03 19:58:11.744727 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744777 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-5: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744871 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(FINODELK(30)) called at 2024-04-03 19:28:26.973849 +0000 (xid=0x8ac75a0)
[2024-04-03 19:58:11.744882 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-2: forced unwinding frame type(GlusterFS 4.x v1) op(WRITE(13)) called at 2024-04-03 19:28:03.799479 +0000 (xid=0x392c45d)
[2024-04-03 19:58:11.744897 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-4: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.744958 +0000] E [MSGID: 114031] [client-rpc-fops_v2.c:1368:client4_0_finodelk_cbk] 0-vol2-client-3: remote operation failed. [{errno=107}, {error=Transport endpoint is not connected}]
[2024-04-03 19:58:11.745003 +0000] E [rpc-clnt.c:333:saved_frames_unwind] (--> /lib/x86_64-linux-gnu/libglusterfs.so.0(_gf_log_callingfn+0x199)[0x7f08f4c51539] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x793a)[0x7f08f4bec93a] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x113)[0x7f08f4bf4ae3] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(+0x109f8)[0x7f08f4bf59f8] (--> /lib/x86_64-linux-gnu/libgfrpc.so.0(rpc_transport_notify+0x2a)[0x7f08f4bf0e1a] ))))) 0-vol2-client-4: forced unwinding frame type(GlusterFS 4.x v1) op(LOOKUP(27)) called at 2024-04-03 19:28:39.624805 +0000 (xid=0x8ac75a1)
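
The `call_bail` messages above carry both the time a request was sent and the time it was abandoned, so the wait can be computed directly. The following is a rough sketch, not a Gluster tool: the regular expression is written against the message layout seen in this log only, the sample is one of the lines above joined into a single line, and the function name `pending_seconds` is hypothetical.

```python
import re
from datetime import datetime

STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
# Layout assumed from the call_bail lines in this report.
BAIL_RE = re.compile(
    r"^\[" + STAMP + r"\.\d+ \+0000\] E \[rpc-clnt\.c:\d+:call_bail\] "
    r"\S+-(client-\d+): .*?op\((\w+)\(\d+\)\).*?sent = " + STAMP
    + r".*?timeout = (\d+)"
)


def pending_seconds(line):
    """Return (client, op, seconds waited, timeout) or None if no match."""
    match = BAIL_RE.search(line)
    if match is None:
        return None
    logged, client, op, sent, timeout = match.groups()
    fmt = "%Y-%m-%d %H:%M:%S"
    # Sub-second digits are ignored; whole seconds are enough here.
    waited = datetime.strptime(logged, fmt) - datetime.strptime(sent, fmt)
    return client, op, int(waited.total_seconds()), int(timeout)


SAMPLE = (
    "[2024-04-03 19:58:11.743530 +0000] E [rpc-clnt.c:172:call_bail] "
    "0-vol2-client-4: bailing out frame type(GlusterFS 4.x v1), "
    "op(FGETXATTR(35)), xid = 0x8ac758e, unique = 445798451, "
    "sent = 2024-04-03 19:28:03.739203 +0000, timeout = 1800 "
    "for 192.168.21.22:54583"
)

print(pending_seconds(SAMPLE))  # ('client-4', 'FGETXATTR', 1808, 1800)
```

For this line the request had been waiting about 1808 seconds against a timeout of 1800, which appears to be the default `network.frame-timeout`. If that reading is right, requests to 192.168.21.22:54583 had been unanswered since roughly 19:28, about 30 minutes before these errors were logged, so the trouble may have started well before the visible disconnection.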

The operating system / glusterfs version

On each node:

lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.4 LTS
Release:        22.04
Codename:       jammy

On server nodes:

glusterd --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.

On node4 (client):

glusterfs --version
glusterfs 10.5
Repository revision: git://git.gluster.org/glusterfs.git
Copyright (c) 2006-2016 Red Hat, Inc. <https://www.gluster.org/>
GlusterFS comes with ABSOLUTELY NO WARRANTY.
It is licensed to you under your choice of the GNU Lesser
General Public License, version 3 or any later version (LGPLv3
or later), or the GNU General Public License, version 2 (GPLv2),
in all cases as published by the Free Software Foundation.
@kCyborg

kCyborg commented Apr 3, 2024

This is a recurring issue; we have faced very similar problems in the past. As the OP mentioned above, my team and I have upgraded from 6.x to 8.x, and now with 10.5 we still face a very similar problem.

@aravindavk
Member

Please share the full mount logs from the client machine where you observed this issue.

@Franco-Sparrow
Author

Franco-Sparrow commented Apr 10, 2024

@aravindavk Hi Sir, thanks for your attention. Please check the following logs and let us know if there is something that can fix this issue. This problem keeps recurring for our client and is getting annoying.

gluster_mount_v10.5_vol2.zip

These are the logs from the client that had the issue.

@Franco-Sparrow
Author

@aravindavk Hi Sir

May we have an update on this?

@Franco-Sparrow
Author

@aravindavk Hi Sir

May we have a follow up on this?
