IO hang goes undetected #154

Open
ZhengHG123 opened this issue Dec 16, 2021 · 6 comments
ZhengHG123 commented Dec 16, 2021

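Symptom: When the data disk used by the primary node becomes unreadable/unwritable or read-only, every request that involves disk I/O hangs instead of returning an error, while the mysqld process stays alive. When this fault occurs on the primary node, Xenon does not trigger a leader election or failover.
Method: We reproduced the behavior with chaos testing, injecting a 100 s delay into every I/O operation on the data disk to simulate an I/O hang.
Observations:

 0. With sysbench applying continuous load, we waited for the load to stabilize, confirmed the cluster was healthy, and then injected the I/O delay fault;
 1. The disk becomes unreadable and unwritable, and QPS on the sysbench side drops to 0;
 2. xenoncli cluster status shows that, for the leader, the Mysql field is empty and the IO/SQL field has changed to false: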

$ ./xenoncli cluster status
+------------------+--------------------------------+---------+---------+----------------------------+---------------------+----------------+------------------+
|        ID        |              Raft              | Mysqld  | Monitor |           Backup           |        Mysql        | IO/SQL_RUNNING |     MyLeader     |
+------------------+--------------------------------+---------+---------+----------------------------+---------------------+----------------+------------------+
| ip1:port1        | [ViewID:39 EpochID:9]@FOLLOWER | RUNNING | ON      | state:[NONE]␤              | [ALIVE] [READONLY]  | [true/true]    | ip2:port2        |
|                  |                                |         |         | LastError:␤                |                     |                |                  |
+------------------+--------------------------------+---------+---------+----------------------------+---------------------+----------------+------------------+
| ip2:port2        | [ViewID:39 EpochID:9]@LEADER   | RUNNING | ON      | state:[NONE]␤              | []                  | [false/false]  | ip2:port2        |
|                  |                                |         |         | LastError:␤                |                     |                |                  |
+------------------+--------------------------------+---------+---------+----------------------------+---------------------+----------------+------------------+
| ip3:port3        | [ViewID:39 EpochID:9]@FOLLOWER | RUNNING | ON      | state:[NONE]␤              | [ALIVE] [READWRITE] | [true/true]    | ip2:port2        |
|                  |                                |         |         | LastError:␤                |                     |                |                  |
+------------------+--------------------------------+---------+---------+----------------------------+---------------------+----------------+------------------+
(3 rows)

 3. The xenon log shows two ERROR-level entries being reported continuously:

2021/12/08 09:56:00.747267 api.go:292:          [ERROR]        mysql.get.master.gtid.error[db.query.timeout[20000, SHOW MASTER STATUS]]
 2021/12/08 09:56:00.747309 trace.go:37:         [ERROR]        LEADER[ID: ip2:port2, V:1, E:2].mysql.get.gtid.error[db.query.timeout[20000, SHOW MASTER STATUS]]

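 4. Logging in to the MySQL primary works normally, but executing an insert statement hangs;
[Note] In this test case the Xenon files and the mysqld data files are not on the same data disk; the I/O delay fault was injected only into the data disk used by mysqld.
The case where Xenon and mysqld share the same disk has not been tested yet.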
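
The log above shows the `SHOW MASTER STATUS` probe timing out repeatedly while no election is triggered. One way to picture a possible remedy is to treat a run of consecutive probe timeouts as a failure signal. The following is a minimal sketch of that idea only, not Xenon's actual code; the class name `HangDetector` and the threshold of 3 are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Counts consecutive probe timeouts (hypothetical helper, not part of Xenon)."""

    max_consecutive_timeouts: int = 3  # assumed threshold, tune per deployment
    consecutive_timeouts: int = 0

    def record(self, timed_out: bool) -> bool:
        """Record one probe result; return True once the threshold is reached."""
        if timed_out:
            self.consecutive_timeouts += 1
        else:
            # Any successful probe resets the streak.
            self.consecutive_timeouts = 0
        return self.consecutive_timeouts >= self.max_consecutive_timeouts


detector = HangDetector(max_consecutive_timeouts=3)
probe_results = [False, True, True, True]  # True means the probe timed out
decisions = [detector.record(r) for r in probe_results]
print(decisions)  # [False, False, False, True]
```

Requiring several timeouts in a row, rather than one, is a design choice meant to avoid demoting a leader over a single slow query.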

@ZhengHG123 ZhengHG123 changed the title from "IO hang" to "IO hang goes undetected" Dec 16, 2021
@andyli029
Contributor

ACK

@andyli029
Contributor

@ZhengHG123

  1. Does the chaos test use the following approach? https://chaos-mesh.org/website-zh/docs/simulate-disk-pressure-in-physical-nodes/
  2. How did you inject a 100 s delay into every I/O operation on the data disk?

@andyli029 andyli029 assigned andyli029 and caphash and unassigned caphash Dec 29, 2021
@ZhengHG123
Author

ZhengHG123 commented Dec 30, 2021

We used this approach: https://chaos-mesh.org/website-zh/docs/simulate-io-chaos-on-kubernetes/
For the 100 s I/O delay, see the latency example on that page.
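
For readers who want to try the same injection, here is a rough sketch that builds an IOChaos manifest with the `latency` action and prints it as JSON, which `kubectl apply -f` also accepts. The field names follow the latency example on the linked Chaos Mesh page as far as I recall it, so check them against the Chaos Mesh version you run. The resource name, namespace, label selector, and volume path are placeholders, not values from this test.

```python
import json


def io_latency_manifest(name, namespace, labels, volume_path,
                        delay="100s", duration="400s"):
    """Build an IOChaos 'latency' manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "IOChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "latency",
            "mode": "one",                       # inject into one matching pod
            "selector": {"labelSelectors": labels},
            "volumePath": volume_path,           # mount point of the data volume
            "path": f"{volume_path}/**/*",       # affect every file under it
            "delay": delay,                      # 100 s, as described in this issue
            "percent": 100,                      # delay all I/O operations
            "duration": duration,                # assumed experiment length
        },
    }


manifest = io_latency_manifest(
    name="mysql-io-latency",        # placeholder
    namespace="default",            # placeholder
    labels={"app": "mysql"},        # placeholder selector
    volume_path="/var/lib/mysql",   # placeholder data directory
)
print(json.dumps(manifest, indent=2))
```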

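In addition, we reproduced in a private cloud the I/O hang that was previously caused by a dropped disk, and tested the case where Xenon (logs, raft files, etc.) and the mysqld data files are on the same NeonSAN disk. The observations were:
1. Disk reads and writes hang without reporting an error (commands such as ls and touch block);
2. QPS on the sysbench side drops to 0;
3. The xenoncli cluster status command blocks, so the Xenon cluster status cannot be viewed;
4. The write IP did not switch over;
5. All of the anomalies above cleared once the I/O hang ended;
@andyli029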
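
Since `ls` and `touch` simply block during the hang, any check that performs the I/O on its own calling thread would block too. A common workaround is to run the write in a helper thread and give up after a deadline. The sketch below illustrates that pattern under stated assumptions; the function names and the 5 s default are hypothetical, and this is not how Xenon currently checks the disk.

```python
import os
import tempfile
import threading


def touch_and_sync(directory):
    """Create, write, fsync, and remove a small temporary file in `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        os.write(fd, b"probe")
        os.fsync(fd)  # force the write down to the device
    finally:
        os.close(fd)
        os.unlink(path)


def probe_write(directory, timeout_s=5.0, write_fn=touch_and_sync):
    """Return True only if the write finishes without error within timeout_s."""
    result = {"ok": False}
    finished = threading.Event()

    def worker():
        try:
            write_fn(directory)
            result["ok"] = True
        except OSError:
            result["ok"] = False  # an explicit I/O error is also unhealthy
        finally:
            finished.set()

    # Daemon thread: a stuck probe must not keep the process from exiting.
    threading.Thread(target=worker, daemon=True).start()
    if not finished.wait(timeout_s):
        return False  # deadline passed, the write is presumably hung
    return result["ok"]


print(probe_write(tempfile.gettempdir()))
```

One caveat: while the disk stays hung, each probe leaves a blocked thread behind, so a real checker would want to avoid starting a new probe until the previous one has returned.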

@andyli029
Contributor

Do you have the xenon.log?

@ZhengHG123
Author

ZhengHG123 commented Dec 30, 2021

After the hang ended, the xenon log (on the leader's Xenon) contained only these three consecutive ERROR entries:

138968  2021/12/28 10:17:10.525831 api.go:292:          [ERROR]        mysql.get.master.gtid.error[db.query.timeout[20000, SHOW MASTER STATUS]]
138969  2021/12/28 10:17:15.485798 api.go:282:          [ERROR]        mysql.get.slave.gtid.error[db.query.timeout[20000, SHOW SLAVE STATUS]]
138970  2021/12/28 10:23:32.178925 trace.go:37:         [ERROR]        LEADER[ID:master_ip:master_port, V:62, E:0].mysql.get.gtid.error[db.query.timeout[20000, SHOW SLAVE STATUS]]
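
To compare such excerpts across runs, the entries can be parsed mechanically. The sketch below is a hedged example: the regular expressions are derived only from the three lines above, not from Xenon's logging code, and the leading line number is treated as optional. It extracts the timestamp, source location, level, and the `db.query.timeout[...]` payload, then prints the gap between consecutive entries. The unit of the value 20000 is not stated in the log, so it is kept as a bare integer.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"^\s*(?:\d+\s+)?"                                   # optional viewer line number
    r"(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d+)\s+"     # timestamp
    r"(\S+):\s+"                                         # source, e.g. api.go:292
    r"\[(\w+)\]\s+"                                      # level
    r"(.*)$"                                             # message
)
TIMEOUT_RE = re.compile(r"db\.query\.timeout\[(\d+), ([^\]]+)\]")


def parse_line(line):
    """Parse one log line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line)
    if not m:
        return None
    ts, source, level, message = m.groups()
    t = TIMEOUT_RE.search(message)
    return {
        "time": datetime.strptime(ts, "%Y/%m/%d %H:%M:%S.%f"),
        "source": source,
        "level": level,
        "timeout": int(t.group(1)) if t else None,  # unit not stated in the log
        "query": t.group(2) if t else None,
    }


LOG = """
138968  2021/12/28 10:17:10.525831 api.go:292:          [ERROR]        mysql.get.master.gtid.error[db.query.timeout[20000, SHOW MASTER STATUS]]
138969  2021/12/28 10:17:15.485798 api.go:282:          [ERROR]        mysql.get.slave.gtid.error[db.query.timeout[20000, SHOW SLAVE STATUS]]
138970  2021/12/28 10:23:32.178925 trace.go:37:         [ERROR]        LEADER[ID:master_ip:master_port, V:62, E:0].mysql.get.gtid.error[db.query.timeout[20000, SHOW SLAVE STATUS]]
"""

entries = [e for e in map(parse_line, LOG.splitlines()) if e]
gaps = [(b["time"] - a["time"]).total_seconds()
        for a, b in zip(entries, entries[1:])]
for entry in entries:
    print(entry["time"], entry["source"], entry["query"])
print(gaps)
```

The second gap is a little over six minutes. Because Xenon's own log shared the disk in this run, that silence may itself be a trace of the hang, though this is only an inference from the timestamps.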

@xiaozhimengmengda

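A question about "disk reads and writes hang without reporting an error (commands such as ls and touch block)": was this done with the pod mounting NeonSAN directly?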
@ZhengHG123
