Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #2146 E112 EHOSTDOWN in short and pooled connection #2177

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

Huixxi
Copy link
Contributor

@Huixxi Huixxi commented Mar 25, 2023

What problem does this PR solve?

Issue Number:

Problem Summary:
以下引起E112错误之后无法自行恢复的几种场景:

  1. 网络抖动后遗症,一直出现E112无法自行恢复
    配置dns访问外网,dns只解析出一个ip,然后网络抖动过程中出现少量连接失败,返回错误码为ECONNREFUSED或ENETUNREACH或EHOSTUNREACH或EINVAL,满足does_error_affect_main_socket从而触发main socket SetFailed。
    但是正常来说,SetFailed之后过100ms就开始健康检查,每3秒重试一次。所以按理当网格抖动恢复后,最长3秒后,就不应该再出现E112了。但是实际上仍有E112持续出现,如果重启进程,则不会再出现。
  2. 以single server模式初始化channel,第一次连接失败之后socket被SetFailed,后面重试就都选不出服务了。
  3. 出现一个ETIMEOUT之后,就一直出现EHOSTDOWN,没有恢复。

What is changed and the side effects?

Changed:
目前针对各种原因导致的E112错误,提供一套通用的解决方案:
一、防止实例封禁导致的E112:在对实例进行封禁的时候,加上是否为唯一可用实例判断

  1. 如果是SingleServer模式(即直接以ip+port方式初始化Channel,不通过NamingService+LB),则只有唯一一个实例,所以不做封禁
  2. 如果是集群模式(NamingService+LB方式初始化Channel),则尝试以排除当前实例的条件下SelectServer,如果这种条件下选不出实例,或者选出实例仍为当前实例,则说明当前实例为LB内部唯一可用的实例,在这种情况下不做封禁
    二、防止重试去重或LB调权导致的E112:在选择实例的时候,增加一次额外的尝试
  3. 即使LB内部存在可用的实例,也有可能因为重试去重或LB调权的原因无法选中。解决方法为在选择实例的时候,如果返回E112,则改变选择条件:excluded_server=NULL, changable_weights=false,以这种条件重新选择,如果LB内部存在可用实例,则一定能够选择出实例。

注意:以上方案仅对连接池和短连接模式下生效,单连接的情况下,main socket和rpc实际通信的socket是同一个,通信过程可能有各种原因导致socket被SetFail,从而导致main socket失效,无法被LB选择。解决该问题的根本办法是将单连接的main socket和rpc通信的socket进行拆分(有点类似于只有一个连接的连接池),当rpc通信socket SetFailed之后,可以从main socket新建新的socket进行连接,但这个方案对brpc的改动较大。

Side effects:

  • Performance effects(性能影响):

  • Breaking backward compatibility(向后兼容性):


Check List:

  • Please make sure your changes are compilable(请确保你的更改可以通过编译).
  • When providing us with a new feature, it is best to add related tests(如果你向我们增加一个新的功能, 请添加相关测试).
  • Please follow Contributor Covenant Code of Conduct.(请遵循贡献者准则).

@@ -1824,6 +1824,39 @@ class ChannelTest : public ::testing::Test{
MyEchoService _svc;
};

void TestBlockServer(bool single_server, bool short_connection, const char* lb) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestBlockServer 应该 是ChannelTest 的成员函数。

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestBlockServer 应该 是ChannelTest 的成员函数。

哦哦 知道了

src/brpc/controller.cpp Show resolved Hide resolved
@@ -719,6 +719,37 @@ inline bool does_error_affect_main_socket(int error_code) {
error_code == EINVAL/*returned by connect "0.0.0.1"*/;
}

inline void maybe_block_server(int error_code, Controller* cntl, SharedLoadBalancer* lb, SocketId sock) {
if (!does_error_affect_main_socket(error_code)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

连接超时的情况可以优化吗?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我这周再研究研究。。

int rc = _lb->SelectServer(sel_in, &sel_out);
if (rc == EHOSTDOWN) {
// If no server is available, include accessed server and try to SelectServer again
sel_in.excluded = NULL;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这样是不是就会选到期望要excluded的实例呢?这样合理吗?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

目前excluded的实例是在之前的重试中选过的实例,如果返回EHOSTDOWN说明已经没有实例可选,RPC必然失败,那就退而求其次,选择excluded中的实例,这样还有成功的可能

@wwbmmm
Copy link
Contributor

wwbmmm commented Mar 28, 2023

CI单测失败了

@Huixxi
Copy link
Contributor Author

Huixxi commented Mar 28, 2023

CI单测失败了

嗯嗯。。 我先在本地调试一下。

@@ -608,11 +608,11 @@ TEST_F(SocketTest, app_level_health_check) {

for (int i = 0; i < 4; ++i) {
// although ::connect would succeed, the stall in hc_service makes
// the health check rpc fail.
// the health check rpc overtime but not fail.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的备注需要修改一下

@@ -1070,8 +1074,9 @@ TEST_F(LoadBalancerTest, revived_from_all_failed_intergrated) {
// all servers are down, the very first call that trigger recovering would
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的备注需要修改一下

@Huixxi Huixxi added the bug the code does not work as expected label Apr 10, 2023
(sending_sock == NULL || sending_sock->id() != peer_id)) {
Socket::SetFailed(peer_id);
maybe_block_server(error_code, c, c->_lb.get(), peer_id);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

实现是"block socket"不是block server". 或许叫checkAndSetFailed更明确一点?

// main socket should die as well
// NOTE: main socket may be wrongly set failed (provided that
// short/pooled socket does not hold a ref of the main socket).
// E.g. a in-parallel RPC sets the peer_id to be failed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'an'

@@ -719,6 +719,37 @@ inline bool does_error_affect_main_socket(int error_code) {
error_code == EINVAL/*returned by connect "0.0.0.1"*/;
}

inline void maybe_block_server(int error_code, Controller* cntl, SharedLoadBalancer* lb, SocketId sock) {
if (!does_error_affect_main_socket(error_code)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

为什么要把does_error_affect_main_socket放到这个函数里?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

因为这个函数最终的动作是SetFailed main socket,所以要先检查一下错误号是否满足影响main socket 的条件

@chenBright
Copy link
Contributor

@Huixxi @wwbmmm 是不是有关于lb E61的优化,可以提个PR吗?

@wwbmmm
Copy link
Contributor

wwbmmm commented Apr 26, 2023

@Huixxi @wwbmmm 是不是有关于lb E61的优化,可以提个PR吗?

有,那个是另一个问题,有空提下

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug the code does not work as expected
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants