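In elasticjob-lite 2.X and 3.X, leader election uses startsWith when matching the IPs of live instances under instances against the IPs under servers, which can send the election into an infinite loop under certain abnormal conditions #2226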

Open
McdullFei opened this issue May 31, 2023 · 4 comments · May be fixed by #2227

McdullFei commented May 31, 2023

Bug Report

In the 2.X and 3.X versions of elasticjob-lite, the leader election logic iterates over the live instances under instances and uses startsWith to match their IPs against the IPs under servers. Under certain abnormal conditions this sends the election into an infinite loop, which in turn causes abnormal CPU load on ZooKeeper.

Version in use: elasticjob-lite 2.1.5. The elasticjob-lite 3.0.3 source code has the same problem.

ElasticJob-Lite

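Final conclusion on the scenario that produced the production bug:

  1. For some historical reason, a job with disabled=true had some IP nodes under its server node registered in ZooKeeper that were never marked DISABLED (why they were not marked can no longer be traced).
  2. A production deployment created an instance with IP 172.28.3.87 (which created the node 172.28.3.87@-@1 under the job's instances node in ZooKeeper). It so happened that servers contained a node named 172.28.3.8 that was not marked DISABLED; that node was created in 2020.
  3. Working backwards through the code, we found:
    com.dangdang.ddframe.job.lite.internal.election.LeaderService#isLeaderUntilBlock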
public boolean isLeaderUntilBlock() {
    while (!hasLeader() && serverService.hasAvailableServers()) {
        log.info("Leader is electing, waiting for {} ms", 100);
        BlockUtils.waitingShortTime();
        if (!JobRegistry.getInstance().isShutdown(jobName) && serverService.isAvailableServer(JobRegistry.getInstance().getJobInstance(jobName).getIp())) {
            electLeader();
        }
    }
    return isLeader();
}

com.dangdang.ddframe.job.lite.internal.server.ServerService#hasAvailableServers

public boolean hasAvailableServers() {
    List<String> servers = jobNodeStorage.getJobNodeChildrenKeys(ServerNode.ROOT);
    for (String each : servers) {
        if (isAvailableServer(each)) {
            return true;
        }
    }
    return false;
}

public boolean isAvailableServer(final String ip) {
    return isEnableServer(ip) && hasOnlineInstances(ip);
}

private boolean hasOnlineInstances(final String ip) {
    for (String each : jobNodeStorage.getJobNodeChildrenKeys(InstanceNode.ROOT)) {
        if (each.startsWith(ip)) {
            return true;
        }
    }
    return false;
}

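The hasOnlineInstances method called by ServerService#hasAvailableServers uses startsWith to match the nodes of the live instances under instances against the names of the server nodes that are not marked DISABLED, so "172.28.3.87@-@1".startsWith("172.28.3.8") returns true.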
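The false positive can be reproduced without ZooKeeper. The sketch below copies only the prefix check from hasOnlineInstances; the class name PrefixMatchDemo and the in-memory list of instance node names are illustrative assumptions, not part of elasticjob-lite.

```java
import java.util.Arrays;
import java.util.List;

final class PrefixMatchDemo {

    // Same shape as the quoted check: an instance node counts as belonging
    // to a server if its name merely starts with the server's IP.
    static boolean hasOnlineInstancesByPrefix(String serverIp, List<String> instanceNodes) {
        for (String each : instanceNodes) {
            if (each.startsWith(serverIp)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the children of the instances node.
        List<String> instanceNodes = Arrays.asList("172.28.3.87@-@1");
        // The stale server 172.28.3.8 has no live instance, yet it is reported as online.
        System.out.println(hasOnlineInstancesByPrefix("172.28.3.8", instanceNodes));
        // An unrelated IP is correctly reported as offline.
        System.out.println(hasOnlineInstancesByPrefix("172.28.3.9", instanceNodes));
    }
}
```

Any server IP that is a textual prefix of a live instance's IP is affected, for example 10.0.0.1 against 10.0.0.12.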

  1. If the nodes under server are not properly marked DISABLED after the job is disabled, the chance of triggering this bug becomes high.
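One reading of why this becomes an endless loop, sketched as a small in-memory model: the new instance's own server node is DISABLED, so it never calls electLeader(), while the stale enabled node 172.28.3.8 keeps hasAvailableServers() true through the prefix match. The class ElectionLoopModel, its map of server flags, and the iteration cap are assumptions made for illustration; the real code reads these values from ZooKeeper and has no cap.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ElectionLoopModel {

    // Hypothetical stand-ins for the servers node (IP -> marked DISABLED?) and the instances node.
    private final Map<String, Boolean> serverDisabled = new LinkedHashMap<>();
    private final List<String> instanceNodes;
    private boolean leaderElected;

    ElectionLoopModel(List<String> instanceNodes) {
        this.instanceNodes = instanceNodes;
    }

    ElectionLoopModel addServer(String ip, boolean disabled) {
        serverDisabled.put(ip, disabled);
        return this;
    }

    boolean hasOnlineInstances(String ip) {
        for (String each : instanceNodes) {
            if (each.startsWith(ip)) { // the prefix match under discussion
                return true;
            }
        }
        return false;
    }

    boolean isAvailableServer(String ip) {
        // Enabled here means: registered under servers and not marked DISABLED.
        boolean enabled = Boolean.FALSE.equals(serverDisabled.get(ip));
        return enabled && hasOnlineInstances(ip);
    }

    boolean hasAvailableServers() {
        for (String each : serverDisabled.keySet()) {
            if (isAvailableServer(each)) {
                return true;
            }
        }
        return false;
    }

    // Same loop shape as isLeaderUntilBlock, capped so that the demo terminates.
    int countIterations(String localIp, int maxIterations) {
        int iterations = 0;
        while (!leaderElected && hasAvailableServers() && iterations < maxIterations) {
            iterations++;
            if (isAvailableServer(localIp)) {
                leaderElected = true; // stands in for electLeader()
            }
        }
        return iterations;
    }

    public static void main(String[] args) {
        ElectionLoopModel model = new ElectionLoopModel(Arrays.asList("172.28.3.87@-@1"))
                .addServer("172.28.3.8", false)   // stale node, never marked DISABLED
                .addServer("172.28.3.87", true);  // new instance, job is disabled
        // The loop only stops because of the cap.
        System.out.println(model.countIterations("172.28.3.87", 50));
    }
}
```

In this model the loop exits immediately once the stale node is also marked DISABLED, which matches the observation that the unmarked node is what makes the bug reachable.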

Please help fix the ServerService#hasOnlineInstances method: extract the IP that precedes the @-@ delimiter and compare it for an exact, full-string match.
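A minimal sketch of the requested change, assuming instance node names always have the form ip@-@suffix as in the example above. The class ExactIpMatch and the helper ipOf are hypothetical names used for illustration, and this is not necessarily how the linked pull request implements the fix.

```java
import java.util.Arrays;
import java.util.List;

final class ExactIpMatch {

    // Delimiter between the IP and the rest of the instance node name, as seen in 172.28.3.87@-@1.
    private static final String DELIMITER = "@-@";

    // Returns the part of the node name before the delimiter, or the whole name if there is none.
    static String ipOf(String instanceNode) {
        int index = instanceNode.indexOf(DELIMITER);
        return index < 0 ? instanceNode : instanceNode.substring(0, index);
    }

    // Exact comparison on the extracted IP instead of a prefix match on the whole node name.
    static boolean hasOnlineInstances(String serverIp, List<String> instanceNodes) {
        for (String each : instanceNodes) {
            if (serverIp.equals(ipOf(each))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> instanceNodes = Arrays.asList("172.28.3.87@-@1");
        System.out.println(hasOnlineInstances("172.28.3.8", instanceNodes));  // stale server no longer matches
        System.out.println(hasOnlineInstances("172.28.3.87", instanceNodes)); // the real owner still matches
    }
}
```

An equivalent alternative would be to keep startsWith but test against the IP with the delimiter appended, so that 172.28.3.8@-@ can no longer match 172.28.3.87@-@1.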

Member

TeslaCN commented May 31, 2023

Hi @McdullFei Thanks for your feedback. Could you submit a PR to fix this?

@McdullFei
Author

Testing shows that the nodes that already existed under server before the job was disabled are not marked DISABLED.

McdullFei linked a pull request on Jun 1, 2023 that will close this issue
@McdullFei
Author

> Hi @McdullFei Thanks for your feedback. Could you submit a PR to fix this?

#2227

@SuperCarrys

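If a job pod sets disabled=true when it newly starts, the job should change the status of all existing IP nodes under servers to DISABLED, rather than marking only the new IP as DISABLED. Alternatively, the IP nodes under servers could be changed from persistent nodes to ephemeral nodes, so that the IP information is deleted as soon as the old pod goes offline.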
