In AT mode, after a phase-two rollback fails, the seata-server TC's scheduled rollback-retry task does not run for a long time #6478

Open
1 task
nakerny opened this issue Apr 15, 2024 · 20 comments · May be fixed by #6499

nakerny commented Apr 15, 2024

  • I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

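In AT mode, when a global transaction fails its phase-two rollback, the rollback is not retried immediately. The retry log only appears, and the rollback only succeeds, about 2 minutes later.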

Ⅱ. Describe what happened

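After analyzing the business flow and debugging the code, I found the following:
1. When the system has a large number of global transactions to roll back (caused by a business logic problem, roughly 600 within 2 minutes), their records in global_table all stay in the Rollbacking status (status=4) and are only deleted about 2 minutes later, even though these global transactions have in fact already completed their phase-two rollback.
2. The cause is as follows:
a. When a business exception triggers a global rollback, the server-side call path is
org.apache.seata.server.coordinator.DefaultCore#rollback --> org.apache.seata.server.coordinator.DefaultCore#doGlobalRollback

image
b. After all branch transactions have rolled back successfully, the global transaction is not assigned a final status at this point. On entering the endRollbacked method, the global transaction status is not changed either, because retryGlobal=false.
image
image
c. As a result, global_table holds a large number of records in the Rollbacking status (status=4), when their actual status should be GlobalStatus.Rollbacked.

3. Impact:
a. In the situation above, suppose a global transaction rolls back because of a business exception, but one of its branch transactions throws an exception during phase-two rollback, so the global rollback fails. The failure is reported to the seata-server TC successfully, and the status of the database record changes to GlobalStatus.RollbackRetrying.

image (phase-two rollback failure)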

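b. The server-side retry flow for a failed rollback is as follows:
     1) Under the default configuration, every 1 s the task fetches from the database the records whose status is one of GlobalStatus.TimeoutRollbacking,
        GlobalStatus.TimeoutRollbackRetrying, GlobalStatus.RollbackRetrying, or GlobalStatus.Rollbacking.
     2) The default store.db.queryLimit=100 means each run of the scheduled task can fetch at most 100 records from global_table.
     3) By default a session counts as a DeadSession after 2 minutes 10 seconds (which is why Rollbacking records are only deleted after about 2 minutes).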

image

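     4) global_table holds a large number of stale Rollbacking records (more than 100), each run of the scheduled task can fetch only 100 records, and each of those 100 records is deleted only once the DeadSession check fires after 2 minutes 10 seconds. The record whose phase-two rollback failed is therefore also not retried until about 2 minutes later. In the meantime, the global locks held by that transaction and the inconsistent rows left in the business database both affect the business.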
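
The starvation described in steps 1) to 4) can be reproduced with a small in-memory model. This is only a sketch under stated assumptions, not Seata code: the class `RetryQueueModel`, its `Session` type, and the assumption that the query returns rows in insertion order are all hypothetical. Only the limit of 100, the 1 s polling period, and the 2 minute 10 second dead-session threshold are taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of the retry task; not Seata code. */
final class RetryQueueModel {
    static final int QUERY_LIMIT = 100;            // store.db.queryLimit default (from the issue)
    static final long POLL_INTERVAL_MS = 1_000L;   // retry task period (from the issue)
    static final long DEAD_SESSION_MS = 130_000L;  // 2 min 10 s (from the issue)

    static final class Session {
        final String status;
        final long beginMs;

        Session(String status, long beginMs) {
            this.status = status;
            this.beginMs = beginMs;
        }
    }

    /** Builds n finished-but-still-Rollbacking rows followed by one failed rollback. */
    static List<Session> tableWith(int rollbackingRows) {
        List<Session> table = new ArrayList<>();
        for (int i = 0; i < rollbackingRows; i++) {
            table.add(new Session("Rollbacking", 0L));
        }
        table.add(new Session("RollbackRetrying", 0L));
        return table;
    }

    /**
     * Polls once per interval and returns the simulated time (ms) at which a
     * RollbackRetrying row first appears in a fetched page, or -1 if none exists.
     * Assumes rows are returned in insertion order (an assumption of this model).
     */
    static long firstRetryVisibleAt(List<Session> table) {
        List<Session> rows = new ArrayList<>(table);
        for (long now = 0L; !rows.isEmpty(); now += POLL_INTERVAL_MS) {
            int pageSize = Math.min(QUERY_LIMIT, rows.size());
            List<Session> page = new ArrayList<>(rows.subList(0, pageSize));
            for (Session s : page) {
                if (s.status.equals("RollbackRetrying")) {
                    return now; // the failed rollback finally gets its retry
                }
                if (now - s.beginMs > DEAD_SESSION_MS) {
                    rows.remove(s); // dead Rollbacking row is cleaned up at last
                }
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        System.out.println(firstRetryVisibleAt(tableWith(50)));   // fits in one page
        System.out.println(firstRetryVisibleAt(tableWith(150)));  // page is full of Rollbacking rows
    }
}
```

In this model, 50 stale rows leave room in the first page and the retry is seen at time 0, while 150 stale rows keep the retry out of every page until the first 100 rows pass the dead-session threshold, so it is first seen at 132000 ms.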

image

Ⅲ. Describe what you expected to happen

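After the first global rollback succeeds, the transaction needs a final status (GlobalStatus.Rollbacked), and the record should be updated to it.
The only approach I can think of so far is to use a counter to track how many branch transactions have been rolled back and to change the global transaction status once all of them are done, but that is not very sound either.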
image

Ⅵ. Environment:

  • JDK version(e.g. java -version):1.8.0_341
  • Seata client/server version: v2.0.0
funky-eyes (Contributor) commented Apr 15, 2024

If a rollback fails during synchronous rollback, the status is reset to RollbackRetrying and the retry runs automatically after 1 s.
During synchronous rollback the Rollbacking status is deliberately left unchanged, to reduce network I/O and disk I/O in db and redis modes. The status change and the deletion happen later in an asynchronous thread, after 2 minutes 10 seconds. Such a globalsession no longer has any branchsession under it, so it needs no attention.

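nakerny (Author) commented Apr 15, 2024

The Rollbacking transactions themselves are fine. But if global_table holds too many records in this status (more than 100) and a transaction then fails its phase-two rollback and enters RollbackRetrying, it has to wait about 2 minutes, until those records are deleted, before it is processed.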

nakerny (Author) commented Apr 15, 2024

The reason is the default store.db.queryLimit=100, which limits each run to 100 records.

nakerny (Author) commented Apr 15, 2024

@funky-eyes

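nakerny (Author) commented Apr 15, 2024

@funky-eyes This setting can be increased, and too many rollbacks within two minutes is itself a business anomaly, but it is still not reasonable for rollback retries to depend on how many records are in this status.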

funky-eyes (Contributor) commented

1. Increase the value of store.db.queryLimit.
2. Perhaps we should give rollbacking and committing their own processing thread pools, so that transactions that only appear to need a retry are not mixed with those that actually need one?
What do you think?
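
To make option 2 concrete, here is a minimal sketch of splitting the work into two independently scheduled tasks, each with its own query and its own limit. It is not Seata's implementation: `SplitRetrySketch`, `Row`, `query`, and `start` are hypothetical names, the 10 s cleanup period is an arbitrary placeholder, and grouping TimeoutRollbacking with the retry statuses is an assumption. Only the rollback side is shown; committing would be split the same way.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of option 2; not Seata code. */
final class SplitRetrySketch {
    // Rows that only look like they need work (usually just awaiting deletion).
    static final Set<String> SUSPECTED = new HashSet<>(Arrays.asList("Rollbacking"));
    // Rows that really need a retry (grouping is an assumption of this sketch).
    static final Set<String> NEEDS_RETRY = new HashSet<>(Arrays.asList(
            "RollbackRetrying", "TimeoutRollbacking", "TimeoutRollbackRetrying"));

    static final class Row {
        final String xid;
        final String status;

        Row(String xid, String status) {
            this.xid = xid;
            this.status = status;
        }
    }

    /** Stand-in for a per-status-group query with its own LIMIT. */
    static List<Row> query(List<Row> table, Set<String> statuses, int limit) {
        List<Row> page = new ArrayList<>();
        for (Row r : table) {
            if (page.size() == limit) {
                break;
            }
            if (statuses.contains(r.status)) {
                page.add(r);
            }
        }
        return page;
    }

    /** Runs the two tasks on separate single-thread pools so neither can starve the other. */
    static List<ScheduledExecutorService> start(Runnable retryTask, Runnable cleanupTask) {
        ScheduledExecutorService retryPool = Executors.newSingleThreadScheduledExecutor();
        ScheduledExecutorService cleanupPool = Executors.newSingleThreadScheduledExecutor();
        retryPool.scheduleAtFixedRate(retryTask, 0, 1, TimeUnit.SECONDS);
        cleanupPool.scheduleAtFixedRate(cleanupTask, 0, 10, TimeUnit.SECONDS); // placeholder period
        return Arrays.asList(retryPool, cleanupPool);
    }
}
```

Because the retry query no longer selects Rollbacking rows, a backlog of them cannot fill its page. The caller is responsible for shutting down the returned executors.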

nakerny (Author) commented Apr 15, 2024

Raising queryLimit cannot fully solve this problem. Separate thread pools for rollbacking and committing, on the other hand, would keep the benefit of the earlier I/O optimization and also solve this problem.

funky-eyes (Contributor) commented

There is another option: sort by status. The committing status is 2, and every status that needs a retry has a larger value; the same holds for rollbacking. With a change to the SQL, the transactions that really need to be rolled back can be fetched first.
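
The ordering idea can be illustrated in memory. This is a hedged sketch rather than the actual query: `StatusOrderingSketch`, `Row`, and `fetchPage` are hypothetical, the tie-break on modification time is my own addition, and the status code 5 for RollbackRetrying is an assumption that should be verified against Seata's GlobalStatus enum (the issue itself only states 4 for Rollbacking and 2 for committing).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch of "sort by status"; not Seata code. */
final class StatusOrderingSketch {
    static final int ROLLBACKING = 4;        // stated in the issue
    static final int ROLLBACK_RETRYING = 5;  // assumed code; verify against GlobalStatus

    static final class Row {
        final String xid;
        final int status;
        final long gmtModified;

        Row(String xid, int status, long gmtModified) {
            this.xid = xid;
            this.status = status;
            this.gmtModified = gmtModified;
        }
    }

    /**
     * In-memory equivalent of ordering by status descending, then by
     * modification time ascending, and taking the first `limit` rows.
     */
    static List<Row> fetchPage(List<Row> table, int limit) {
        List<Row> sorted = new ArrayList<>(table);
        sorted.sort(Comparator.comparingInt((Row r) -> r.status).reversed()
                .thenComparingLong(r -> r.gmtModified));
        return new ArrayList<>(sorted.subList(0, Math.min(limit, sorted.size())));
    }
}
```

With this ordering, a single RollbackRetrying row lands at the head of the page even when more than 100 Rollbacking rows were inserted before it. The cost is that the database has to sort on every poll.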

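nakerny (Author) commented Apr 15, 2024

That would solve the problem and is easy to implement, but the logical isolation is weaker than with a separate thread pool. Also, in DB mode high-frequency global transactions already put a heavy load on the database, and a high-frequency sorted query is not ideal either. A separate thread could additionally lower the database query frequency.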

funky-eyes (Contributor) commented

Option 3 would have a performance impact on MySQL.

For option 2, once the handling is separated, I thought of a better approach that avoids querying the database every 1 s. Most of these rollbacking records only need to be deleted, and only after the 2 minute 10 second wait, so querying every 1 s puts a heavy load on the database and the fetched rows are of no use yet.

Instead of scheduleAtFixedRate, use schedule. Suppose a run fetches 100 records but the first one does not need processing until 2 minutes later; then the next query is scheduled 2 minutes out. Why? Because global sessions are inserted in order: if the first record is not yet due for deletion, the later ones cannot be either, so there is no point in blindly querying every 1 s. If a query returns no records at all, the next run is scheduled 2 minutes 10 seconds later. This greatly reduces the number of queries and the amount of data fetched. What does everyone think?
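
The self-rescheduling idea can be sketched as follows. This is an illustration under assumptions, not the code that was eventually merged: `AdaptiveRollbackingScan`, `nextDelayMs`, and `scheduleScan` are hypothetical names, `scanOnce` stands in for the real query and is assumed to return the begin time of the oldest remaining record (or null when nothing is found), and the 1 s lower bound on the delay is my own choice.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical sketch of replacing scheduleAtFixedRate with schedule; not Seata code. */
final class AdaptiveRollbackingScan {
    static final long DEAD_SESSION_MS = 130_000L; // 2 min 10 s (from the issue)
    static final long MIN_DELAY_MS = 1_000L;      // lower bound chosen for this sketch

    /**
     * Delay until the next scan. With no rows, wait a full dead-session period;
     * otherwise wait until the oldest row becomes due, because rows inserted
     * later cannot become due earlier.
     */
    static long nextDelayMs(Long oldestBeginMs, long nowMs) {
        if (oldestBeginMs == null) {
            return DEAD_SESSION_MS;
        }
        long dueIn = oldestBeginMs + DEAD_SESSION_MS - nowMs;
        return Math.max(MIN_DELAY_MS, dueIn);
    }

    /**
     * Runs one scan after delayMs, then schedules the next one itself.
     * Note: if scanOnce throws, the chain stops; real code would guard against that.
     */
    static void scheduleScan(ScheduledExecutorService pool, Supplier<Long> scanOnce, long delayMs) {
        pool.schedule(() -> {
            Long oldestBeginMs = scanOnce.get();
            long next = nextDelayMs(oldestBeginMs, System.currentTimeMillis());
            scheduleScan(pool, scanOnce, next);
        }, delayMs, TimeUnit.MILLISECONDS);
    }
}
```

For example, if the oldest record began 10 s ago, `nextDelayMs` returns 120000, which matches the "schedule the next query 2 minutes out" case described above.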

nakerny (Author) commented Apr 15, 2024

The logic is somewhat complex and hard to control. Also, backing off to a 2-minute interval works for low-frequency rollbacks, but low-frequency rollbacks do not trigger the problem above in the first place, and under high-frequency rollbacks the back-off is unlikely to kick in. For example, the task may back off to 2 minutes during off-peak hours, yet the problem can still be triggered at peak.
This is hard to decide. It may be simpler to make the execution frequency of the separate thread configurable, so users can tune it to their own workload.

nakerny (Author) commented Apr 15, 2024

Also, can all records in the Rollbacking status be deleted directly? Could a Rollbacking record change to a Retrying status a few seconds later?

nakerny (Author) commented Apr 15, 2024

In that case, flexible configuration would not work either.

nakerny (Author) commented Apr 15, 2024

I still think a final status is needed. Any scheme based on an intermediate status plus elapsed time is not very sound.

funky-eyes (Contributor) commented

Adding another intermediate status would reduce performance. If you switch to raft mode, this problem does not occur. This handling applies only to the modes that separate storage and compute, db and redis.

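funky-eyes (Contributor) commented

Quoting nakerny: "Also, can all records in the Rollbacking status be deleted directly? Could a Rollbacking record change to a Retrying status a few seconds later?"

The status changes to Retrying only when an exception occurs. Rollbacking means the TM has sent its decision to the TC, the TC has set the status to Rollbacking and started rolling back, and the TM is waiting for the TC to finish dispatching phase two before it gets a response.
At this point the TM is waiting synchronously for the TC's result. A transaction is essentially only put into the Retrying status when an exception occurs during the TC's synchronous processing. A Rollbacking record picked up by the asynchronous thread will almost never become Retrying.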

nakerny (Author) commented Apr 16, 2024

OK, I will try raft mode first. As for this problem in the separated-storage modes, do you have any other suggestions?

funky-eyes (Contributor) commented

Storage-compute separation will not be a core focus for the community going forward, and we may consider handling this on multi-raft in the future. So problems caused by storage-compute separation will not get much effort. In general we will add fallback measures and optimize the processing logic, but major architectural changes are unlikely.

nakerny (Author) commented Apr 16, 2024

Understood.

liuqiufeng self-assigned this Apr 19, 2024

liuqiufeng (Contributor) commented

I will optimize it using option 2
