AT mode: after a phase-two rollback fails, the rollback-retry scheduled task on the seata-server TC does not run for a long time #6478

nakerny · 2024-04-15T02:33:00Z

I have searched the issues of this repository and believe that this is not a duplicate.

Ⅰ. Issue Description

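In AT mode, when a global transaction fails its phase-two rollback, the rollback is not retried immediately. The retry log only appears about 2 minutes later, at which point the rollback succeeds.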

Ⅱ. Describe what happened

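After analyzing the business flow and debugging the code, I found the following:
1. When the system has a large number of global transactions to roll back (a business logic problem, roughly 600 within 2 minutes), their records in global_table all stay in the Rollbacking state (status=4) and are only deleted about 2 minutes later, even though these global transactions have in fact already completed their phase-two rollback.
2. The cause is as follows:
a. A business exception triggers a global rollback. On the server, the call flow is
org.apache.seata.server.coordinator.DefaultCore#rollback-->org.apache.seata.server.coordinator.DefaultCore#doGlobalRollback

b. After all branch transactions have rolled back successfully, the global transaction is not given a final state at this point. On entering the endRollbacked method, because retryGlobal=false, the global transaction status is not changed there either.


c. As a result, global_table holds a large number of records in the Rollbacking state (status=4), whereas their actual state should be GlobalStatus.Rollbacked.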

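3. Impact:
a. In this situation, suppose a global transaction rolls back because of a business exception, but one of its branch transactions throws during phase-two rollback, so the global rollback fails. The failure is reported to the seata-server TC successfully and the database record changes to GlobalStatus.RollbackRetrying.

b. The server-side retry flow for failed rollbacks is:
     1) With the default configuration, every 1 s it fetches from the database the records in the four states GlobalStatus.TimeoutRollbacking,
        GlobalStatus.TimeoutRollbackRetrying, GlobalStatus.RollbackRetrying and GlobalStatus.Rollbacking.
     2) The default store.db.queryLimit=100 means each run of the scheduled task can fetch only 100 records from global_table.
     3) The default DeadSession time is 2 minutes 10 seconds (which is why Rollbacking records are only deleted after about 2 minutes).

     4) global_table contains a large number (more than 100) of abnormal Rollbacking records, each run of the scheduled task can fetch only 100 records, and each of those 100 is only deleted once the DeadSession condition triggers after 2 minutes 10 seconds. So the record whose phase-two rollback failed is also not retried until about 2 minutes later. (The global locks held by that transaction and the inconsistent rows in the business database both affect the business.)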
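
The starvation described in step 4) can be reproduced with a toy model. This is only a sketch, not Seata code: the `Row` class, the `scan` method and the assumption that the scan returns rows in insertion order are all hypothetical, and the status code 5 for RollbackRetrying is assumed (only Rollbacking = 4 is stated in this issue).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the retry scan; hypothetical names, not Seata code. */
class RetryScanStarvationDemo {

    // Rollbacking = 4 is stated in the issue; RollbackRetrying = 5 is an assumption.
    static final int ROLLBACKING = 4;
    static final int ROLLBACK_RETRYING = 5;

    /** Hypothetical row of global_table with only the fields the demo needs. */
    static final class Row {
        final String xid;
        final int status;

        Row(String xid, int status) {
            this.xid = xid;
            this.status = status;
        }
    }

    /** Mimics "fetch at most queryLimit rows", assuming insertion order. */
    static List<Row> scan(List<Row> table, int queryLimit) {
        return new ArrayList<>(table.subList(0, Math.min(queryLimit, table.size())));
    }

    static boolean containsStatus(List<Row> batch, int status) {
        for (Row row : batch) {
            if (row.status == status) {
                return true;
            }
        }
        return false;
    }

    /** Builds `stale` finished-but-still-Rollbacking rows followed by one row that really needs a retry. */
    static List<Row> buildTable(int stale) {
        List<Row> table = new ArrayList<>();
        for (int i = 0; i < stale; i++) {
            table.add(new Row("xid-" + i, ROLLBACKING));
        }
        table.add(new Row("xid-failed", ROLLBACK_RETRYING));
        return table;
    }

    public static void main(String[] args) {
        List<Row> table = buildTable(600);
        System.out.println("limit=100 sees retry row: "
                + containsStatus(scan(table, 100), ROLLBACK_RETRYING));
        System.out.println("limit=1000 sees retry row: "
                + containsStatus(scan(table, 1000), ROLLBACK_RETRYING));
    }
}
```

With 600 stale rows ahead of it, the failed transaction never appears in a batch of 100 until the stale rows are deleted, which matches the roughly 2-minute delay reported above.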

Ⅲ. Describe what you expected to happen

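After the first global rollback succeeds, the transaction needs a final state (GlobalStatus.Rollbacked), and the record should be updated to it.
The only approach I can think of for now is a counter that tracks the number of rolled-back branch transactions and changes the global status once all branches are done, but that is not very sound either.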
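
The counter idea could look roughly like the sketch below. Everything here is hypothetical (class name, method names, the two-value `Status` enum); it only illustrates "decrement per rolled-back branch, switch to a final state when the count reaches zero" and does not reflect how Seata tracks branches.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical per-transaction counter; not part of Seata. */
class BranchRollbackCounter {

    enum Status { ROLLBACKING, ROLLBACKED }

    private final AtomicInteger remaining;
    private volatile Status status = Status.ROLLBACKING;

    BranchRollbackCounter(int branchCount) {
        if (branchCount <= 0) {
            throw new IllegalArgumentException("branchCount must be positive");
        }
        this.remaining = new AtomicInteger(branchCount);
    }

    /**
     * Call once for each branch that rolled back successfully.
     * Returns true only for the call that completes the global rollback.
     */
    boolean onBranchRolledBack() {
        if (remaining.decrementAndGet() == 0) {
            // This is where the one extra write to the session store would happen.
            status = Status.ROLLBACKED;
            return true;
        }
        return false;
    }

    Status status() {
        return status;
    }
}
```

Note that the final state still costs one extra write per global transaction in db or redis mode.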

Ⅳ. Environment:

JDK version(e.g. java -version):1.8.0_341
Seata client/server version: v2.0.0

funky-eyes · 2024-04-15T02:38:25Z

If a rollback fails during synchronous rollback, the status is reset to RollbackRetrying and the rollback is retried automatically after 1 s.
During a synchronous rollback, the Rollbacking status is deliberately left unchanged in order to reduce network IO and disk IO in db and redis modes. The status is updated and the record deleted by an asynchronous thread after 2 minutes 10 seconds. Such a globalsession no longer has any branchsessions, so you do not need to worry about it.
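
The rule described here can be written down as two small decision functions. This is a sketch of the behaviour as stated in this comment, with hypothetical names; it is not the actual DefaultCore logic.

```java
/** Hypothetical summary of the stored state after a synchronous rollback; not Seata code. */
class SyncRollbackOutcome {

    enum GlobalState { ROLLBACKING, ROLLBACK_RETRYING }

    // 2 minutes 10 seconds, the default mentioned in this thread.
    static final long DEAD_SESSION_MS = 130_000L;

    /** On success the stored state stays Rollbacking (no extra write); on failure it becomes RollbackRetrying. */
    static GlobalState afterSyncRollback(boolean allBranchesRolledBack) {
        return allBranchesRolledBack ? GlobalState.ROLLBACKING : GlobalState.ROLLBACK_RETRYING;
    }

    /** The asynchronous thread only finalizes and deletes a Rollbacking record once it is old enough. */
    static boolean dueForCleanup(GlobalState state, long ageMs) {
        return state == GlobalState.ROLLBACKING && ageMs >= DEAD_SESSION_MS;
    }
}
```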

nakerny · 2024-04-15T02:40:53Z

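A transaction in the Rollbacking state is fine in itself. But if global_table holds too many records in this state, more than 100, and a RollbackRetrying transaction whose phase-two rollback failed shows up at that point, it has to wait about 2 minutes, until those records are deleted, before it is processed.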

nakerny · 2024-04-15T02:43:06Z

That is because of the default store.db.queryLimit=100, which limits each run to 100 records.

nakerny · 2024-04-15T02:46:00Z

@funky-eyes

nakerny · 2024-04-15T02:49:50Z

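@funky-eyes This setting can be increased, and having too many rolled-back transactions within two minutes is itself a business anomaly. Still, it is not very reasonable for rollback retries to be affected by the number of records in this state.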

funky-eyes · 2024-04-15T02:50:27Z

1. Increase the value of store.db.queryLimit.
2. Perhaps we should use separate processing thread pools for rollbacking and committing, so that transactions that only look like they need a retry are not mixed up with those that really do?
What do you think?
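
A minimal sketch of option 2, assuming each pool runs its own status-filtered scan so that each scan gets its own queryLimit budget. All names are hypothetical, and the status codes 5, 6 and 7 for RollbackRetrying, TimeoutRollbacking and TimeoutRollbackRetrying are assumptions (only Rollbacking = 4 is stated in this issue).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical split of the retry task into two independently scheduled scans; not Seata code. */
class SplitStatusScan {

    // Rollbacking = 4 is from the issue; 5, 6, 7 are assumed codes for the retry states.
    static final Set<Integer> CLEANUP_STATUSES = new HashSet<>(Arrays.asList(4));
    static final Set<Integer> RETRY_STATUSES = new HashSet<>(Arrays.asList(5, 6, 7));

    /** Returns at most queryLimit statuses from the table that belong to the wanted set. */
    static List<Integer> scan(List<Integer> statusesInInsertOrder, Set<Integer> wanted, int queryLimit) {
        List<Integer> batch = new ArrayList<>();
        for (Integer status : statusesInInsertOrder) {
            if (batch.size() == queryLimit) {
                break;
            }
            if (wanted.contains(status)) {
                batch.add(status);
            }
        }
        return batch;
    }

    private final ScheduledExecutorService retryPool = Executors.newSingleThreadScheduledExecutor();
    private final ScheduledExecutorService cleanupPool = Executors.newSingleThreadScheduledExecutor();

    /** The two tasks no longer compete for one batch and can run at different periods. */
    void start(Runnable retryTask, long retryPeriodMs, Runnable cleanupTask, long cleanupPeriodMs) {
        retryPool.scheduleAtFixedRate(retryTask, 0, retryPeriodMs, TimeUnit.MILLISECONDS);
        cleanupPool.scheduleAtFixedRate(cleanupTask, 0, cleanupPeriodMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        retryPool.shutdownNow();
        cleanupPool.shutdownNow();
    }
}
```

With the scans separated, 600 stale Rollbacking rows no longer hide a single RollbackRetrying row, and the cleanup scan can run far less often than once per second.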

nakerny · 2024-04-15T02:54:09Z

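Raising queryLimit does not fully solve the problem. Separate thread pools for rollbacking and committing, on the other hand, would keep the benefit of the earlier IO optimization and also solve this problem.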

funky-eyes · 2024-04-15T02:55:54Z

There is another way: sort by status. The Committing status is 2 and every status that needs a retry is larger than it, and the same holds for Rollbacking. With a change to the SQL, the transactions that really need to be rolled back can be fetched first.
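
The effect of sorting by status can be sketched in memory as below. It stands in for something like an `ORDER BY status DESC` with a row limit; the actual SQL Seata uses is not shown in this thread, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** In-memory stand-in for a status-ordered, limited query; not Seata code. */
class StatusOrderedScan {

    /** Returns up to queryLimit status codes, highest first. */
    static List<Integer> scan(List<Integer> statuses, int queryLimit) {
        List<Integer> sorted = new ArrayList<>(statuses);
        // List.sort is stable, so rows with the same status keep their insertion order.
        sorted.sort(Comparator.reverseOrder());
        return new ArrayList<>(sorted.subList(0, Math.min(queryLimit, sorted.size())));
    }
}
```

Because the retry states have larger codes than Rollbacking (4), a row that really needs a retry moves to the front of the batch even when hundreds of Rollbacking rows were inserted before it.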

nakerny · 2024-04-15T03:02:32Z

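That would solve the problem and is easy to do, but the logical isolation is not as good as a separate thread pool. Also, in DB mode high-frequency global transactions already put a heavy load on the database, and high-frequency sorted queries are not great either. With a separate thread, the database query frequency could also be lowered.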

funky-eyes · 2024-04-15T03:11:51Z

Option 3 has a performance impact on MySQL.
With option 2 in place, I thought of a better approach that does not query the database every 1 s. Most of these Rollbacking records only need to be deleted, and only after the 2 minute 10 second wait, so querying every 1 s puts a heavy load on the database and fetches records that are of no use yet.
Drop scheduleAtFixedRate and use schedule instead.
Suppose a scan returns 100 records but the first one is not due for another 2 minutes. Then the next scan is scheduled 2 minutes later. Why? Because globalsessions are inserted in order: if the first record has not reached its deletion time, the ones after it have not either, so there is no need to keep querying blindly every 1 s. If a scan returns no records at all, the next scan is scheduled 2 minutes 10 seconds later. This greatly reduces the number of queries and the amount of data fetched. What does everyone think?
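
The self-rescheduling idea might be sketched as follows. The names are hypothetical, the scan callback is assumed to process due records and return the begin time of the oldest remaining Rollbacking record (or -1 if none), and the 1 s minimum delay is an added assumption to avoid a tight loop when records are already due.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.LongSupplier;

/** Hypothetical adaptive scheduler built on schedule() instead of scheduleAtFixedRate(); not Seata code. */
class AdaptiveCleanupScheduler {

    static final long DEAD_SESSION_MS = 130_000L; // 2 min 10 s, the default mentioned in the thread
    static final long MIN_DELAY_MS = 1_000L;      // assumed floor, equal to the current 1 s period

    /**
     * Computes the delay before the next scan.
     *
     * @param oldestBeginMs begin time of the oldest pending record, or -1 if the scan found nothing
     * @param nowMs         current time in milliseconds
     */
    static long nextDelayMs(long oldestBeginMs, long nowMs) {
        if (oldestBeginMs < 0) {
            return DEAD_SESSION_MS; // nothing pending: a new record cannot be due any sooner
        }
        long dueAtMs = oldestBeginMs + DEAD_SESSION_MS;
        return Math.max(MIN_DELAY_MS, dueAtMs - nowMs);
    }

    private final ScheduledExecutorService pool = Executors.newSingleThreadScheduledExecutor();

    /** Runs one scan after delayMs, then re-arms itself with a delay derived from the scan result. */
    void scheduleNext(LongSupplier scanOldestBeginMs, long delayMs) {
        pool.schedule(() -> {
            long oldest = scanOldestBeginMs.getAsLong();
            if (!pool.isShutdown()) {
                scheduleNext(scanOldestBeginMs, nextDelayMs(oldest, System.currentTimeMillis()));
            }
        }, delayMs, TimeUnit.MILLISECONDS);
    }

    void stop() {
        pool.shutdownNow();
    }
}
```

One caveat of this shape: if the scan callback throws, the chain is not re-armed, so a real implementation would need to catch and reschedule.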

nakerny · 2024-04-15T03:47:46Z

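The logic is a bit complex and hard to control. Also, backing off to a 2-minute interval works for low-frequency rollbacks, but low-frequency rollbacks do not trigger the problem in the first place, and with high-frequency rollbacks the back-off is unlikely to kick in. For example, the task may back off to 2 minutes during off-peak hours, but the problem can still be triggered at peak.
This is hard to decide. It would be simpler to make the execution frequency of the separate thread configurable and let users tune it to their workload.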

nakerny · 2024-04-15T03:54:14Z

Also, can all Rollbacking records be deleted directly? Could a Rollbacking status change to a Retrying status a few seconds later?

nakerny · 2024-04-15T03:55:22Z

Flexible configuration does not help in that case either.

nakerny · 2024-04-15T03:58:35Z

I still think a final state is needed. Anything based on an intermediate state plus elapsed time is not very sound.

funky-eyes · 2024-04-15T05:36:38Z

Adding one more intermediate state would reduce performance. If you switch to raft mode, this problem does not occur. This handling applies only to the modes where storage and compute are separated, db and redis.

funky-eyes · 2024-04-15T05:38:59Z

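> Also, can all Rollbacking records be deleted directly? Could a Rollbacking status change to a Retrying status a few seconds later?

It only changes to Retrying when an exception occurs. Rollbacking means the TM has sent its decision to the TC, the TC has set the status to Rollbacking and started rolling back, and the TM is waiting for the TC to finish phase two and respond.
At this point the TM waits synchronously for the TC's result. A transaction is basically only put into the Retrying state when an exception occurs during the TC's synchronous processing. A Rollbacking record picked up by the asynchronous thread is very unlikely to become Retrying.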

nakerny · 2024-04-16T00:54:55Z

OK, I'll try raft mode first. As for this problem in the separated-storage modes, do you have any other suggestions?

funky-eyes · 2024-04-16T02:26:03Z

Storage-compute separation will not be the community's main focus going forward. In the future we may look at handling this on multi-raft instead. So we will not put much effort into problems caused by storage-compute separation. In general we will add fallback measures and optimize the processing logic, but major architectural changes are unlikely.

nakerny · 2024-04-16T03:44:38Z

Understood.

liuqiufeng · 2024-04-19T08:56:13Z

I will optimize it using option 2

liuqiufeng self-assigned this Apr 19, 2024

liuqiufeng linked a pull request Apr 26, 2024 that will close this issue

optimize: split the task thread pool for committing and rollbacking statuses #6499

Open

1 task
