
Domains: Bundle limit check breaks extrinsic ordering #2715

Closed
vedhavyas opened this issue Apr 24, 2024 · 18 comments

@vedhavyas
Member

Recently on gemini-3h we saw that extrinsic ordering was broken within bundles on the EVM chain. I initially assumed this was due to either a substrate upstream or frontier upstream issue. It turns out the ordering break was a regression introduced in this PR - https://github.com/subspace/subspace/pull/2454/files

We seem to be skipping ready extrinsics coming from the tx_pool. Suppose there are multiple extrinsics from the same sender, for example with nonces 0, 1, 2, 3, 4, 5.
If adding the extrinsic with nonce 3 would push the bundle over its weight or size limit, we skip it and move to the next extrinsic, nonce 4. If that fits, we add it to the bundle, thereby breaking the order. As a result, every extrinsic after nonce 2 will fail at pre_dispatch during domain block import, since the txn with nonce 3 is missing.

Even if the extrinsic with nonce 3 is included in the next bundle within the same consensus block, the order is not restored, since the final extrinsics list is defined by the order of the bundles included in the consensus block.

This brings us to the two issues at hand:

  • The bundle limit check should account for the transaction sender. If an extrinsic from account a was skipped, then all subsequent extrinsics from the same sender should be skipped within the given bundle. This should be a trivial fix and I have tested it locally.
  • The final extrinsics ordering should not simply follow the order in which the bundles were submitted. Instead, after shuffling the extrinsics grouped by sender, we should also re-arrange the extrinsics from a given sender into sorted (nonce) order. This ensures that even if the bundle order is shuffled for any reason, domain operators will still sort a given sender's extrinsics correctly. This also solves another issue: if a sender submits transactions with the same nonce but a higher tip (speeding up a transaction, in Ethereum terminology), the final ordering can execute the higher-tip extrinsic first, and the remaining extrinsics with the same nonce will fail at pre_dispatch. That would be in line with how both Substrate and Ethereum chains work, but unfortunately these failed extrinsics will still be included in the bundle, and thereby in the domain block, due to our domain protocol.
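
To make the first proposed fix concrete, here is a minimal toy sketch (not the actual subspace code; `Tx`, `fill_bundle`, and the weights are all hypothetical) of a bundle-fill loop that, once it skips a sender's extrinsic for exceeding the limit, also skips every later extrinsic from that sender:

```rust
use std::collections::HashSet;

// Hypothetical simplified transaction: sender id, nonce, and weight.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Tx {
    sender: u8,
    nonce: u32,
    weight: u64,
}

// Sketch of the proposed fix: once a tx from a sender is skipped for
// exceeding the bundle weight limit, skip every later tx from that
// sender in this bundle so per-sender nonce order is never broken.
fn fill_bundle(ready: &[Tx], max_weight: u64) -> Vec<Tx> {
    let mut bundle = Vec::new();
    let mut used = 0u64;
    let mut skipped_senders: HashSet<u8> = HashSet::new();
    for tx in ready {
        if skipped_senders.contains(&tx.sender) {
            // A lower nonce from this sender was already skipped.
            continue;
        }
        if used + tx.weight > max_weight {
            // Skip this tx and all later txs from the same sender.
            skipped_senders.insert(tx.sender);
            continue;
        }
        used += tx.weight;
        bundle.push(*tx);
    }
    bundle
}

fn main() {
    // Sender 1 has nonces 0..=3; nonce 2 does not fit, so nonce 3 must
    // be skipped too, even though it would fit on its own.
    let ready = [
        Tx { sender: 1, nonce: 0, weight: 4 },
        Tx { sender: 1, nonce: 1, weight: 4 },
        Tx { sender: 1, nonce: 2, weight: 5 },
        Tx { sender: 1, nonce: 3, weight: 1 },
        Tx { sender: 2, nonce: 0, weight: 2 },
    ];
    let nonces: Vec<(u8, u32)> = fill_bundle(&ready, 10)
        .iter()
        .map(|t| (t.sender, t.nonce))
        .collect();
    assert_eq!(nonces, vec![(1, 0), (1, 1), (2, 0)]);
}
```

Without the `skipped_senders` set, nonce 3 would be included without nonce 2, which is exactly the ordering break described above.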

Gemini-3h

  • For the first issue, the fix is purely client side. Once we make a new release with this change, operators on an old release may still submit extrinsics in the faulty order, so some extrinsics may still end up failing at pre_dispatch. But this should not break compatibility: new and old operators will still follow the bundle order for extrinsic ordering while deriving the domain block, so both should agree on the same final set.
  • For the second issue, we can move the final per-sender extrinsic ordering from the client to the runtime, which would require a runtime upgrade first. But since the existing final extrinsics list derivation is on the client side, operators running old versions may still derive a different order, which could lead to fraud proofs. IMHO, I would prefer not introducing this change in Gemini-3h but rather in the next network.

Thoughts @nazar-pc @dariolina @NingLin-P ?

@vedhavyas vedhavyas self-assigned this Apr 24, 2024
@NingLin-P
Member

This is quite a surprise to me, as I did consider this scenario when working on #2454. BTW, have you reproduced this with an integration test?

If adding the extrinsic with nonce 3 would push the bundle over its weight or size limit, we skip it and move to the next extrinsic, nonce 4. If that fits, we add it to the bundle, thereby breaking the order.

The extrinsic with nonce 4 (and any following extrinsic from the same sender), even if it fits in the bundle, will still fail the later check_extrinsics_and_do_pre_dispatch check and not be included in the same bundle. We use the same runtime API instance throughout, and its state changes are preserved, which means inside that instance the sender's nonce is 2 and it is expecting 3; any other nonce will be rejected. So the first issue should not exist.

As for the second issue, the check_extrinsics_and_do_pre_dispatch check ensures the extrinsics from the same sender are ordered within a bundle. So one bundle may have 0, 1, 2, 3 and another may have 0, 1; no matter the order of these two bundles, after de-duplicating and compiling Vec<Bundle> into Vec<Extrinsic>, the order of the extrinsics from the same sender is still preserved (unless there is another regression).
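
This order-preservation argument can be illustrated with a toy sketch (hypothetical types and function names, not the actual derivation code): if each bundle internally carries a sender's extrinsics in nonce order, then concatenating bundles in consensus-block order and de-duplicating preserves the per-sender order regardless of how the bundles themselves are ordered:

```rust
use std::collections::HashSet;

// Hypothetical simplified extrinsic, identified by (sender, nonce).
type Ext = (u8, u32);

// Sketch of compiling Vec<Bundle> into Vec<Extrinsic>: walk bundles in
// consensus-block order and keep only the first occurrence of each
// extrinsic (de-duplication).
fn compile_bundles(bundles: &[Vec<Ext>]) -> Vec<Ext> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for bundle in bundles {
        for &ext in bundle {
            if seen.insert(ext) {
                out.push(ext);
            }
        }
    }
    out
}

fn main() {
    // Bundle A holds sender 1's nonces 0..=3, bundle B holds 0..=1.
    // In either bundle order the per-sender sequence stays 0,1,2,3.
    let a = vec![(1, 0), (1, 1), (1, 2), (1, 3)];
    let b = vec![(1, 0), (1, 1)];
    let expected = vec![(1, 0), (1, 1), (1, 2), (1, 3)];
    assert_eq!(compile_bundles(&[a.clone(), b.clone()]), expected);
    assert_eq!(compile_bundles(&[b, a]), expected);
}
```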

On a closer look, I'm suspicious of the check_extrinsics_and_do_pre_dispatch check itself:

  • Many checks are performed within pre_dispatch. If the nonce check passes but a later check (e.g. the tx fee check) fails, will the nonce change persist in the runtime API instance? If so, it would also cause the first issue, and would also be a problem for the invalid bundle FP.
  • It seems the pre_dispatch of an EVM tx in frontier differs from substrate's, in that it does not update the nonce or charge the tx fee in pre_dispatch. If so, that would also cause the first issue, though I may be missing something.

@NingLin-P
Member

Also, I believe the issue existed long before #2454: the bundle limit check and the other checks during bundle construction were introduced a long time ago, and they skip a tx if a check fails; #2454 just makes the checks more efficient.

@nazar-pc
Member

operators with old release may still submit the extrinsic in faulty order

Does this mean they will be slashed?

after shuffling the extrinsic grouped by sender, we should also re-arrange the extrinsics from a given sender in sorted order

This is how Liu-Cheng implemented it a long time ago; is that not the case anymore?

I would prefer not introducing this change to Gemini-3h but instead for the next network

IIRC 3h was supposed to be the last network before mainnet, is this still the case @dariolina?

@dariolina
Member

We can reset 3h if we need to, but this issue does not seem sufficient reason for reset. Is it?

@NingLin-P
Member

If the issue is in the check_extrinsics_and_do_pre_dispatch runtime API, then the fix will most likely be a domain runtime upgrade. It should not bring any compatibility issues to 3h, though we need to test the domain runtime upgrade path on devnet first, since we have never done one before.

But in case we need to do such FP verification/ER derivation upgrade in the future, we should consider #2712

@vedhavyas
Member Author

vedhavyas commented Apr 25, 2024

Does this mean they will be slashed?

Yes, if we fix the second issue, operators running the old version will get slashed due to the change in the domain block extrinsic hash.

This is how Liu-Cheng implemented long time ago, is it not the case anymore?

I don't believe it was ever changed. AFAIK, we never adjusted the order, but I will need to check the history to see if we broke it at some point.

It seems the pre_dispatch of an EVM tx in frontier differs from substrate's, in that it does not update the nonce or charge the tx fee in pre_dispatch. If so, that would also cause the first issue, though I may be missing something.

Turns out this is an issue with the EVM pre_dispatch, as @NingLin-P suggested.
Here are the issues:

  1. https://github.com/subspace/subspace/blob/main/domains/runtime/evm/src/lib.rs#L852 gobbles up the inner result; as a result, pre_dispatch on our end always passes, but the inner result may return a nonce error.
  2. The reason it returns a nonce error is that the nonce is not stored during pre_dispatch but rather during execution. So from a list of txns with nonces 0, 1, 2, 3, only the txn with nonce 0 will make it into any bundle for a given consensus block. The remaining txns will only make it into bundles once the previous transaction is executed and the nonce is updated. More here - https://github.com/subspace/frontier/blob/subspace-v6/primitives/evm/src/validation.rs#L109
  3. To fix this so that multiple txns with incrementing nonces go into a bundle, we need a modified version of pre_dispatch_self_contained that also allows future nonces instead of only the immediate next nonce. This poses another problem: we cannot allow arbitrary future nonces, only strictly incrementing ones, or else the bundle proposer will still skip an in-between transaction due to the bundle weight limits introduced in - Try to include more tx in the bundle and skip including tx that already included in previous bundle  #2454

@NingLin-P
Member

https://github.com/subspace/subspace/blob/main/domains/runtime/evm/src/lib.rs#L852 gobbles up the inner result; as a result, pre_dispatch on our end always passes, but the inner result may return a nonce error.

Yeah, this is another issue we have 👍

I think the fix will be:

  • Instead of allowing future nonces, we should increase the nonce and charge the tx fee in our wrapper of pre_dispatch_self_contained, so an evm tx will be handled like a substrate tx.

All these changes are in the evm domain runtime, so we will need a domain runtime upgrade.

Many checks are performed within pre_dispatch. If the nonce check passes but a later check (e.g. the tx fee check) fails, will the nonce change persist in the runtime API instance? If so, it would also cause the first issue, and would also be a problem for the invalid bundle FP.

Also, this seems unrelated to the issue we hit here, but it is something we need to handle.

@vedhavyas
Member Author

Instead of allowing future nonces, we should increase the nonce and charge the tx fee in our wrapper of pre_dispatch_self_contained, so an evm tx will be handled like a substrate tx.

I would first like to understand why frontier deviated in the first place. Maybe this should be fixed upstream, if possible.

@NingLin-P
Member

The issue is fixed in #2723. Do you think we need to test it on devnet before closing this issue? @vedhavyas

@dariolina
Member

I think it should be deployed to devnet so @abhi3700 can test whether it fixes the issue he was facing with tx ordering.

@abhi3700
Member

abhi3700 commented May 14, 2024

I have tested devnet-nova-2024-may-06 version.

Initially, when I sent 32 txs using a foundry script, it kind of got stuck (and in some cases failed with an OutOfGas error), and I force-exited after a long wait. I then verified from the PolkadotJS explorer that they were waiting in the pool:
(screenshot: transactions pending in the tx pool)
And then on re-running the script, more txs got stuck.

So, @NingLin-P suggested doing the testing on a locally run devnet to avoid repetitive devnet reboots. I found some issues there:

  • OutOfGas error
  • Unexpected Reverted error.
  • backend: failed while inspecting

Here is the Notion doc, which details everything (successes/failures from some targeted testing).

@vedhavyas
Member Author

@abhi3700 This issue is about whether transaction ordering is fixed with the recent patch we made.

We are aware of the OutOfGas issue; it is unrelated to the issue we are discussing here.
So, to sum up: are the transactions you are sending being executed in nonce order? If not, please let us know.
Otherwise we will consider this fixed and close the issue as resolved.

@abhi3700
Member

abhi3700 commented May 15, 2024

I tested on a local devnet and didn't find the tx ordering issue. But when testing on the current live devnet, txs are still stuck in the pool (as shown in the image above).
So I am unsure whether the tx ordering issue is resolved.

@vedhavyas
Member Author

@abhi3700 Transactions being stuck in the tx-pool seems unrelated to transactions being included in order. Earlier, transactions were included out of order, and as a result txns were failing to execute.

Transactions getting stuck in the tx-pool means one of the following:

  • A nonce gap between the txns
  • The operator might not be submitting bundles

I would suggest you ensure there are no nonce gaps created by your script. If that is not the case, then, since you mentioned running the node locally, is the operator submitting bundles? I have recently run multiple scripts provided by you and @marc-aurele-besner which send around 30 txns, and all of them go through fine given the operator is submitting bundles.

Now, coming to the issues you found on your local node:

OutOfGas error

There is a known issue with frontier's gas_estimate for a given transaction, and we are working to fix it upstream. How was the GasLimit supplied for these txns? Have you tried overriding the gas limit when sending the txn, to see if the error goes away for now?

Unexpected Reverted error.

This seems like a contract issue rather than an EVM issue. Does the same txn go through on other test chains you have used before? If so, we would need to see how their EVM contract execution differs from ours. It would also be great if you could provide a script to reproduce this, for further debugging.

backend: failed while inspecting

Where do you see this error? Is it on the client submitting txns, or on the node? I don't think we have seen such an error from an operator node before.

@abhi3700
Member

abhi3700 commented May 15, 2024

I would suggest you ensure there are no nonce-gaps created by your script.

In the script, there is no setting that sets the nonce manually.

If that is not the case, since you mentioned running the node locally, is the operator submitting bundles?

Yes it is.

I have recently ran multiple scripts provided by you and @marc-aurele-besner which sends around 30 txns and all of them goes through fine given operator is submitting bundles.

I am not sure about the current live devnet, but I saw the txs stuck in the pool there; that's why I highlighted it. While running devnet locally, txs didn't get stuck in the pool.

OutOfGas error

There is a known issue with frontier's gas_estimate for a given transaction, and we are working to fix it upstream. How was the GasLimit supplied for these txns? Have you tried overriding the gas limit when sending the txn, to see if the error goes away for now?

The gas limit was supplied in 2 main ways:

  1. $ forge ... --gas-limit
  2. manually in the code during function calls.
    I have tried every combination to avoid running into such gas issues.

You can check these:

If I missed any, please let me know on this.

Unexpected Reverted error.

This seems like a contract issue rather than an EVM issue. Does the same txn go through on other test chains you have used before? If so, we would need to see how their EVM contract execution differs from ours. It would also be great if you could provide a script to reproduce this, for further debugging.

I have run the script successfully on the ETH Sepolia chain multiple times.
I just ran it again for the record. You can find the details here, along with the broadcasted tx details (in JSON format).

backend: failed while inspecting

Where do you see this error? Is it on the client submitting txns, or on the node? I don't think we have seen such an error from an operator node before.

On the client side, in the CLI. It can be reproduced as well.
Details here.

@vedhavyas
Member Author

I am not sure about the current live devnet, but I saw the txs stuck in the pool there; that's why I highlighted it. While running devnet locally, txs didn't get stuck in the pool.

Okay, this seems unrelated to this issue. We need to debug devnet for this.

The gas limit was supplied via 2 main ways via
$ forge ... --gas-limit
manually in the code during function calls.
I have tried every combination in order to not come up with such gas issue.

I'm confused by this. For me, while running locally with the gas_limit overridden, everything seems to work with the contract scripts you provided. I'm wondering if this is specific to that contract itself.

I have run the script successfully on ETH Sepolia chain multiple times.
I just ran again for record. You can find the details here along with the broadcasted tx details (in JSON format).

Alright, can you create a new issue with a script to reproduce this locally, for further debugging?

On the client side in CLI. It can be reproduced as well.

I'm not sure if this is actually a node issue, but let's create an issue tracker for it with details on GitHub instead of Notion.

@abhi3700 It would be ideal to put all the details, including scripts to reproduce, in GitHub issues instead of Notion, so we have one place to track them.

Overall, the transaction ordering issue is resolved. I would like to close this issue and open any newly found issues separately to keep tracking easier, unless you disagree @dariolina.

@dariolina
Member

If it's a separate issue in your opinion, then absolutely open a new one.

@abhi3700
Member

Okay let's close this issue.

I will open new issues for the unrelated ones.
