
Domains: Bundle limit check breaks extrinsic ordering #2715

Closed
vedhavyas opened this issue Apr 24, 2024 · 18 comments

@vedhavyas
Member

Recently on gemini-3h we saw that extrinsic ordering was broken within bundles on the EVM chain. I initially assumed this was due to either a substrate upstream or frontier upstream issue. It turns out the ordering break was a regression introduced in this PR - https://github.com/subspace/subspace/pull/2454/files

We seem to be skipping ready extrinsics coming from the tx_pool. Suppose there are multiple extrinsics from the same sender, for example with nonces 0, 1, 2, 3, 4, 5.
If adding the extrinsic with nonce 3 would push the bundle over its weight or size limit, we skip it and move to the next extrinsic, nonce 4. If that fits, we add it to the bundle, thereby breaking the order. As a result, every extrinsic after nonce 2 will fail at pre_dispatch during domain block import, since the txn with nonce 3 is missing.

Even if the extrinsic with nonce 3 is included in the next bundle within the same consensus block, the order is not restored, since the final extrinsics list is defined by the order of the bundles included in the consensus block.

This brings us to the two issues at hand:

  • The bundle limit check should account for the transaction sender. If an extrinsic from account a was skipped, then all subsequent extrinsics from the same sender should be skipped within the given bundle. This should be a trivial fix and I have tested it locally.
  • The final extrinsics ordering should not simply follow the order in which the bundles were submitted. Instead, after shuffling the extrinsics grouped by sender, we should also re-arrange the extrinsics from a given sender into sorted (nonce) order. This ensures that even if the bundle order is shuffled for any reason, domain operators will still sort a given sender's extrinsics correctly. This also solves another issue: if a sender submits transactions with the same nonce but a higher tip (speeding up a transaction, in Ethereum terminology), the final ordering can execute the higher-tip extrinsic first, and the remaining extrinsics with the same nonce will fail at pre_dispatch. That would be in line with how both Substrate and Ethereum chains work, but unfortunately these failed extrinsics will still be included in the bundle, and thereby in the domain block, due to our domain protocol.
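
To make the first proposed fix concrete, here is a minimal toy sketch (not the actual subspace code; `Tx`, `fill_bundle`, and the weights are all hypothetical) of a bundle-fill loop that, once it skips a sender's extrinsic for exceeding the limit, also skips every later extrinsic from that sender:

```rust
use std::collections::HashSet;

// Hypothetical simplified transaction: sender id, nonce, and weight.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct Tx {
    sender: u8,
    nonce: u32,
    weight: u64,
}

// Sketch of the proposed fix: once a tx from a sender is skipped for
// exceeding the bundle weight limit, skip every later tx from that
// sender in this bundle so per-sender nonce order is never broken.
fn fill_bundle(ready: &[Tx], max_weight: u64) -> Vec<Tx> {
    let mut bundle = Vec::new();
    let mut used = 0u64;
    let mut skipped_senders: HashSet<u8> = HashSet::new();
    for tx in ready {
        if skipped_senders.contains(&tx.sender) {
            // A lower nonce from this sender was already skipped.
            continue;
        }
        if used + tx.weight > max_weight {
            // Skip this tx and all later txs from the same sender.
            skipped_senders.insert(tx.sender);
            continue;
        }
        used += tx.weight;
        bundle.push(*tx);
    }
    bundle
}

fn main() {
    // Sender 1 has nonces 0..=3; nonce 2 does not fit, so nonce 3 must
    // be skipped too, even though it would fit on its own.
    let ready = [
        Tx { sender: 1, nonce: 0, weight: 4 },
        Tx { sender: 1, nonce: 1, weight: 4 },
        Tx { sender: 1, nonce: 2, weight: 5 },
        Tx { sender: 1, nonce: 3, weight: 1 },
        Tx { sender: 2, nonce: 0, weight: 2 },
    ];
    let nonces: Vec<(u8, u32)> = fill_bundle(&ready, 10)
        .iter()
        .map(|t| (t.sender, t.nonce))
        .collect();
    assert_eq!(nonces, vec![(1, 0), (1, 1), (2, 0)]);
}
```

Without the `skipped_senders` set, nonce 3 would be included without nonce 2, which is exactly the ordering break described above.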

Gemini-3h

  • For the first issue, the fix is purely client side. Once we make a new release with this change, operators on an old release may still submit extrinsics in the faulty order, so some extrinsics may still end up failing at pre_dispatch. But this should not break compatibility: new and old operators will still follow the bundle order for extrinsic ordering while deriving the domain block, so both should agree on the same final set.
  • For the second issue, we can move the final per-sender extrinsic ordering from the client to the runtime, which would require a runtime upgrade first. But since the existing final extrinsics list derivation is on the client side, operators running old versions may still derive a different order, which could lead to fraud proofs. IMHO, I would prefer not introducing this change in Gemini-3h but rather in the next network.

Thoughts @nazar-pc @dariolina @NingLin-P ?

@vedhavyas vedhavyas self-assigned this Apr 24, 2024
@NingLin-P
Member

This is quite a surprise to me, as I did consider this scenario when working on #2454. BTW, have you reproduced this with an integration test?

If adding the extrinsic with nonce 3 would push the bundle over its weight or size limit, we skip it and move to the next extrinsic, nonce 4. If that fits, we add it to the bundle, thereby breaking the order.

The extrinsic with nonce 4 (and any following extrinsic from the same sender), even if it fits in the bundle, will still fail the later check_extrinsics_and_do_pre_dispatch check and not be included in the same bundle. We use the same runtime API instance throughout, and its state changes are preserved, which means inside that instance the sender's nonce is 2 and it is expecting 3; any other nonce will be rejected. So the first issue should not exist.

As for the second issue, the check_extrinsics_and_do_pre_dispatch check ensures the extrinsics from the same sender are ordered within a bundle. So one bundle may have 0, 1, 2, 3 and another may have 0, 1; no matter the order of these two bundles, after de-duplicating and compiling Vec<Bundle> into Vec<Extrinsic>, the order of the extrinsics from the same sender is still preserved (unless there is another regression).
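
This order-preservation argument can be illustrated with a toy sketch (hypothetical types and function names, not the actual derivation code): if each bundle internally carries a sender's extrinsics in nonce order, then concatenating bundles in consensus-block order and de-duplicating preserves the per-sender order regardless of how the bundles themselves are ordered:

```rust
use std::collections::HashSet;

// Hypothetical simplified extrinsic, identified by (sender, nonce).
type Ext = (u8, u32);

// Sketch of compiling Vec<Bundle> into Vec<Extrinsic>: walk bundles in
// consensus-block order and keep only the first occurrence of each
// extrinsic (de-duplication).
fn compile_bundles(bundles: &[Vec<Ext>]) -> Vec<Ext> {
    let mut seen = HashSet::new();
    let mut out = Vec::new();
    for bundle in bundles {
        for &ext in bundle {
            if seen.insert(ext) {
                out.push(ext);
            }
        }
    }
    out
}

fn main() {
    // Bundle A holds sender 1's nonces 0..=3, bundle B holds 0..=1.
    // In either bundle order the per-sender sequence stays 0,1,2,3.
    let a = vec![(1, 0), (1, 1), (1, 2), (1, 3)];
    let b = vec![(1, 0), (1, 1)];
    let expected = vec![(1, 0), (1, 1), (1, 2), (1, 3)];
    assert_eq!(compile_bundles(&[a.clone(), b.clone()]), expected);
    assert_eq!(compile_bundles(&[b, a]), expected);
}
```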

On a closer look, I'm suspicious of the check_extrinsics_and_do_pre_dispatch check itself:

  • Many checks are performed within pre_dispatch. If the nonce check passes but a later check (e.g. the tx fee check) fails, will the nonce change persist in the runtime API instance? If so, it would also cause the first issue, and would also be a problem for the invalid bundle FP.
  • It seems the pre_dispatch of an EVM tx in frontier differs from substrate's, in that it does not update the nonce or charge the tx fee in pre_dispatch. If so, that would also cause the first issue, though I may be missing something.

@NingLin-P
Member

Also, I believe the issue existed long before #2454: the bundle limit check and the other checks during bundle construction were introduced a long time ago, and they skip a tx if a check fails; #2454 just makes the checks more efficient.

@nazar-pc
Member

operators with old release may still submit the extrinsic in faulty order

Does this mean they will be slashed?

after shuffling the extrinsic grouped by sender, we should also re-arrange the extrinsics from a given sender in sorted order

This is how Liu-Cheng implemented it a long time ago; is that not the case anymore?

I would prefer not introducing this change to Gemini-3h but instead for the next network

IIRC 3h was supposed to be the last network before mainnet, is this still the case @dariolina?

@dariolina
Member

We can reset 3h if we need to, but this issue does not seem sufficient reason for reset. Is it?

@NingLin-P
Member

If the issue is in the check_extrinsics_and_do_pre_dispatch runtime API, then the fix will most likely be a domain runtime upgrade. It should not bring any compatibility issues to 3h, though we need to test the domain runtime upgrade path on devnet first, since we have never done one before.

But in case we need to do such FP verification/ER derivation upgrade in the future, we should consider #2712

@vedhavyas
Member Author

vedhavyas commented Apr 25, 2024

Does this mean they will be slashed?

Yes, if we fix the second issue, operators running the old version will get slashed due to the change in the domain block extrinsic hash.

This is how Liu-Cheng implemented long time ago, is it not the case anymore?

I don't believe it was ever changed. AFAIK, we never adjusted the order, but I will need to check the history to see if we broke it at some point.

It seems the pre_dispatch of an EVM tx in frontier differs from substrate's, in that it does not update the nonce or charge the tx fee in pre_dispatch. If so, that would also cause the first issue, though I may be missing something.

Turns out this is an issue with the EVM pre_dispatch, as @NingLin-P suggested.
Here are the issues:

  1. https://github.com/subspace/subspace/blob/main/domains/runtime/evm/src/lib.rs#L852 gobbles up the inner result; as a result, pre_dispatch on our end always passes, but the inner result may return a nonce error.
  2. The reason it returns a nonce error is that the nonce is not stored during pre_dispatch but rather during execution. So from a list of txns with nonces 0, 1, 2, 3, only the txn with nonce 0 will make it into any bundle for a given consensus block. The remaining txns will only make it into bundles once the previous transaction is executed and the nonce is updated. More here - https://github.com/subspace/frontier/blob/subspace-v6/primitives/evm/src/validation.rs#L109
  3. To fix this so that multiple txns with incrementing nonces go into a bundle, we need a modified version of pre_dispatch_self_contained that also allows future nonces instead of only the immediate next nonce. This poses another problem: we cannot allow arbitrary future nonces, only strictly incrementing ones, or else the bundle proposer will still skip an in-between transaction due to the bundle weight limits introduced in - Try to include more tx in the bundle and skip including tx that already included in previous bundle  #2454

@NingLin-P
Member

https://github.com/subspace/subspace/blob/main/domains/runtime/evm/src/lib.rs#L852 gobbles up the inner result; as a result, pre_dispatch on our end always passes, but the inner result may return a nonce error.

Yeah, this is another issue we have 👍

I think the fix will be:

  • Instead of allowing future nonces, we should increase the nonce and charge the tx fee in our wrapper of pre_dispatch_self_contained, so an evm tx will be handled like a substrate tx.

All these changes are in the evm domain runtime, so we will need a domain runtime upgrade.

Many checks are performed within pre_dispatch. If the nonce check passes but a later check (e.g. the tx fee check) fails, will the nonce change persist in the runtime API instance? If so, it would also cause the first issue, and would also be a problem for the invalid bundle FP.

Also, this seems unrelated to the issue we hit here, but it is something we need to handle.

@vedhavyas
Member Author

Instead of allowing future nonces, we should increase the nonce and charge the tx fee in our wrapper of pre_dispatch_self_contained, so an evm tx will be handled like a substrate tx.

I would first like to understand why frontier deviated in the first place. Maybe this should be fixed upstream, if possible.

@NingLin-P
Member

The issue is fixed in #2723. Do you think we need to test it on devnet before closing this issue? @vedhavyas

@dariolina
Member

I think it should be deployed to devnet so @abhi3700 can test whether it fixes the issue he was facing with tx ordering.

@abhi3700
Member

abhi3700 commented May 14, 2024

I have tested devnet-nova-2024-may-06 version.

Initially, when I sent 32 txs using a foundry script, it kind of got stuck (and in some cases failed with an OutOfGas error), and I force-exited after a long wait. I then verified from the PolkadotJS explorer that they were waiting in the pool:
(screenshot: transactions pending in the tx pool)
And then on re-running the script, more txs got stuck.

So, @NingLin-P suggested doing the testing on a locally run devnet to avoid repetitive devnet reboots. I found some issues there:

  • OutOfGas error
  • Unexpected Reverted error.
  • backend: failed while inspecting

Here is the Notion doc, which details everything (successes/failures from some targeted testing).

@vedhavyas
Member Author

@abhi3700 This issue is about whether transaction ordering is fixed with the recent patch we made.

We are aware of the OutOfGas issue; it is unrelated to the issue we are discussing here.
So, to sum up: are the transactions you are sending being executed in nonce order? If not, please let us know.
Otherwise we will consider this fixed and close the issue as resolved.

@abhi3700
Member

abhi3700 commented May 15, 2024

I tested on a local devnet and didn't find the tx ordering issue. But when testing on the current live devnet, txs are still stuck in the pool (as shown in the image above).
So I am unsure whether the tx ordering issue is resolved.

@vedhavyas
Member Author

@abhi3700 Transactions being stuck in the tx-pool seems unrelated to transactions being included in order. Earlier, transactions were included out of order, and as a result txns were failing to execute.

Transactions getting stuck in the tx-pool means one of the following:

  • A nonce gap between the txns
  • The operator might not be submitting bundles

I would suggest you ensure there are no nonce gaps created by your script. If that is not the case, then, since you mentioned running the node locally, is the operator submitting bundles? I have recently run multiple scripts provided by you and @marc-aurele-besner which send around 30 txns, and all of them go through fine given the operator is submitting bundles.

Now, coming to the issues you found on your local node:

OutOfGas error

There is a known issue with frontier's gas_estimate for a given transaction, and we are working to fix it upstream. How was the GasLimit supplied for these txns? Have you tried overriding the gas limit when sending the txn, to see if the error goes away for now?

Unexpected Reverted error.

This seems like a contract issue rather than an EVM issue. Does the same txn go through on other test chains you have used before? If so, we would need to see how their EVM contract execution differs from ours. It would also be great if you could provide a script to reproduce this, for further debugging.

backend: failed while inspecting

Where do you see this error? Is it on the client submitting txns, or on the node? I don't think we have seen such an error from an operator node before.

@abhi3700
Member

abhi3700 commented May 15, 2024

I would suggest you ensure there are no nonce-gaps created by your script.

In the script, there is no setting that sets the nonce manually.

If that is not the case, since you mentioned running the node locally, is the operator submitting bundles?

Yes it is.

I have recently ran multiple scripts provided by you and @marc-aurele-besner which sends around 30 txns and all of them goes through fine given operator is submitting bundles.

I am not sure about the current live devnet, but I saw the txs stuck in the pool there; that's why I highlighted it. While running devnet locally, txs didn't get stuck in the pool.

OutOfGas error

There is a known issue with frontier's gas_estimate for a given transaction, and we are working to fix it upstream. How was the GasLimit supplied for these txns? Have you tried overriding the gas limit when sending the txn, to see if the error goes away for now?

The gas limit was supplied in 2 main ways:

  1. $ forge ... --gas-limit
  2. manually in the code during function calls.
    I have tried every combination to avoid running into such gas issues.

You can check these:

If I missed any, please let me know on this.

Unexpected Reverted error.

This seems like a contract issue rather than an EVM issue. Does the same txn go through on other test chains you have used before? If so, we would need to see how their EVM contract execution differs from ours. It would also be great if you could provide a script to reproduce this, for further debugging.

I have run the script successfully on the ETH Sepolia chain multiple times.
I just ran it again for the record. You can find the details here, along with the broadcasted tx details (in JSON format).

backend: failed while inspecting

Where do you see this error? Is it on the client submitting txns, or on the node? I don't think we have seen such an error from an operator node before.

On the client side, in the CLI. It can be reproduced as well.
Details here.

@vedhavyas
Member Author

I am not sure about the current live devnet, but I saw the txs stuck in the pool there; that's why I highlighted it. While running devnet locally, txs didn't get stuck in the pool.

Okay, this seems unrelated to this issue. We need to debug devnet for this.

The gas limit was supplied via 2 main ways via
$ forge ... --gas-limit
manually in the code during function calls.
I have tried every combination in order to not come up with such gas issue.

I'm confused by this. For me, while running locally with the gas_limit overridden, everything seems to work with the contract scripts you provided. I'm wondering if this is specific to that contract itself.

I have run the script successfully on ETH Sepolia chain multiple times.
I just ran again for record. You can find the details here along with the broadcasted tx details (in JSON format).

Alright, can you create a new issue with a script to reproduce this locally, for further debugging?

On the client side in CLI. It can be reproduced as well.

I'm not sure if this is actually a node issue, but let's create an issue tracker for it with details on GitHub instead of Notion.

@abhi3700 It would be ideal to put all the details, including scripts to reproduce, in GitHub issues instead of Notion, so we have one place to track them.

Overall, the transaction ordering issue is resolved. I would like to close this issue and open any newly found issues separately to keep tracking easier, unless you disagree @dariolina.

@dariolina
Member

If it's a separate issue in your opinion, then absolutely open a new one.

@abhi3700
Member

Okay let's close this issue.

I will open new issues for the unrelated ones.
