FireFly 1.3 Performance Testing #1465

Closed · SamMayWork opened this issue Feb 14, 2024 · 7 comments

Doing performance testing ahead of the 1.3 release, following the example set in #563 (comment). It's worth noting that we don't have an RC at the moment, so this is mostly preliminary testing, but it's good to get some figures on the board.

Setup is the same as in the older performance testing issue:

2 FireFly nodes on one virtual server (EC2 m4.xlarge)
Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
Single geth node with 2 instances of ethconnect
Maximum time to confirm before considering failure = 1 minute

SamMayWork commented Feb 19, 2024

The Plan

Here's the plan: we don't yet have an RC of 1.3, but it's not worth holding up the performance testing for that. If there's a major slowdown, it's worth getting an idea of it now so we can start work on the fixes required before we begin the release process.

With that in mind, we need a performance run of FireFly v1.2.0 and then a run of the current main branch to see if there's a performance difference. Given this is very preliminary, it should be OK to use a shorter time frame to get a feel for how quickly we're getting through transactions.

FireFly v1.2.0

Using the configuration below, we get the following results over a one-hour window of running the tests:

Expand for test configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-1/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x69eb9c6b573415d3a8ed2fd38c0f9685a36c8455"}

FireFly git commit:

dc86491bdf1adc9f25cdb410eacfcbf0e1fc23ba
[screenshot]

Measured throughput according to the logs is hovering around ~40.26 TPS


FireFly (what will be) v1.3.0

Using the configuration below, we get the following results over a one-hour window of running the tests:

Expand for test configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-2/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a"}

FireFly git commit:

8b6b604d6f17396d4ba045b0221fc6e31408dcb3
[screenshot]

Measured throughput according to the logs is hovering around ~37.92 TPS


Looks like there's a ~6% difference in measured TPS between the two runs (40.26 vs 37.92 TPS; note also that the second run ran longer than the first, so even though the measurement window is an hour, the totals will still be higher). We'll need to dig into specific metrics to understand where we're losing time.

SamMayWork commented Feb 20, 2024

Down the rabbit hole 🐰

NOTE: I'll be referring to what's currently in main as '1.3', even though there isn't an RC yet

After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://ipfs_0:5001/api/v0/add | 10.44ms | 114.1ms | 11.68ms | 98.03ms | N/A |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |
| http://dataexchange_0:3000/api/v1/transfers | 27.22ms | 76.45ms | 22.74ms | 95.65ms | N/A |
| http://dataexchange_0:3000/api/v1/messages | 9.67ms | 75.15ms | 5.32ms | 68.73ms | N/A |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |


| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/messages/broadcast | 51.85ms | 328.1ms | 37.21ms | 301.3ms | 1.3! |
| /api/v1/namespaces/default/messages/private | 54.05ms | 452.3ms | 38.02ms | 273.6ms | 1.3! |
| /api/v1/namespaces/default/data | 63.31ms | 407.5ms | 66.87ms | 385.1ms | N/A |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |
| /api/v1/namespaces/default/tokens/mint | 78.77ms | 376.8ms | 77.63ms | 287.7ms | N/A |

Preliminary conclusions

  1. Calls outbound to the DataExchange blobs API are significantly faster on v1.2.0 than on v1.3.0
  2. Calls outbound to the EVM connector are significantly faster on v1.2.0 than on v1.3.0
  3. Calls outbound to the token connector are faster on v1.2.0 than on v1.3.0
  4. v1.3.0 is faster than v1.2.0 on all inbound API requests apart from `/api/v1/namespaces/default/contracts/invoke` (though this could be due to token connector slowdown)

*1 - It's worth noting that these runs were performed over an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.

*2 - Statistics are scraped from FireFly Core logs using a custom tool, which I'll contribute after some cleanup.
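
For reference, here's a minimal sketch of the kind of scraping involved (this is not the actual tool, just an illustration): it reads raw docker JSON log lines on stdin, matches the FireFly Core response lines of the form `<-- POST /api/v1/... [202] (21.64ms) httpreq=...` that appear in the log excerpts further down this thread, and prints average/max response time per route.

```go
// logstats.go - a minimal sketch of the kind of scraping described above
// (not the actual tool): it reads docker JSON log lines on stdin, picks out
// FireFly Core response lines like
//   <-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms)
// and prints average/max response time per route.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

type dockerLine struct {
	Log string `json:"log"`
}

type routeStats struct {
	count int
	sum   float64
	max   float64
}

var respRe = regexp.MustCompile(`<-- (GET|POST|PUT|DELETE) (\S+) \[\d+\] \(([\d.]+)ms\)`)

func main() {
	stats := map[string]*routeStats{}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // debug log lines can be long
	for scanner.Scan() {
		var dl dockerLine
		if err := json.Unmarshal(scanner.Bytes(), &dl); err != nil {
			continue // not a docker JSON log line
		}
		m := respRe.FindStringSubmatch(dl.Log)
		if m == nil {
			continue // not a response log line
		}
		route := m[1] + " " + m[2]
		ms, err := strconv.ParseFloat(m[3], 64)
		if err != nil {
			continue
		}
		s, ok := stats[route]
		if !ok {
			s = &routeStats{}
			stats[route] = s
		}
		s.count++
		s.sum += ms
		if ms > s.max {
			s.max = ms
		}
	}

	for route, s := range stats {
		fmt.Printf("%-60s avg=%7.2fms max=%7.2fms n=%d\n", route, s.sum/float64(s.count), s.max, s.count)
	}
}
```

Usage, assuming the container's raw JSON log file (the filename here is hypothetical): `go run logstats.go < firefly_core_0-json.log`.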


Given these figures it seems logical that the next areas for investigation are:

  • Data Exchange BLOBs API
  • EVM Connector transactions
  • Tokens Connector mint operations
  • Contract invoke (though this could be due to token connector slow down)

SamMayWork commented Feb 20, 2024

Investigation

Going through each of the areas of investigation in no particular order.

(Inbound) Contract Invoke API

| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |

Right, so here we've gone from an average of 28.42ms to 44.34ms, a ~56% increase in execution time, which is concerning. From the logs of each of the runs I conducted to produce the initial set of results, I've picked a transaction from both that shows the difference in performance.

From v1.2.0 I have httpreq=qZ1KfQdQ which took 21.64ms
From v1.3.0 I have httpreq=9UDOTaTZ which took 83.89ms


1.2 logs

{"log":"[2024-02-19T19:59:15.382Z]  INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383664552Z"}
{"log":"[2024-02-19T19:59:15.382Z] DEBUG SQL-\u003e begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383688129Z"}
{"log":"[2024-02-19T19:59:15.383Z] DEBUG SQL\u003c- begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383696645Z"}
{"log":"[2024-02-19T19:59:15.384Z] DEBUG SQL-\u003e insert transactions dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38455654Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL\u003c- insert transactions sequences=[47918] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385256478Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL-\u003e insert operations dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385463333Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- insert operations sequences=[138572] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.386316466Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL-\u003e lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38660824Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387084482Z"}
{"log":"[2024-02-19T19:59:15.387Z] DEBUG SQL-\u003e insert events dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387300098Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL\u003c- insert events sequences=[372131] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390809911Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL-\u003e commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390826751Z"}
{"log":"[2024-02-19T19:59:15.392Z] DEBUG SQL\u003c- commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392218175Z"}
{"log":"[2024-02-19T19:59:15.392Z]  INFO Emitted transaction_submitted event 314edbfc-f82d-4a32-a58e-317c13e09842 for default:7eb9f31f-48a0-4979-85f4-008b3a63d901 (correlator=,topic=contract_invoke) dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392479498Z"}
{"log":"[2024-02-19T19:59:15.392Z]  INFO Executing blockchain_invoke operation 1c2433e4-efd5-49ea-8cd5-6a8d496f8baf via handler ContractManager httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.39296641Z"}
{"log":"[2024-02-19T19:59:15.404Z]  INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms) httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.406246777Z"}

1.3 logs

{"log":"[2024-02-19T13:57:59.874Z]  INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.876572405Z"}
{"log":"[2024-02-19T13:57:59.876Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.877197117Z"}
{"log":"[2024-02-19T13:57:59.893Z]  INFO Executing blockchain_invoke operation ec41af8e-51d7-454c-a3a7-e7d95d4549ae via handler ContractManager httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894025396Z"}
{"log":"[2024-02-19T13:57:59.893Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894051915Z"}
{"log":"[2024-02-19T13:57:59.894Z] DEBUG EVMConnectorBody: {\"from\":\"0x4c765d85923692812d4a9c72e599b5565e49d943\",\"headers\":{\"type\":\"SendTransaction\",\"id\":\"default:ec41af8e-51d7-454c-a3a7-e7d95d4549ae\"},\"method\":{\"type\":\"function\",\"name\":\"set\",\"inputs\":[{\"name\":\"newValue\",\"type\":\"uint256\"}],\"outputs\":[]},\"params\":[164],\"to\":\"0x457e54a0e07aae6332ffdeea59e3c5580810ba0a\"} httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894158763Z"}
{"log":"[2024-02-19T13:57:59.960Z]  INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (83.89ms) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.962258998Z"}

The big difference between the two sets of logs shows there have definitely been implementation changes here for us to dig into... time to look at the code...

[screenshot]

Bingpot! - Changes within the last 8 months... which is this PR: #1354, curiously labelled 'Further optimize blockchain transaction inserts to DB'. I think there's more to the story here to uncover...
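
To make that suspicion concrete, here's a conceptual Go sketch (emphatically not FireFly's actual implementation) of the trade-off such a change can introduce: handing inserts to a background batch writer amortises database cost across many rows, but an individual API request that waits for its row to be committed can now also wait on the batch flush timer. Whether that's what's happening here still needs confirming against what #1354 actually changed.

```go
// batchwriter.go - conceptual sketch only (not FireFly's implementation) of
// the trade-off discussed above: queueing DB inserts to a background batch
// writer instead of performing them inline in the HTTP handler. Batching
// reduces per-row database cost, but a request that waits for its row to be
// committed can now also wait for the batch to fill or the flush timer to fire.
package main

import (
	"fmt"
	"time"
)

type insertRequest struct {
	row  string
	done chan struct{} // closed once the row has been committed
}

// batchWriter commits queued rows either when the batch is full or when the
// flush timeout expires, whichever comes first.
func batchWriter(queue <-chan insertRequest, batchSize int, timeout time.Duration) {
	var batch []insertRequest
	timer := time.NewTimer(timeout)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// in a real writer this would be one multi-row insert + commit
		fmt.Printf("committing %d row(s) in one DB transaction\n", len(batch))
		for _, req := range batch {
			close(req.done)
		}
		batch = nil
	}
	for {
		select {
		case req := <-queue:
			batch = append(batch, req)
			if len(batch) >= batchSize {
				flush()
			}
		case <-timer.C:
			flush()
			timer.Reset(timeout)
		}
	}
}

func main() {
	queue := make(chan insertRequest, 100)
	go batchWriter(queue, 50, 25*time.Millisecond)

	// simulate a single API request queueing its insert and waiting for commit:
	// on a quiet system it waits for the 25ms flush timer rather than the
	// couple of milliseconds an inline insert would take
	req := insertRequest{row: "transactions", done: make(chan struct{})}
	start := time.Now()
	queue <- req
	<-req.done
	fmt.Printf("request waited %v for its insert to commit\n", time.Since(start))
}
```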

(Outbound) Token Connector API Mint

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |

Looks like a ~10ms increase on average; we need to dig into the token connector to figure out what has changed recently. A walk through the code shows some changes to how we submit transactions to the blockchain, which is the prime suspect at the moment: hyperledger/firefly-tokens-erc20-erc721#128

[screenshot]

The suspected PR implements retry logic for blockchain calls, which opens up the possibility that some of the calls made during the test failed and were retried, where previously we would have allowed them to fail. I might need to conduct another run and collect logs from the tokens connector to figure out whether we're actually retrying some calls.
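
To illustrate why retries would show up as higher averages, here's a hedged sketch (in Go for consistency with the other examples; the tokens connector itself is TypeScript and this is not its code): a call that fails once and is retried after a back-off delay is observed by the caller as a single, much slower request.

```go
// retry.go - hedged sketch (not the tokens connector's actual code) of why
// retry-with-backoff inflates the *measured* average response time: one
// transient failure plus a back-off delay is reported as a single slow call.
package main

import (
	"errors"
	"fmt"
	"time"
)

// callWithRetry retries fn up to maxAttempts times with exponential backoff.
func callWithRetry(fn func() error, maxAttempts int, initialDelay time.Duration) error {
	delay := initialDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	attempts := 0
	flaky := func() error {
		attempts++
		time.Sleep(5 * time.Millisecond) // pretend each blockchain call takes ~5ms
		if attempts == 1 {
			return errors.New("transient error")
		}
		return nil
	}

	start := time.Now()
	_ = callWithRetry(flaky, 3, 20*time.Millisecond)
	// one transient failure turns a ~5ms call into a ~30ms one from the caller's view
	fmt.Printf("observed latency: %v over %d attempts\n", time.Since(start), attempts)
}
```

If the connector logs show retry attempts during the run, that would line up with the averages shifting while the cost of an individual successful call stays the same.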

Given I can't (yet) isolate a specific reason for the slowdown, I'm going to run some quick tests focused exclusively on token mint operations to see whether I can pick out a specific issue. I'm hoping that running these in isolation will show very similar results.

(Outbound) POST EVM Connector

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |

Into the EVM Connector code - might be a bit tricky since to my knowledge it doesn't use the same API structure as the rest of the applications...

So after going through the code, it looks like we're doing a lot of calls for gas estimation (which makes sense really); the only PR in that space that has gone in since 1.2 looks to be hyperledger/firefly-evmconnect#85, but there's nothing immediately obvious as the cause of the slowdown.
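
For context on what each estimation costs, here's a minimal sketch of the standard `eth_estimateGas` JSON-RPC round trip to the node (this is not the firefly-evmconnect code; the endpoint URL is a placeholder, and the from/to addresses and the set(164) calldata mirror the 1.3 contract-invoke log excerpt above):

```go
// estimategas.go - a minimal sketch (not firefly-evmconnect itself) of the
// standard eth_estimateGas JSON-RPC round trip that sits behind each gas
// estimation. The endpoint URL is a placeholder; the from/to addresses and
// the set(164) call mirror the 1.3 contract-invoke log excerpt above.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

type rpcResponse struct {
	Result string          `json:"result"`
	Error  json.RawMessage `json:"error,omitempty"`
}

func main() {
	// ABI-encoded call to set(164): the 4-byte selector for set(uint256)
	// followed by the uint256 argument left-padded to 32 bytes
	callData := "0x60fe47b1" + fmt.Sprintf("%064x", 164)

	req := rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "eth_estimateGas",
		Params: []interface{}{map[string]string{
			"from": "0x4c765d85923692812d4a9c72e599b5565e49d943",
			"to":   "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a",
			"data": callData,
		}},
	}
	body, _ := json.Marshal(req)

	// placeholder geth endpoint - in the perf stack this call is made by the connector
	resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("estimated gas (hex): %s, error: %s\n", out.Result, out.Error)
}
```

Every submission pays this round trip before the transaction is actually sent, so any additional estimation calls or extra node latency would show up directly in the outbound figures above.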

(Outbound) Data Exchange BLOBs API

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |

The average slowdown looks to be ~24ms, which is significant. This one is a bit odd, though, because the repo itself hasn't had any changes for a long period of time... https://github.com/hyperledger/firefly-dataexchange-https.

Additionally, looking through the code shows that nothing in there has changed since 1.2. I'm not sure I understand enough right now to comment on what the source of the slowdown could be, but that'll be the next thing to dig into.

So after a cursory investigation of all of these, I think there might be a legitimate slowdown on the /contracts/invoke API, but for the other APIs the discrepancy in the results doesn't make a whole lot of sense. I think we're at the point where we should run a much longer test and then observe the results at the end of those runs. While those runs are going, it should be possible to investigate a bit more what's going on with the contracts API.

SamMayWork commented Feb 21, 2024

Long-Running Test

Previously, tests have been run for a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.

Configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-test/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x3f80fa2a99168245a54fabb774ec3ddc7e71419e"}

FireFly git commit:

8b6b604d6f17396d4ba045b0221fc6e31408dcb3

EnriqueL8 commented:

Picking this up - I see similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed, and even reports a higher number than were sent! I think it's because on rewind it receives some sort of message_confirmed event by accident... digging into this.

EnriqueL8 commented:

Have posted the results from a run on the Hardening 1.3 release issue - the problem with the difference between confirmed and submitted was due to the metrics not being correctly added, and should be fixed by #1490.

EnriqueL8 commented:

This is done
