FireFly 1.3 Performance Testing #1465

Closed · SamMayWork opened this issue Feb 14, 2024 · 7 comments

Doing performance testing ahead of the 1.3 release, following the example set in #563 (comment). It's worth noting that we don't have an RC at the moment, so this is mostly preliminary testing, but it's good to get some figures on the board.

Setup is the same as in the older performance testing issue:

2 FireFly nodes on one virtual server (EC2 m4.xlarge)
Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
Single geth node with 2 instances of ethconnect
Maximum time to confirm before considering failure = 1 minute

SamMayWork commented Feb 19, 2024

The Plan

Here's the plan: we don't yet have an RC of 1.3, but it's not worth holding up the performance testing for that. If there's a major slowdown, it's worth getting an idea of it now so we can start work on the fixes required before we begin the release process.

With that in mind, we need a performance run of FireFly v1.2.0 and then a run of the current main branch to see if there's a performance difference. Given this is very preliminary, it should be OK to use a shorter time frame to get a feel for how quickly we're getting through transactions.

FireFly v1.2.0

Using the configuration below, we get the following results over a one-hour window of running the tests:

Expand for test configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-1/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x69eb9c6b573415d3a8ed2fd38c0f9685a36c8455"}

FireFly git commit:

dc86491bdf1adc9f25cdb410eacfcbf0e1fc23ba
[screenshot]

Measured throughput according to the logs is hovering around ~40.26 TPS


FireFly (what will be) v1.3.0

Using the configuration below, we get the following results over a one-hour window of running the tests:

Expand for test configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-2/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a"}

FireFly git commit:

8b6b604d6f17396d4ba045b0221fc6e31408dcb3
[screenshot]

Measured throughput according to the logs is hovering around ~37.92 TPS


Looks like there's a ~6% difference in measured TPS between the two runs (40.26 vs 37.92 TPS; note also that the second run ran longer than the first, so even though the measurement window is an hour, the totals will still be higher). We'll need to dig into specific metrics to understand where we're losing time.

SamMayWork commented Feb 20, 2024

Down the rabbit hole 🐰

NOTE: I'll be referring to what's currently in main as '1.3', even though there isn't an RC yet

After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://ipfs_0:5001/api/v0/add | 10.44ms | 114.1ms | 11.68ms | 98.03ms | N/A |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |
| http://dataexchange_0:3000/api/v1/transfers | 27.22ms | 76.45ms | 22.74ms | 95.65ms | N/A |
| http://dataexchange_0:3000/api/v1/messages | 9.67ms | 75.15ms | 5.32ms | 68.73ms | N/A |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |


| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/messages/broadcast | 51.85ms | 328.1ms | 37.21ms | 301.3ms | 1.3! |
| /api/v1/namespaces/default/messages/private | 54.05ms | 452.3ms | 38.02ms | 273.6ms | 1.3! |
| /api/v1/namespaces/default/data | 63.31ms | 407.5ms | 66.87ms | 385.1ms | N/A |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |
| /api/v1/namespaces/default/tokens/mint | 78.77ms | 376.8ms | 77.63ms | 287.7ms | N/A |

Preliminary conclusions

  1. Calls outbound to the DataExchange blobs API are significantly faster on v1.2.0 than on v1.3.0
  2. Calls outbound to the EVM connector are significantly faster on v1.2.0 than on v1.3.0
  3. Calls outbound to the token connector are faster on v1.2.0 than on v1.3.0
  4. v1.3.0 is faster than v1.2.0 on all inbound API requests apart from `/api/v1/namespaces/default/contracts/invoke` (though this could be due to token connector slowdown)

*1 - It's worth noting that these runs were performed over an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.

*2 - Statistics are scraped from FireFly Core logs using a custom tool, which I'll contribute after some cleanup.
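
For reference, here's a minimal sketch of the kind of scraping involved (this is not the actual tool, just an illustration): it reads raw docker JSON log lines on stdin, matches the FireFly Core response lines of the form `<-- POST /api/v1/... [202] (21.64ms) httpreq=...` that appear in the log excerpts further down this thread, and prints average/max response time per route.

```go
// logstats.go - a minimal sketch of the kind of scraping described above
// (not the actual tool): it reads docker JSON log lines on stdin, picks out
// FireFly Core response lines like
//   <-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms)
// and prints average/max response time per route.
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
	"regexp"
	"strconv"
)

type dockerLine struct {
	Log string `json:"log"`
}

type routeStats struct {
	count int
	sum   float64
	max   float64
}

var respRe = regexp.MustCompile(`<-- (GET|POST|PUT|DELETE) (\S+) \[\d+\] \(([\d.]+)ms\)`)

func main() {
	stats := map[string]*routeStats{}

	scanner := bufio.NewScanner(os.Stdin)
	scanner.Buffer(make([]byte, 0, 1024*1024), 1024*1024) // debug log lines can be long
	for scanner.Scan() {
		var dl dockerLine
		if err := json.Unmarshal(scanner.Bytes(), &dl); err != nil {
			continue // not a docker JSON log line
		}
		m := respRe.FindStringSubmatch(dl.Log)
		if m == nil {
			continue // not a response log line
		}
		route := m[1] + " " + m[2]
		ms, err := strconv.ParseFloat(m[3], 64)
		if err != nil {
			continue
		}
		s, ok := stats[route]
		if !ok {
			s = &routeStats{}
			stats[route] = s
		}
		s.count++
		s.sum += ms
		if ms > s.max {
			s.max = ms
		}
	}

	for route, s := range stats {
		fmt.Printf("%-60s avg=%7.2fms max=%7.2fms n=%d\n", route, s.sum/float64(s.count), s.max, s.count)
	}
}
```

Usage, assuming the container's raw JSON log file (the filename here is hypothetical): `go run logstats.go < firefly_core_0-json.log`.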


Given these figures it seems logical that the next areas for investigation are:

  • Data Exchange BLOBs API
  • EVM Connector transactions
  • Tokens Connector mint operations
  • Contract invoke (though this could be due to token connector slow down)

SamMayWork commented Feb 20, 2024

Investigation

Going through each of the areas of investigation in no particular order.

(Inbound) Contract Invoke API

| Route (Inbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| /api/v1/namespaces/default/contracts/invoke | 28.42ms | 226.9ms | 44.34ms | 228.4ms | 1.2 |

Right, so here we've gone from an average of 28.42ms to 44.34ms, a ~56% increase in execution time, which is concerning. From the logs of each of the runs I conducted to produce the initial set of results, I've picked a transaction from both that shows the difference in performance.

From v1.2.0 I have httpreq=qZ1KfQdQ which took 21.64ms
From v1.3.0 I have httpreq=9UDOTaTZ which took 83.89ms


1.2 logs

{"log":"[2024-02-19T19:59:15.382Z]  INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383664552Z"}
{"log":"[2024-02-19T19:59:15.382Z] DEBUG SQL-\u003e begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383688129Z"}
{"log":"[2024-02-19T19:59:15.383Z] DEBUG SQL\u003c- begin dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.383696645Z"}
{"log":"[2024-02-19T19:59:15.384Z] DEBUG SQL-\u003e insert transactions dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38455654Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL\u003c- insert transactions sequences=[47918] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385256478Z"}
{"log":"[2024-02-19T19:59:15.385Z] DEBUG SQL-\u003e insert operations dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.385463333Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- insert operations sequences=[138572] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.386316466Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL-\u003e lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.38660824Z"}
{"log":"[2024-02-19T19:59:15.386Z] DEBUG SQL\u003c- lock default dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387084482Z"}
{"log":"[2024-02-19T19:59:15.387Z] DEBUG SQL-\u003e insert events dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.387300098Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL\u003c- insert events sequences=[372131] dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390809911Z"}
{"log":"[2024-02-19T19:59:15.390Z] DEBUG SQL-\u003e commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.390826751Z"}
{"log":"[2024-02-19T19:59:15.392Z] DEBUG SQL\u003c- commit dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392218175Z"}
{"log":"[2024-02-19T19:59:15.392Z]  INFO Emitted transaction_submitted event 314edbfc-f82d-4a32-a58e-317c13e09842 for default:7eb9f31f-48a0-4979-85f4-008b3a63d901 (correlator=,topic=contract_invoke) dbtx=ZwLqWQBZ httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.392479498Z"}
{"log":"[2024-02-19T19:59:15.392Z]  INFO Executing blockchain_invoke operation 1c2433e4-efd5-49ea-8cd5-6a8d496f8baf via handler ContractManager httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.39296641Z"}
{"log":"[2024-02-19T19:59:15.404Z]  INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (21.64ms) httpreq=qZ1KfQdQ pid=1 req=mCGqE1IS\n","stream":"stderr","time":"2024-02-19T19:59:15.406246777Z"}

1.3 logs

{"log":"[2024-02-19T13:57:59.874Z]  INFO --\u003e POST /api/v1/namespaces/default/contracts/invoke httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.876572405Z"}
{"log":"[2024-02-19T13:57:59.876Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.877197117Z"}
{"log":"[2024-02-19T13:57:59.893Z]  INFO Executing blockchain_invoke operation ec41af8e-51d7-454c-a3a7-e7d95d4549ae via handler ContractManager httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894025396Z"}
{"log":"[2024-02-19T13:57:59.893Z] DEBUG Validating method 'methodhash_set_27c24ee2991b7efee363af9e49f8cfda448be98036b314746c9bbeb81ebbc2d4' (cacheMiss=false) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894051915Z"}
{"log":"[2024-02-19T13:57:59.894Z] DEBUG EVMConnectorBody: {\"from\":\"0x4c765d85923692812d4a9c72e599b5565e49d943\",\"headers\":{\"type\":\"SendTransaction\",\"id\":\"default:ec41af8e-51d7-454c-a3a7-e7d95d4549ae\"},\"method\":{\"type\":\"function\",\"name\":\"set\",\"inputs\":[{\"name\":\"newValue\",\"type\":\"uint256\"}],\"outputs\":[]},\"params\":[164],\"to\":\"0x457e54a0e07aae6332ffdeea59e3c5580810ba0a\"} httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.894158763Z"}
{"log":"[2024-02-19T13:57:59.960Z]  INFO \u003c-- POST /api/v1/namespaces/default/contracts/invoke [202] (83.89ms) httpreq=9UDOTaTZ pid=1 req=TB5NNKuM\n","stream":"stderr","time":"2024-02-19T13:57:59.962258998Z"}

The big difference between the two sets of logs shows there have definitely been implementation changes here for us to dig into... time to look at the code...

[screenshot]

Bingpot! - Changes within the last 8 months... which is this PR: #1354, curiously labelled 'Further optimize blockchain transaction inserts to DB'. I think there's more to the story here to uncover...
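
To make that suspicion concrete, here's a conceptual Go sketch (emphatically not FireFly's actual implementation) of the trade-off such a change can introduce: handing inserts to a background batch writer amortises database cost across many rows, but an individual API request that waits for its row to be committed can now also wait on the batch flush timer. Whether that's what's happening here still needs confirming against what #1354 actually changed.

```go
// batchwriter.go - conceptual sketch only (not FireFly's implementation) of
// the trade-off discussed above: queueing DB inserts to a background batch
// writer instead of performing them inline in the HTTP handler. Batching
// reduces per-row database cost, but a request that waits for its row to be
// committed can now also wait for the batch to fill or the flush timer to fire.
package main

import (
	"fmt"
	"time"
)

type insertRequest struct {
	row  string
	done chan struct{} // closed once the row has been committed
}

// batchWriter commits queued rows either when the batch is full or when the
// flush timeout expires, whichever comes first.
func batchWriter(queue <-chan insertRequest, batchSize int, timeout time.Duration) {
	var batch []insertRequest
	timer := time.NewTimer(timeout)
	flush := func() {
		if len(batch) == 0 {
			return
		}
		// in a real writer this would be one multi-row insert + commit
		fmt.Printf("committing %d row(s) in one DB transaction\n", len(batch))
		for _, req := range batch {
			close(req.done)
		}
		batch = nil
	}
	for {
		select {
		case req := <-queue:
			batch = append(batch, req)
			if len(batch) >= batchSize {
				flush()
			}
		case <-timer.C:
			flush()
			timer.Reset(timeout)
		}
	}
}

func main() {
	queue := make(chan insertRequest, 100)
	go batchWriter(queue, 50, 25*time.Millisecond)

	// simulate a single API request queueing its insert and waiting for commit:
	// on a quiet system it waits for the 25ms flush timer rather than the
	// couple of milliseconds an inline insert would take
	req := insertRequest{row: "transactions", done: make(chan struct{})}
	start := time.Now()
	queue <- req
	<-req.done
	fmt.Printf("request waited %v for its insert to commit\n", time.Since(start))
}
```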

(Outbound) Token Connector API Mint

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://tokens_0_0:3000/api/v1/mint | 21.94ms | 98.93ms | 32.28ms | 137.8ms | 1.2 |

Looks like a ~10ms increase on average; we need to dig into the token connector to figure out what has changed recently. A walk through the code shows some changes to how we submit transactions to the blockchain, which is the prime suspect at the moment: hyperledger/firefly-tokens-erc20-erc721#128

[screenshot]

The suspected PR implements retry logic for blockchain calls, which opens up the possibility that some of the calls made during the test failed and were retried, where previously we would have allowed them to fail. I might need to conduct another run and collect logs from the tokens connector to figure out whether we're actually retrying some calls.
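
To illustrate why retries would show up as higher averages, here's a hedged sketch (in Go for consistency with the other examples; the tokens connector itself is TypeScript and this is not its code): a call that fails once and is retried after a back-off delay is observed by the caller as a single, much slower request.

```go
// retry.go - hedged sketch (not the tokens connector's actual code) of why
// retry-with-backoff inflates the *measured* average response time: one
// transient failure plus a back-off delay is reported as a single slow call.
package main

import (
	"errors"
	"fmt"
	"time"
)

// callWithRetry retries fn up to maxAttempts times with exponential backoff.
func callWithRetry(fn func() error, maxAttempts int, initialDelay time.Duration) error {
	delay := initialDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	attempts := 0
	flaky := func() error {
		attempts++
		time.Sleep(5 * time.Millisecond) // pretend each blockchain call takes ~5ms
		if attempts == 1 {
			return errors.New("transient error")
		}
		return nil
	}

	start := time.Now()
	_ = callWithRetry(flaky, 3, 20*time.Millisecond)
	// one transient failure turns a ~5ms call into a ~30ms one from the caller's view
	fmt.Printf("observed latency: %v over %d attempts\n", time.Since(start), attempts)
}
```

If the connector logs show retry attempts during the run, that would line up with the averages shifting while the cost of an individual successful call stays the same.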

Given I can't (yet) isolate a specific reason for the slowdown, I'm going to run some quick tests focused exclusively on token mint operations to see whether I can pick out a specific issue. I'm hoping that running these in isolation will show very similar results.

(Outbound) POST EVM Connector

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://evmconnect_0:5008/ | 9.74ms | 108.8ms | 21.74ms | 151.9ms | 1.2 |

Into the EVM Connector code - might be a bit tricky since to my knowledge it doesn't use the same API structure as the rest of the applications...

So after going through the code, it looks like we're doing a lot of calls for gas estimation (which makes sense really); the only PR in that space that has gone in since 1.2 looks to be hyperledger/firefly-evmconnect#85, but there's nothing immediately obvious as the cause of the slowdown.
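
For context on what each estimation costs, here's a minimal sketch of the standard `eth_estimateGas` JSON-RPC round trip to the node (this is not the firefly-evmconnect code; the endpoint URL is a placeholder, and the from/to addresses and the set(164) calldata mirror the 1.3 contract-invoke log excerpt above):

```go
// estimategas.go - a minimal sketch (not firefly-evmconnect itself) of the
// standard eth_estimateGas JSON-RPC round trip that sits behind each gas
// estimation. The endpoint URL is a placeholder; the from/to addresses and
// the set(164) call mirror the 1.3 contract-invoke log excerpt above.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

type rpcRequest struct {
	JSONRPC string        `json:"jsonrpc"`
	ID      int           `json:"id"`
	Method  string        `json:"method"`
	Params  []interface{} `json:"params"`
}

type rpcResponse struct {
	Result string          `json:"result"`
	Error  json.RawMessage `json:"error,omitempty"`
}

func main() {
	// ABI-encoded call to set(164): the 4-byte selector for set(uint256)
	// followed by the uint256 argument left-padded to 32 bytes
	callData := "0x60fe47b1" + fmt.Sprintf("%064x", 164)

	req := rpcRequest{
		JSONRPC: "2.0",
		ID:      1,
		Method:  "eth_estimateGas",
		Params: []interface{}{map[string]string{
			"from": "0x4c765d85923692812d4a9c72e599b5565e49d943",
			"to":   "0x457e54a0e07aae6332ffdeea59e3c5580810ba0a",
			"data": callData,
		}},
	}
	body, _ := json.Marshal(req)

	// placeholder geth endpoint - in the perf stack this call is made by the connector
	resp, err := http.Post("http://localhost:8545", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out rpcResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Printf("estimated gas (hex): %s, error: %s\n", out.Result, out.Error)
}
```

Every submission pays this round trip before the transaction is actually sent, so any additional estimation calls or extra node latency would show up directly in the outbound figures above.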

(Outbound) Data Exchange BLOBs API

| Route (Outbound Calls) | 1.2 avg | 1.2 max | 1.3 avg | 1.3 max | Which faster? |
| --- | --- | --- | --- | --- | --- |
| http://dataexchange_0:3000/api/v1/blobs/default/ | 14.23ms | 183.9ms | 37.91ms | 213.0ms | 1.2 |

The average slowdown looks to be ~24ms, which is significant. This one is a bit odd, though, because the repo itself hasn't had any changes for a long period of time... https://github.com/hyperledger/firefly-dataexchange-https.

Additionally, looking through the code shows that nothing in there has changed since 1.2. I'm not sure I understand enough right now to comment on what the source of the slowdown could be, but that'll be the next thing to dig into.

So after a cursory investigation of all of these, I think there might be a legitimate slowdown on the /contracts/invoke API, but for the other APIs the discrepancy in the results doesn't make a whole lot of sense. I think we're at the point where we should run a much longer test and then observe the results at the end of those runs. While those runs are going, it should be possible to investigate a bit more what's going on with the contracts API.

SamMayWork commented Feb 21, 2024

Long-Running Test

Previously, tests have been run for a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.

Configuration
nohup ./start.sh &> ffperf.log &

core-config.yml

log:
        level: debug
broadcast:
        batch:
                size: 200
                timeout: 1s
privatemessaging:
        batch:
                size: 200
                timeout: 1s
message:
        writer:
                count: 5
download:
        worker:
                count: 100
publicstorage:
                ipfs:
                        api:
                                requestTimeout: 2s
                        gateway:
                                requestTimeout: 2s

ethconnect.yml

rest:
  rest-gateway:
    maxTXWaitTime: 120
    maxInFlight: 200
    alwaysManageNonce: true
    attemptGapFill: true
    sendConcurrency: 3
    gasEstimationFactor: 2.0
confirmations:
  required: 5
debug:
  port: 6000

instances.yml

stackJSONPath: /home/ubuntu/.firefly/stacks/perf-test/stack.json

wsConfig:
  wsPath: /ws
  readBufferSize: 16000
  writeBufferSize: 16000
  initialDelay: 250ms
  maximumDelay: 30s
  initialConnectAttempts: 5
  heartbeatInterval: 5s

instances:
  - name: long-run
    tests: [{"name": "msg_broadcast", "workers":50},{"name": "msg_private", "workers":50},{"name": "blob_broadcast", "workers":30},{"name": "blob_private", "workers":30},{"name": "custom_ethereum_contract", "workers":20},{"name": "token_mint", "workers":10}]
    length: 500h
    sender: 0
    recipient: 1
    messageOptions:
      longMessage: false
    tokenOptions:
      tokenType: fungible
    contractOptions: {"address": "0x3f80fa2a99168245a54fabb774ec3ddc7e71419e"}

FireFly git commit:

8b6b604d6f17396d4ba045b0221fc6e31408dcb3

EnriqueL8 commented:

Picking this up - I see similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed, and even reports a higher number than were sent! I think it's because on rewind it receives some sort of message_confirmed event by accident... digging into this.

EnriqueL8 commented:

Have posted the results from a run on the Hardening 1.3 release issue - the problem with the difference between confirmed and submitted was due to the metrics not being correctly added, and should be fixed by #1490.

EnriqueL8 commented:

This is done
