FireFly 1.3 Performance Testing #1465
Comments
Down the rabbit hole 🐰
After performing 2 new runs of the performance test using the normal long-running configuration over the course of an hour, we get the following figures for inbound/outbound API response times:
Preliminary conclusions
*1- It's worth noting that these runs were performed for an hour, so the general trend is probably reliable, but the specific figures should not be taken as gospel.
*2- Statistics are scraped from logs from FireFly core using a custom tool which I'll contribute after some cleanup.
Given these figures it seems logical that the next areas for investigation are:
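As a rough illustration of the log-scraping approach mentioned in note 2, here is a minimal sketch that averages response times per endpoint. This is not the custom tool itself: the log line shape (`<-- METHOD PATH [STATUS] (N.NNms)`) and the function name `average_response_times` are assumptions made for the example, and the real FireFly core log format may differ.

```python
import re
from collections import defaultdict
from statistics import mean

# ASSUMPTION: response log lines look like "<-- POST /api/v1/foo [202] (12.5ms)".
# This pattern is hypothetical and would need adjusting to the real log format.
LINE_RE = re.compile(
    r"<-- (?P<method>[A-Z]+) (?P<path>\S+) \[(?P<status>\d{3})\] "
    r"\((?P<ms>\d+(?:\.\d+)?)ms\)"
)


def average_response_times(lines):
    """Return {(method, path): mean response time in ms} for matching lines."""
    samples = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that are not response logs are skipped
            key = (match["method"], match["path"])
            samples[key].append(float(match["ms"]))
    return {key: mean(values) for key, values in samples.items()}
```

Running this over the logs of a 1.2 run and a 1.3 run and comparing the two dictionaries gives a per-endpoint view of where the averages moved.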
Investigation
Going through each of the areas of investigation in no particular order.
(Inbound) Contract Invoke API
Right, so here we've gone from an average of
From 1.2 logs
1.3 logs
Noting the big difference in logs shows that there are definitely some implementation changes here for us to dig into... Time to dig into the code...
Bingpot! - Changes within the last 8 months... which is this PR: #1354, curiously labelled 'Further optimize blockchain transaction inserts to DB'. I think there's more to the story here to uncover...
(Outbound) Token Connector API Mint
Looks like an
The suspected PR implements retry logic for blockchain calls, which opens us to the possibility that some of the calls made during the test failed and were retried, where previously we would have allowed them to fail. We might need to conduct another run and collect logs from the tokens connector to figure out if we're actually retrying some calls.
Given I can't (yet) isolate a specific reason for the slowdown, I'm going to run some quick tests focused exclusively on token mint operations to see if I can pick out a specific issue here. I'm hoping that running these in isolation will show very similar results.
(Outbound) POST EVM Connector
Into the EVM Connector code - might be a bit tricky since to my knowledge it doesn't use the same API structure as the rest of the applications...
So after going through the code, it looks like we're doing a lot of calls for gas estimation (makes sense really). The only PR in that space that has gone in since 1.2 looks to be hyperledger/firefly-evmconnect#85, but nothing in it is immediately obvious as the cause of the slowdown.
(Outbound) Data Exchange BLOBs API
Average slow down looks to be
Additionally, looking into the code shows that nothing in there has changed since 1.2. I'm not sure I understand enough right now to comment on what the source of the slowdown could be, but that'll be the next thing to dig into.
So after going through and doing a cursory investigation of all of these, I think there might be a legitimate slowing on the
Long-Running Test
Previously, tests have run for less than a couple of hours at most; we're now at the point of needing to run longer tests to observe what transaction speed/throughput looks like. We're going to run the same suite as was run for 1.2 for at least 4-5 days and then compare results.
Configuration
core-config.yml
ethconnect.yml
instances.yml
FireFly git commit:
Picking this up - I see a similar behaviour where the pending count for broadcasts just keeps getting bigger and bigger, because the batch pins get stuck behind a previous one and are never confirmed. For some reason the performance CLI thinks they are confirmed, and even reports a higher number than were sent! I think it's because on rewind it receives some sort of message_confirmed by accident... Digging into this.
Have posted the results from a run on Hardening 1.3 release. The difference in confirmed over submitted was due to the metrics not being correctly added; this should be fixed by #1490.
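A cheap guard against this class of metrics bug is to assert that no counter ever reports more confirmations than submissions. The sketch below is purely illustrative: the function name `find_overcounted` and the counter names are hypothetical and do not come from the performance CLI.

```python
def find_overcounted(submitted, confirmed):
    """Return {name: (submitted, confirmed)} where confirmed exceeds submitted.

    Both arguments are plain dicts of counter name -> count. A counter that is
    missing from `submitted` is treated as zero submissions.
    """
    return {
        name: (submitted.get(name, 0), count)
        for name, count in confirmed.items()
        if count > submitted.get(name, 0)
    }
```

An empty result means the counters are at least internally consistent; any entry points at a counter that is being incremented somewhere it should not be.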
This is done
Doing performance testing ahead of the 1.3 release following the example set in #563 (comment). It's worth noting that we don't have an RC at the moment so this is mostly preliminary testing, but it's good to get some figures on the board.
Setup is the same as in the older performance testing issue:
2 FireFly nodes on one virtual server (EC2 m4.xlarge)
Entire FireFly stack is local to the server (i.e. both blockchains, Postgres databases, etc.)
Single geth node with 2 instances of ethconnect
Maximum time to confirm before considering failure = 1 minute
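To make the last setup item concrete, here is a minimal sketch of how a one-minute confirmation limit can be applied when classifying submitted transactions. The function name `classify` and the choice to count a late confirmation as a failure are assumptions for illustration; the performance CLI may treat these cases differently.

```python
from datetime import datetime, timedelta

# From the setup above: maximum time to confirm before considering failure.
CONFIRM_TIMEOUT = timedelta(minutes=1)


def classify(submitted_at, confirmed_at, now):
    """Classify one transaction as 'confirmed', 'pending', or 'failed'.

    ASSUMPTION: a confirmation arriving after the timeout counts as a failure.
    `confirmed_at` is None while no confirmation has been seen.
    """
    if confirmed_at is not None:
        on_time = confirmed_at - submitted_at <= CONFIRM_TIMEOUT
        return "confirmed" if on_time else "failed"
    # No confirmation yet: still pending until the timeout has elapsed.
    return "failed" if now - submitted_at > CONFIRM_TIMEOUT else "pending"
```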