
Improve default optimizer steps sequence in via-ir #14406

Open
nikola-matic opened this issue Jul 11, 2023 · 15 comments
Labels: must have (Something we consider an essential part of Solidity 1.0.), optimizer, viair

nikola-matic (Collaborator):

The default optimizer step sequence we have at the moment is quite large and thus inflates compilation times.

```cpp
// libsolidity/interface/OptimiserSettings.h
static char constexpr DefaultYulOptimiserSteps[] =
    "dhfoDgvulfnTUtnIf"            // None of these can make stack problems worse
    "["
        "xa[r]EscLM"               // Turn into SSA and simplify
        "cCTUtTOntnfDIul"          // Perform structural simplification
        "Lcul"                     // Simplify again
        "Vcul [j]"                 // Reverse SSA

        // should have good "compilability" property here.

        "Tpeul"                    // Run functional expression inliner
        "xa[rul]"                  // Prune a bit more in SSA
        "xa[r]cL"                  // Turn into SSA again and simplify
        "gvif"                     // Run full inliner
        "CTUca[r]LSsTFOtfDnca[r]Iulc" // SSA plus simplify
    "]"
    "jmul[jul] VcTOcul jmul";      // Make source short and pretty
```

The outcome should hopefully be a shorter sequence that achieves the same level of optimization while reducing compile times (in the context of the optimizer pipeline). Special attention should be paid to steps that destroy the SSA form of their input (SSA being better suited for optimizing), thus causing subsequent steps to perform poorly.
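Candidate sequences can be tried out without recompiling solc via the `--yul-optimizations` option, where each letter is one optimizer step and a bracketed segment is repeated until it stops improving the code. As an illustration of the sequence grammar (this validator is a hypothetical sketch, not solc's actual parser), a minimal check that a candidate sequence's brackets are balanced:

```python
def validate_sequence(seq: str) -> None:
    """Check that loop brackets in a Yul optimizer step sequence are balanced.

    Illustrative only: solc performs its own, stricter validation
    (e.g. of the step abbreviations themselves), which is not
    reproduced here.
    """
    depth = 0
    for ch in seq:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ']' in sequence")
    if depth != 0:
        raise ValueError("unclosed '[' in sequence")

# The default sequence from OptimiserSettings.h, with nested loops.
default_steps = (
    "dhfoDgvulfnTUtnIf"
    "["
        "xa[r]EscLM"
        "cCTUtTOntnfDIul"
        "Lcul"
        "Vcul [j]"
        "Tpeul"
        "xa[rul]"
        "xa[r]cL"
        "gvif"
        "CTUca[r]LSsTFOtfDnca[r]Iulc"
    "]"
    "jmul[jul] VcTOcul jmul"
)
validate_sequence(default_steps)  # passes silently; raises on malformed input
```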

ekpyron (Member) commented Jul 11, 2023:

Just to elaborate a bit: the current sequence transforms to SSA and back in a loop.
This is partly a legacy of the old code transform, in which the number of variables translated rather directly into required stack slots. The new code transform does not suffer from this and should be able to generate efficient code directly from SSA.

Based on that, we should generally be able to transform to SSA and stay in SSA up until code generation. However, that would require ensuring that none of the steps that perform meaningful optimizations destroy the SSA property.

Actually changing any such steps to preserve SSA form I'd consider a separate task. This first task should merely investigate what can be achieved by modifying the sequence without actually touching the steps, keeping in mind that, with the current code transform, transforming back from SSA should no longer add value or contribute to "compilability".

ekpyron (Member) commented Oct 16, 2023:

While we're at it, we should double-check whether the effects of the StackCompressor are actually still positive these days or whether we should just remove it (simplifying our entire pipeline).

cameel (Member) commented Feb 26, 2024:

For future reference, here are the initial results I showed last week: solc-seqbench-intro-and-initial-results.md.

cameel (Member) commented Feb 26, 2024:

And here's a bit more analysis:

  • Now with plots and tables that compare all sequences on the same contract or all contracts with the same sequence.
  • I added an extra test sequence: small-loops. It's just the default one with the big loop replaced with looping each component individually.

I haven't analyzed it all yet so here are the plots and some casual observations. I still need to go over those plots in more detail.

Results

Contract with different sequences

Sequence with different contracts

Observations

Contract deposit_contract
  • single-pass takes half the time and gets only slightly worse results.
  • The others get almost the same results, and their timing is very close (though small-loops beats default, which beats always-ssa-min).
  • small-loops executes significantly fewer steps than default and always-ssa-min, but the timing differs only minimally.
  • small-loops converges visibly slower, especially in terms of bytecode size. default and always-ssa-min would beat it if we stopped as soon as they converge.
  • Time wasted after convergence: roughly half the run for all sequences except single-pass in terms of execution, and between a quarter and a half in terms of bytecode size.
  • Improvements seem to happen sharply at specific steps, and these seem to be the same steps in each sequence. The first ones are: c, M, e, l, T.
  • Almost all of the improvement seems to happen in the first cycle of the main loop.
Contract ramanujan_pi
  • single-pass converges very quickly and is only a little worse in terms of execution time. For bytecode size it's even the same as small-loops.
    • small-loops takes 4x the time to converge, and the other two take 10x.
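The convergence point can be detected mechanically from the per-step metric series (bytecode size or estimated gas after each step, as produced by a harness such as seqbench; the helper below is an illustrative sketch, not seqbench code):

```python
def convergence_point(metric_per_step):
    """Index of the last step that improved on the best value so far.

    `metric_per_step` holds e.g. the bytecode size after each optimizer
    step (lower is better). Steps after the returned index are wasted
    time: they never beat the best metric already reached.
    """
    best = float("inf")
    last_improvement = -1
    for i, value in enumerate(metric_per_step):
        if value < best:
            best = value
            last_improvement = i
    return last_improvement

# Hypothetical size series: converges at index 3, the last three steps are wasted.
sizes = [100, 90, 90, 70, 70, 70, 70]
assert convergence_point(sizes) == 3
```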

cameel (Member) commented Feb 26, 2024:

Since the single-pass sequence looks very promising, I decided to see how it would fare in our CI before analyzing the results in more detail. Here's a test PR with the changes (#14887); I posted benchmarks in its comments.

Results look promising: a 17-46% decrease in total compilation time. Gas benchmarks also aren't too bad. While I can see up to a 25% increase in gas in a few tests, the majority are impacted very little, and benchmark diffs for external tests look much more reasonable: most of them below 1.5% for both deployment and runtime. Also, in many cases costs actually decrease. ENS is an outlier with bytecode size increasing 5%, though on the other hand ElementFI had its bytecode decrease by 4%.

Overall I think that we might be able to tweak this sequence enough to get results very close to the current default.

Also, it's worth noting that the running-time decrease we're seeing with this sequence is very likely close to the limit of what tweaking the sequence alone can achieve. The plots show that most of the gains happen in that first pass, so we'll need to keep most of it in one form or another.

cameel (Member) commented Feb 28, 2024:

no-cse sequences

Here's the data for the sequences I analyzed yesterday. This time I tried removing some c (CommonSubexpressionEliminator) and j (ExpressionJoiner) steps that seemed to degrade the results for deposit_contract. The results are mixed: it helps in some cases and makes things worse in others.

Contract with different sequences

Sequence with different contracts

cameel (Member) commented Feb 28, 2024:

And here's some analysis of the data I gathered so far. This is more or less what I presented on the call today.

Analysis

  • All the sequences I checked out so far differ very little in terms of final bytecode size and runtime cost.
    • While small, the differences are not necessarily negligible. However, none of the sequences is consistently better or worse than the others; it depends on the input.
  • The thing that does differ is primarily the optimization time:
    • The single-pass sequences clearly win over the multi-pass sequences here. Most of the gain is achieved on the first pass. Further passes do improve the result in many cases but only minimally, which is not worth the 2-3x increase in optimization time.
    • always-ssa-min is definitely worse than the default sequence here. It usually takes just as much time, and there were cases where it was almost 2x slower for almost no additional gain.
    • The no-cse variants are generally slightly faster than their equivalents, due to having fewer steps. This indicates that dropping some steps may be worthwhile.
  • ramanujan_pi is the contract that converges the slowest. Bytecode size converges only after 6 repetitions of the main loop and runtime cost after 4. Stopping after just a single pass gives worse results, though it's worth noting that subsequent passes make things worse before they make them better so ultimately the single-pass sequences do not end up being as much worse as one might expect. There may be a way to achieve results comparable with multi-pass in a more efficient way.
  • The small-loops sequence in some cases finishes significantly faster than default (~60% faster for ramanujan_pi) and never much longer. Still, it spends a lot of time not improving, just like default, so the single-pass variants generally seem more promising.

Conclusions

It seems that a single-pass sequence is the way to go. Such sequences are significantly faster than default and come very close in terms of optimization; with enough tweaking they could probably even be better.

Even if we don't manage to consistently beat default with such a sequence, a safe option would be to expose the MaxRounds constant for the main loop as an optimization parameter and set it to 1 by default. In most cases this would provide enough optimization, and users could decide on their own whether to spend more time for minimally better results.
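The MaxRounds idea can be pictured as unrolling the main loop a fixed number of times instead of iterating to a fixpoint. A hypothetical helper (not part of solc) that expands the outermost bracketed segment of a sequence into N literal copies:

```python
def unroll_outer_loop(seq: str, rounds: int) -> str:
    """Replace the outermost bracketed loop with `rounds` literal copies.

    E.g. unroll_outer_loop("a[bc]d", 2) == "abcbcd". Only the first
    outermost loop is expanded; brackets nested inside it are kept
    intact, so they still iterate to a fixpoint.
    """
    start = seq.find("[")
    if start == -1:
        return seq  # no loop to unroll
    depth = 0
    for i in range(start, len(seq)):
        if seq[i] == "[":
            depth += 1
        elif seq[i] == "]":
            depth -= 1
            if depth == 0:
                body = seq[start + 1:i]
                return seq[:start] + body * rounds + seq[i + 1:]
    raise ValueError("unclosed '[' in sequence")

assert unroll_outer_loop("a[bc]d", 2) == "abcbcd"
```

With `rounds=1` this is essentially the single-pass variant; exposing the count as a parameter would let users trade compilation time for marginal gains.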

cameel (Member) commented Mar 6, 2024:

Some more loose observations from looking at seqbench results recently:

  • The cleanup sequence has almost no effect on bytecode size or runtime cost.
  • On the other hand, the part between the main loop and the cleanup sequence (described as "make source short and pretty") is crucial. This is usually what brings the final cost/size to the lowest point, especially in the single-pass sequences. The jmul part seems to have a significant effect on bytecode size.
  • There seems to be no benefit in looping parts of the sequence individually (i.e. as in the small-loops sequence).
    • Only "SSA plus simplify" seems to give some limited bytecode size gains with repetition, e.g. in the ramanujan_pi contract.
  • The lowest bytecode size tends to be reached after TU steps, but the current sequence always reverses this gain with c. Especially in the cleanup part that seems undesirable.

cameel (Member) commented Mar 6, 2024:

the-good-parts sequence

This is a sequence I put together by starting with single-pass and trying to remove the parts that did not contribute much to the final result. I then refined it on the ramanujan_pi contract by distilling the bits of the default sequence that improved its bytecode size with every repetition.

Contract with different sequences

Sequence with different contracts

cameel (Member) commented Mar 6, 2024:

And here's the overall comparison of all the sequences I tested so far.

Remarks

  • Compilation time was measured in a different way than optimization time, so the numbers may not add up; treat them as very approximate.
  • Percentages are based on the initial, unoptimized value (keep that in mind when comparing with percentages from e.g. gas_diff_stats.py, which does not know the original value).
  • The min column is the minimum reached at any point in the sequence. It's reported only when it's lower than the final value.
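To make the two percentage conventions concrete (the numbers below are made up for illustration):

```python
def pct_vs_baseline(optimized, unoptimized):
    """Improvement relative to the unoptimized value, as in these tables."""
    return (optimized - unoptimized) / unoptimized * 100

def pct_vs_other(a, b):
    """Relative difference between two optimized results
    (the gas_diff_stats.py-style convention)."""
    return (a - b) / b * 100

# Hypothetical bytecode sizes: unoptimized, default sequence, candidate sequence.
unopt, default_seq, candidate = 1000, 630, 650
assert round(pct_vs_baseline(default_seq, unopt), 1) == -37.0
assert round(pct_vs_baseline(candidate, unopt), 1) == -35.0
# The same two optimized results look like a 3% regression in the other convention:
assert round(pct_vs_other(candidate, default_seq), 2) == 3.17
```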

Contract deposit_contract

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -15.5% -37.0% -39.4% -20.1% -21.8% 343 ms 207 ms 430 ms
the-good-parts -15.3% -39.3% -21.7% 160 ms 203 ms 259 ms
single-pass -14.1% -14.9% -34.4% -38.9% -18.7% -21.5% 154 ms 184 ms 304 ms
always-ssa-min -15.5% -15.6% -36.3% -39.0% -19.7% -21.5% 363 ms 174 ms 466 ms
always-ssa -15.2% -36.7% -38.7% -20.0% -21.3% 236 ms 184 ms 377 ms
default-no-cse -15.5% -39.4% -21.8% 298 ms 210 ms 394 ms
single-pass-no-cse -14.8% -38.4% -21.0% 134 ms 182 ms 255 ms
small-loops -15.4% -15.6% -36.2% -39.2% -19.7% -21.6% 282 ms 178 ms 441 ms

Contract FixedFeeRegistrar

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -5.6% -33.8% -34.3% -31.0% -31.5% 187 ms 170 ms 193 ms
the-good-parts -5.6% -32.4% -34.3% -29.7% -31.5% 92 ms 92 ms 135 ms
single-pass -2.9% -3.0% -32.9% -34.5% -30.1% -31.6% 102 ms 87 ms 153 ms
always-ssa-min -3.0% -34.1% -35.1% -31.2% -32.1% 382 ms 158 ms 248 ms
always-ssa -2.9% -3.0% -32.9% -34.5% -30.1% -31.6% 130 ms 88 ms 188 ms
default-no-cse -5.6% -32.6% -34.1% -29.8% -31.2% 147 ms 78 ms 195 ms
single-pass-no-cse -3.1% -35.1% -35.3% -32.2% -32.3% 133 ms 94 ms 160 ms
small-loops -2.9% -3.0% -33.4% -35.0% -30.6% -32.1% 149 ms 84 ms 196 ms

Contract prbmath_unsigned

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -44.4% -30.8% -30.3% 550 ms 596 ms 600 ms
the-good-parts -44.5% -29.6% -29.1% 249 ms 444 ms 409 ms
single-pass -44.4% -30.7% -30.8% -30.2% -30.3% 259 ms 420 ms 410 ms
always-ssa-min -44.4% -30.8% -30.3% 784 ms 564 ms 827 ms
always-ssa -44.4% -30.8% -30.3% 455 ms 498 ms 620 ms
default-no-cse -44.4% -29.5% -29.0% 510 ms 613 ms 638 ms
single-pass-no-cse -44.3% -29.7% -29.2% -29.3% 241 ms 433 ms 614 ms
small-loops -44.4% -30.6% -30.7% -30.1% 472 ms 403 ms 626 ms

Contract ramanujan_pi

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -67.2% -67.7% -39.2% -39.9% -36.2% -36.8% 520 ms 129 ms 622 ms
the-good-parts -67.2% -67.5% -42.9% -39.6% 135 ms 134 ms 233 ms
single-pass -64.1% -64.8% -30.1% -33.5% -27.8% -30.9% 109 ms 109 ms 170 ms
always-ssa-min -66.8% -67.3% -40.2% -37.1% 531 ms 114 ms 602 ms
always-ssa -65.7% -66.4% -30.8% -42.2% -28.4% -39.0% 160 ms 124 ms 232 ms
default-no-cse -67.7% -37.3% -39.9% -34.5% -36.8% 542 ms 129 ms 540 ms
single-pass-no-cse -65.0% -25.9% -32.1% -23.9% -29.6% 89 ms 118 ms 151 ms
small-loops -66.3% -67.0% -28.8% -31.9% -26.5% -29.5% 303 ms 109 ms 379 ms

Contract strings

runtime gas min bytecode size min creation gas min optimization time compilation time (unoptimized) compilation time (optimized)
default -11.9% -12.1% -26.1% -27.7% -24.6% -26.1% 287 ms 169 ms 430 ms
the-good-parts -11.9% -25.9% -27.7% -24.3% -26.1% 134 ms 185 ms 239 ms
single-pass -11.6% -11.8% -23.7% -27.7% -22.3% -26.1% 121 ms 161 ms 219 ms
always-ssa-min -11.9% -12.1% -25.4% -25.9% -23.9% -24.4% 330 ms 173 ms 406 ms
always-ssa -11.8% -12.0% -23.1% -29.8% -21.7% -28.1% 185 ms 178 ms 295 ms
default-no-cse -12.1% -26.2% -24.7% 311 ms 182 ms 389 ms
single-pass-no-cse -11.9% -25.3% -23.9% 108 ms 169 ms 212 ms
small-loops -11.6% -11.8% -25.4% -27.7% -23.9% -26.1% 255 ms 164 ms 349 ms

cameel (Member) commented Mar 6, 2024:

Final summary for the-good-parts sequence

Comparison with default

Cost differences in CI:

  • most between -1% and +3%
  • one outlier at -6% (ramanujan_pi.sol)
  • several outliers between +4% and +9% (including base64.sol and erc20.sol)

External tests:

  • bytecode size changes between -1.3% and +2.1%
  • runtime cost changes between 0% and +0.3% with one outlier at +2.1% (uniswap)
  • effect on legacy pipeline:
    • bytecode size changes between 0% and +0.2%
    • negligible runtime cost changes

Timing of external tests:

  • Improvement between -15% and -45%.
    • Timing improved in all cases.
  • Runtime of bytecode comparison job for optimized IR decreased by 25%.

seqbench results (values where the new sequence is worse in bold):

default runtime gas the-good-parts runtime gas default bytecode size the-good-parts bytecode size default compilation time (optimized) the-good-parts compilation time (optimized)
deposit_contract -15.5% -15.3% -37.0% -39.3% 430 ms 259 ms
FixedFeeRegistrar -5.6% -5.6% -33.8% -32.4% 193 ms 135 ms
prbmath_unsigned -44.4% -44.5% -30.8% -29.6% 600 ms 409 ms
ramanujan_pi -67.2% -67.2% -39.2% -42.9% 622 ms 233 ms
strings -11.9% -11.9% -26.1% -25.9% 430 ms 239 ms

Conclusions

The new sequence clearly beats the single-pass sequence. Comparison with default is more mixed.

The outliers seem a bit concerning. Still, they're not extreme and go both ways so the new sequence is not consistently worse in any metric. In many cases it's a little better.

Overall results seem close enough that we're probably fine using it. I'm pretty sure the sequence can still be refined a little if we spend more time on it. All the sequences I tested have cases where they're better than default by a few percent, most of the time without doing as many repetitions. The trick is to get one that can do it consistently on most input.

cameel (Member) commented Mar 7, 2024:

Effects of removing StackCompressor

The results are oddly mixed: from completely neutral to very negative.

On one hand, I see almost no change in gas usage in our semantic tests.

I also ran seqbench on the default and the-good-parts sequences. There was literally zero difference other than slightly different compilation times (probably within the margin of error): exactly the same bytecode sizes and execution gas. I did spot checks on the generated Yul and the files are identical, to the point that I'm questioning whether my patched solc really worked correctly, but so far I see no evidence that it did not.

On the other hand, results of external tests are terrible (ir-optimize-evm+yul):

project bytecode_size deployment_gas method_gas
brink 0%
colony 0%
ens +0.49% ❌ 0% 0%
perpetual-pools +9.09% ❌ +9.19% ❌ +1.6% ❌
pool-together +0.86% ❌
uniswap +4.05% ❌ +3.59% ❌ +5.87% ❌
yield_liquidator +2.03% ❌ +2.29% ❌ +0.08% ❌
zeppelin +0.01% ❌ +0.01% ❌ +0.02% ❌

In some cases there was zero difference, but in others up to a 10% increase in bytecode size and a 6% increase in gas usage, with no cases where results improved.

Finally, the change caused some new StackTooDeep issues. Only a few semantic/syntax tests in our own test suite are affected, but also 7 of the 15 external tests we have.

cameel (Member) commented Mar 9, 2024:

the-good-parts-mk2 sequence

Comparison with default

Cost differences in CI:

  • Same as or better than the-good-parts. Still mostly between -1% and +3% with outliers at -6% and +4-9%, but there are also a number of cases where a 2-4% increase became 0-1%.

External tests:

  • bytecode size changes between -2.4% and +1.65%: better than the-good-parts
  • runtime cost changes between 0% and +0.15% with one outlier at +2% (uniswap): better than the-good-parts
  • effect on legacy pipeline in the same range as the-good-parts (bytecode size changes between 0% and +0.3% and negligible runtime cost changes)

seqbench results
(values that differ from the-good-parts in bold):

default runtime gas the-good-parts-mk2 runtime gas the-good-parts runtime gas default bytecode size the-good-parts-mk2 bytecode size the-good-parts bytecode size default compilation time the-good-parts-mk2 compilation time the-good-parts compilation time
deposit_contract -15.5% -15.3% -15.3% -37.0% -39.4% -39.3% 430 ms 248 ms 259 ms
erc20 -4.2% -4.2% -4.1% -37.0% -35.9% -34.2% 149 ms 113 ms 94 ms
FixedFeeRegistrar -5.6% -5.6% -5.6% -33.8% -32.8% -32.4% 193 ms 120 ms 135 ms
prbmath_unsigned -44.4% -44.5% -44.5% -30.8% -29.6% -29.6% 600 ms 403 ms 364 ms
ramanujan_pi -67.2% -67.2% -67.2% -39.2% -42.9% -42.9% 622 ms 238 ms 233 ms
strings -11.9% -11.9% -11.9% -26.1% -27.2% -25.9% 430 ms 236 ms 239 ms

Conclusions

The new sequence is the same as or better than the-good-parts in nearly all cases. Speed is also pretty much the same.

It's still not quite as good as default, but it's also still much faster.

cameel (Member) commented Apr 22, 2024:

Comparison of the-good-parts sequence variants

seqbench results on cancun

Contracts

New contracts:

Comparison between contracts for each sequence

Note that I excluded chains.sol here because the range of values makes it hard to compare with other contracts.

Which variant of the-good-parts is the best?

Quality of optimization
  • External tests: in almost all cases mk3 is a minimal improvement over mk2. mk2 was also better than mk1.
  • seqbench: bytecode size and runtime gas differences are usually fractions of a percent. In the few cases where they are bigger mk2 and mk3 beat mk1. mk2 is sometimes minimally better than mk3.
  • Tests in the repo: mk2 is a small improvement over mk1 in some cases. mk3 is very similar to mk2.
  • Optimized output in the repo: mk3 addresses some cases where code snippets from the repo were not fully optimized. Still not all of them. Haven't compared mk1 and mk2 in that regard.

Overall: mk3 >= mk2 >= mk1. But differences are small.

Compilation time
  • External tests: Neither mk3 nor mk2 seems consistently faster in CI. There are differences and sometimes they seem significant, but the timing variance in CI is so big that I think they are still the result of that variance.
  • seqbench: mk2 is only slightly slower than mk1. mk3 is more visibly slower than mk2, but still very close. Compared to default they can be considered pretty much the same. The difference comes from each version having a few more steps.
  • Timing benchmark in the repo: differences between mk2 and mk3 are in favor of the latter, but are very small, probably smaller than normal variance.

Overall: mk1 >= mk2 >= mk3. But again, differences are small.

Conclusion

All of the sequences are pretty close. I'd recommend mk3 because it has a small edge in terms of optimization quality and the difference in timing is not significant. It's also the one I spent the most time analyzing so I don't see much point in going back to the other two. Still, it seems that improvements were marginal at best.

the-good-parts-mk3 vs default

Quality of optimization

Cost differences in CI:

  • Still mostly between -1% and +3% with outliers at -5% and 4-9%.

External tests:

  • bytecode size changes between -1.85% and +1.65%.
  • runtime cost between 0% and +0.15%.
    • Note that the +2% outlier for uniswap we've seen with mk1 is gone, but this is probably due to the switch to Cancun, not improvements in the sequence.
  • effect on legacy pipeline: bytecode size between 0% and +0.3%, runtime cost change negligible.

seqbench:

  • runtime cost improvement differences are negligible in all cases.
  • bytecode size improvement differs by 1-2% either way, so it's a wash. The only real outlier is the pathological chains.sol, where default increases the size by 252.3%, while the-good-parts-mk3 increases it by "only" 227.2%.
Compilation time

External tests:

  • New sequence is 10-65% faster, depending on the test (with colony being a single outlier at +6%).

seqbench:

  • Compilation time decreases by 1/3 to 2/3 depending on the test.

Timing benchmark:

  • The compilation time is halved in all three cases.

Status: In Progress
4 participants: @cameel, @ekpyron, @nikola-matic, and others