[WIP] Avoid computing flags even lazy #466

Open
wants to merge 11 commits into base: master

Conversation

@viorelcanja (Contributor) commented May 7, 2021

This PR tries to optimize lazy flags computation. In the current implementation, information about the operation and operands is stored for lazy evaluation, and in some cases that information is overwritten by subsequent operations before it is ever used. In other cases, operations force generation of particular flags since they only do partial lazy computation.

The proposed optimization adds metadata to the analyzer so that it can identify which flags are tested and which are modified by each instruction. Using this information, each jitted BB is analyzed for potential optimizations: instructions that generate flags which are completely overwritten by subsequent instructions, and thus never used, are replaced with alternate implementations that do no flags computation (this includes not storing anything for lazy computation).

This improves performance significantly in some benchmarks, for example a tight loop that does mostly register operations.

Since it was not clear to me what the correct metadata is for some instructions, an instruction that does not have metadata disables the optimization from that point on until the end of the BB.
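
To illustrate the kind of analysis described above, here is a minimal, hypothetical sketch (the names FLAG_*, InstrFlagInfo and mark_flag_liveness are made up for this example and are not the code in this PR): a backward walk over a basic block tracks which flags are still live and marks instructions whose flag results are never read. The handling of missing metadata here is simply conservative; the PR's actual behaviour of disabling the optimization until the end of the BB may differ in detail.

```rust
// Hypothetical sketch of a per-basic-block flags-liveness pass; illustrative
// only, not the actual implementation in this PR.

const FLAG_CF: u8 = 1 << 0;
const FLAG_PF: u8 = 1 << 1;
const FLAG_AF: u8 = 1 << 2;
const FLAG_ZF: u8 = 1 << 3;
const FLAG_SF: u8 = 1 << 4;
const FLAG_OF: u8 = 1 << 5;
const FLAG_ALL: u8 = FLAG_CF | FLAG_PF | FLAG_AF | FLAG_ZF | FLAG_SF | FLAG_OF;

struct InstrFlagInfo {
    tested: Option<u8>,   // flags this instruction reads; None = metadata unknown
    modified: Option<u8>, // flags this instruction writes; None = metadata unknown
    needs_flags: bool,    // filled in by the analysis
}

// Walk the block backwards, tracking which flags are still "live", i.e. may be
// read before being completely overwritten. An instruction whose modified
// flags are all dead can be emitted as a "_noflags" variant that skips the
// lazy-flags bookkeeping entirely.
fn mark_flag_liveness(block: &mut [InstrFlagInfo]) {
    // At the end of a block, assume every flag may still be observed later.
    let mut live = FLAG_ALL;
    for instr in block.iter_mut().rev() {
        match (instr.tested, instr.modified) {
            (Some(tested), Some(modified)) => {
                instr.needs_flags = (live & modified) != 0;
                live = (live & !modified) | tested;
            }
            // Missing metadata: fall back to the conservative case.
            _ => {
                instr.needs_flags = true;
                live = FLAG_ALL;
            }
        }
    }
}
```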

Still to do:

  • the metadata was generated semi-automatically and could be wrong in some places; it should be double-checked
  • I am just beginning to learn Rust (I know enough to be dangerous), so I am probably using non-idiomatic Rust and will probably need to go back and change some code
  • I need to do more testing
  • I need to do more profiling
  • I need to adjust some tests (generated wasm has been changed)

Challenges:

  • duplicating functions to include a "_noflags" version is tedious and error-prone; on the other hand, dealing with this at runtime (including the check for flag computation in the jitted code) negates some of the performance advantage (see the sketch after this list)
  • instruction metadata in the analyzer must be kept in sync with the actual implementation of the jit and interpreter, and the regular and _noflags jit versions also need to stay in sync with each other
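
For illustration only, a hypothetical pair of helpers showing the shape of that duplication (add32 / add32_noflags and the LAST_* state here are made up for this sketch, not the actual v86 helpers):

```rust
use std::cell::Cell;

thread_local! {
    // Stand-ins for the state recorded for lazy flags computation.
    static LAST_OP1: Cell<i32> = Cell::new(0);
    static LAST_RESULT: Cell<i32> = Cell::new(0);
}

// Regular variant: performs the operation and records the inputs/result so
// the flags can be derived later, on demand.
fn add32(x: i32, y: i32) -> i32 {
    let result = x.wrapping_add(y);
    LAST_OP1.with(|c| c.set(x));
    LAST_RESULT.with(|c| c.set(result));
    result
}

// "_noflags" variant: identical arithmetic, no lazy-flags bookkeeping.
// Every such pair has to be written and then kept in sync by hand.
fn add32_noflags(x: i32, y: i32) -> i32 {
    x.wrapping_add(y)
}
```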

Initial benchmarks:

NEW

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          230.87  :       5.92  :       1.94
STRING SORT         :          7.0302  :       3.14  :       0.49
BITFIELD            :      1.0887e+08  :      18.68  :       3.90 <----
FP EMULATION        :          19.962  :       9.58  :       2.21
FOURIER             :          563.87  :       0.64  :       0.36
ASSIGNMENT          :          5.8462  :      22.25  :       5.77 <----
IDEA                :          544.41  :       8.33  :       2.47
HUFFMAN             :          341.86  :       9.48  :       3.03 <----
NEURAL NET          :         0.38348  :       0.62  :       0.26
LU DECOMPOSITION    :          13.835  :       0.72  :       0.52
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 9.261
FLOATING-POINT INDEX: 0.657
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 2.220
INTEGER INDEX       : 2.381
FLOATING-POINT INDEX: 0.364
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

----

OLD

BYTEmark* Native Mode Benchmark ver. 2 (10/95)
Index-split by Andrew D. Balsa (11/97)
Linux/Unix* port by Uwe F. Mayer (12/96,11/97)

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          226.04  :       5.80  :       1.90
STRING SORT         :          7.0279  :       3.14  :       0.49
BITFIELD            :      6.4915e+07  :      11.14  :       2.33 <----
FP EMULATION        :           19.46  :       9.34  :       2.15
FOURIER             :          553.47  :       0.63  :       0.35
ASSIGNMENT          :          4.4541  :      16.95  :       4.40 <----
IDEA                :          545.29  :       8.34  :       2.48
HUFFMAN             :          321.95  :       8.93  :       2.85 <----
NEURAL NET          :         0.38921  :       0.63  :       0.26
LU DECOMPOSITION    :          14.207  :       0.74  :       0.53
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 8.150
FLOATING-POINT INDEX: 0.662
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 1.707
INTEGER INDEX       : 2.320
FLOATING-POINT INDEX: 0.367
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

@viorelcanja marked this pull request as ready for review May 7, 2021 07:41
@viorelcanja (Contributor, Author) commented:

I added a new optimization, unrelated to the old one, to be able to see the cumulative effect. Please let me know if I should split the PR.

The new benchmarks are:

TEST                : Iterations/sec.  : Old Index   : New Index
                    :                  : Pentium 90* : AMD K6/233*
--------------------:------------------:-------------:------------
NUMERIC SORT        :          240.29  :       6.16  :       2.02
STRING SORT         :           7.026  :       3.14  :       0.49
BITFIELD            :      1.1096e+08  :      19.03  :       3.98
FP EMULATION        :          20.457  :       9.82  :       2.27
FOURIER             :          571.01  :       0.65  :       0.36
ASSIGNMENT          :           6.089  :      23.17  :       6.01
IDEA                :          634.61  :       9.71  :       2.88
HUFFMAN             :          360.39  :       9.99  :       3.19
NEURAL NET          :         0.40564  :       0.65  :       0.27
LU DECOMPOSITION    :          14.703  :       0.76  :       0.55
==========================ORIGINAL BYTEMARK RESULTS==========================
INTEGER INDEX       : 9.707
FLOATING-POINT INDEX: 0.686
Baseline (MSDOS*)   : Pentium* 90, 256 KB L2-cache, Watcom* compiler 10.0
==============================LINUX DATA BELOW===============================
CPU                 : GenuineIntel 0f/06 1000MHz
L2 Cache            : 6144 KB
OS                  : Linux 4.19.0-16-686
C compiler          : gcc version 8.3.0 (Debian 8.3.0-6)
libc                : libc-2.28.so
MEMORY INDEX        : 2.264
INTEGER INDEX       : 2.548
FLOATING-POINT INDEX: 0.380
Baseline (LINUX)    : AMD K6/233*, 512 KB L2-cache, gcc 2.7.2.3, libc-5.4.38

@viorelcanja closed this May 8, 2021
@viorelcanja reopened this May 8, 2021
@copy (Owner) commented May 8, 2021

Very nice contribution, thanks! I'll need some time to review this in detail, but here are some initial thoughts:

  • First of all, this is clearly a worthwhile optimisation
  • An interesting data point would be how often this optimisation prevents the generation of lazy flags. You can use the existing APIs to generate stats from the generated code at runtime, see profiler.rs, src/browser/print_stats.js and codegen::gen_profiler_stat_increment. Run make debug-with-profiler and use debug.html to see the statistics at the bottom of the page
  • The profiling info may also inform whether extending the set of functions called at runtime is worth it, for example add16, which is called from the generated module, vs add32, which is generated without a function call. The former is on its way out (see for example my recent change d929f4d), so we might as well leave called functions like add16 unoptimised for now. Alternatively, generate code in the noflags case and generate a call otherwise.
  • The nasmtest generator script (tests/nasm/create_tests.js) should be extended to hit both cases at least once (for each arithmetic instruction, once where flags are tested after the instruction (e.g. by adding a pushf) and once where they are overridden after the instruction). You can bump NO_TESTS to 10 or so for local testing and randomise which of the two cases is hit.
  • Not urgent, but you should at some point start running rustfmt on your code (cargo fmt --all)

Unfortunately, any instruction that may trigger a fault, including all memory accesses, may read flags, so in the following sequence the flags for add have to be computed:

add eax, ebx
mov [edx], 42

We may later have another optimisation that moves the flags computation into the slow path of the memory access.

> Please let me know if I should split the PR.

Yes, please. Your second optimisation seems useful on its own and will make this PR easier to review.

@viorelcanja (Contributor, Author) commented May 9, 2021

I reverted the commit with the mem fast path optimization and created a new PR.

  • re: profiling, I will add it
  • re: tests, I will add tests to cover both code paths
  • re: formatting, of course :) I just need to get a bit used to the whole Rust environment

Re: exceptions, agreed. Although for "well behaved" code it should not be an issue, in general it is: you can get deterministic behavior by generating exceptions on purpose and reading the flags, so they have to be computed properly. For the general case we cannot push the flags computation into the memory-access slow path:

1 add eax,ebx
2 mov eax,0xffffffff
3 mov ebx,0xffffffff
4 mov [edx], eax
5 add eax,ebx

With the current optimization, information about how to compute the flags is not saved for instruction 1, since instruction 5 completely rewrites the flags. If instruction 4 triggers an exception, we need the flags, and without the saved state (which is now optimized away) we cannot compute them. We could partially recover the performance by looking at the instructions at analysis time and still optimizing away the state saving when the operations between 1 and 4 allow the necessary state to be reconstructed. For instance, the following sequence could still be optimized, since the subsequent operations do not destroy the required state and the information could be recovered in the slow path:

  1. add eax, ebx
  2. xchg eax, esi
  3. xchg ebx,edi
  4. mov [edx], eax
  5. add eax,ebx

Still, it gets complicated and I am not sure it's worth implementing. Maybe a flag to ignore corner cases (like enable_optimistic_optimizations) would add more value for people looking to run "well behaved" code. I'm curious what you think of this.

Edit: changed the 2nd example to be more relevant.

@copy (Owner) commented May 17, 2021

> I reverted the commit with the mem fast path optimization and created a new PR.

Thanks!

> Re: exceptions, agreed. Although for "well behaved" code it should not be an issue, in general it is: you can get deterministic behavior by generating exceptions on purpose and reading the flags, so they have to be computed properly. For the general case we cannot push the flags computation into the memory-access slow path:

> 1 add eax,ebx
> 2 mov eax,0xffffffff
> 3 mov ebx,0xffffffff
> 4 mov [edx], eax
> 5 add eax,ebx

> With the current optimization, information about how to compute the flags is not saved for instruction 1, since instruction 5 completely rewrites the flags. If instruction 4 triggers an exception, we need the flags, and without the saved state (which is now optimized away) we cannot compute them. We could partially recover the performance by looking at the instructions at analysis time and still optimizing away the state saving when the operations between 1 and 4 allow the necessary state to be reconstructed. For instance, the following sequence could still be optimized, since the subsequent operations do not destroy the required state and the information could be recovered in the slow path:

> add eax, ebx
> xchg eax, esi
> xchg ebx, edi
> mov [edx], eax
> add eax, ebx

Good analysis, I believe you're correct. Tracking which registers contain which values from a previous arithmetic instruction could also help optimise jmpcc, setcc and cmov, which are currently only handled when the two instructions are right next to each other (see ctx.last_instruction, gen_getzf, etc.). So I do think this is a worthwhile strategy, although it doesn't necessarily need to be implemented as part of this PR.
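
As a rough, self-contained sketch of that register-tracking idea (all names below are hypothetical and not taken from the v86 sources), one could propagate "which register currently holds the values needed to rebuild the lazy flags" through register-to-register moves and exchanges, and query it at each potentially faulting instruction:

```rust
// Hypothetical sketch: track which registers still hold the values needed to
// reconstruct the lazy flags of the last arithmetic instruction.

use std::collections::HashMap;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum Reg { Eax, Ecx, Edx, Ebx, Esp, Ebp, Esi, Edi }

#[derive(Clone, Copy, PartialEq, Eq)]
enum Value { LastResult, LastOp2 } // enough to rebuild the flags of an `add`

#[derive(Default)]
struct FlagSourceTracker {
    holds: HashMap<Reg, Value>,
}

impl FlagSourceTracker {
    // e.g. `add eax, ebx`: eax now holds the result, ebx the second operand.
    fn on_add(&mut self, dest: Reg, src: Reg) {
        self.holds.clear();
        self.holds.insert(dest, Value::LastResult);
        self.holds.insert(src, Value::LastOp2);
    }

    // `mov dest, src` (register to register): dest now holds whatever src held.
    fn on_mov(&mut self, dest: Reg, src: Reg) {
        match self.holds.get(&src).copied() {
            Some(v) => { self.holds.insert(dest, v); }
            None => { self.holds.remove(&dest); }
        }
    }

    // `xchg a, b`: the tracked values swap places.
    fn on_xchg(&mut self, a: Reg, b: Reg) {
        let va = self.holds.remove(&a);
        let vb = self.holds.remove(&b);
        if let Some(v) = vb { self.holds.insert(a, v); }
        if let Some(v) = va { self.holds.insert(b, v); }
    }

    // A register overwritten with an untracked value loses whatever it held.
    fn on_overwrite(&mut self, dest: Reg) {
        self.holds.remove(&dest);
    }

    // At a potentially faulting instruction: can the slow path still find both
    // values somewhere, so the flags could be reconstructed there?
    fn recoverable(&self) -> bool {
        let has = |v| self.holds.values().any(|&x| x == v);
        has(Value::LastResult) && has(Value::LastOp2)
    }
}

// The second example above: after the two xchgs, the result lives in esi and
// the second operand in edi, so the state is still recoverable at the mov.
fn main() {
    let mut t = FlagSourceTracker::default();
    t.on_add(Reg::Eax, Reg::Ebx);  // add eax, ebx
    t.on_xchg(Reg::Eax, Reg::Esi); // xchg eax, esi
    t.on_xchg(Reg::Ebx, Reg::Edi); // xchg ebx, edi
    assert!(t.recoverable());      // mov [edx], eax may fault; flags still derivable
}
```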

> Maybe a flag to ignore corner cases (like enable_optimistic_optimizations) would add more value for people looking to run "well behaved" code.

Not a fan of this. I don't know if I would enable it for copy.sh/v86, because whenever some OS doesn't work, I would need to turn off the flag to verify. I think the aforementioned profiling to determine how often a memory instruction prevents flags from being optimised could help inform the decision.

@copy mentioned this pull request Oct 7, 2021
Repository owner deleted a comment from micwoj92 Jul 19, 2022