Large-scale PowerPC recompiler rework #641

Draft: wants to merge 42 commits into main

Exzap
Contributor

@Exzap Exzap commented Jan 30, 2023

Disclaimer: This is work-in-progress. I'm opening this draft PR for visibility, so others can track progress and know not to alter recompiler code. Work started on this in November and the ETA for completion is somewhere in the span of the next few months, depending on my motivation.

Goals

I originally started work on the recompiler in 2014, and since then I have learned a lot more about state-of-the-art compiler and IR design. While I'm generally happy with the quality of our code translation, some of the design choices I made along the way make it hard to introduce further optimizations or fixes. A lot of the complexity currently sits in the x86-64 backend, which means all of it would have to be reimplemented when targeting another architecture.

Overall, the idea is to make both the front-end (PPC to IR) and the back-end (IR to x86-64) as "dumb" as possible so that all the complex logic can be shifted to operate on platform-independent IR, lowering the burden on platform-specific code.

State

Please do not report bugs yet. In fact, I don't recommend trying this out; it's an active construction site.

  • Reorganized file and folder structure to be more modular
  • Modernize C-style code to use C++ features where it makes sense
  • Fundamentally rework PPC basic block handler to be more flexible. Support non-contiguous functions and potentially allow for complex inlining
  • Support for bool-based jumps and bool registers instead of having PowerPC CR logic embedded into the IR
  • Allow PowerPC CR bits, SPRs and XER carry bit to reside in registers and participate in register allocation
  • Avoid complex instructions in the IR when they could be implemented using basic operations only
    • LSWI / STSWI
    • SRAW / SRAWI
    • BDNZ
    • LWARX / STWCX
    • ADDC and other arithmetic instructions with carry
    • DCBZ
    • MFCR / MTCRF
    • RLWIMI
    • SLW / SRW
  • Support typed registers. For now everything is either a 32bit integer or a 2x64bit paired single register
  • Switch floating-point logic over to the newer register allocator that is currently only used for integer registers
  • Support for calls to native code in arbitrary locations of the IR program. Currently calling external code is done hackily via macro instructions which need per-backend implementation
  • Rework floating-point register handling. This is a big chapter on its own and I'll expand on this once I get to it
  • Optimize! This includes bringing back optimizations lost with the restructuring as well as adding some new ones
    • Added a new dead code elimination pass
    • x86 specific: Conditional jumps will use eflags instead of emulating PPC CRx bits where possible
    • Fix loop detection and move register loads/stores out of loops where possible
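
To make the "basic operations only" item more concrete, here is a minimal sketch of how RLWIMI can be expressed with just a rotate, an AND and an OR. The helper names (`rotl32`, `ppcMask`, `rlwimi`) are made up for illustration and are not taken from the Cemu code base; the actual IML lowering may look quite different.

```cpp
#include <cstdint>

// Rotate left by n bits; masking the counts avoids an undefined 32-bit shift when n == 0.
inline uint32_t rotl32(uint32_t x, uint32_t n)
{
    return (x << (n & 31)) | (x >> ((32 - n) & 31));
}

// PowerPC MASK(MB, ME): bit 0 is the most significant bit; the mask wraps around if MB > ME.
inline uint32_t ppcMask(uint32_t mb, uint32_t me)
{
    uint32_t fromMb = 0xFFFFFFFFu >> mb;        // bits MB..31 set
    uint32_t upToMe = 0xFFFFFFFFu << (31 - me); // bits 0..ME set
    return (mb <= me) ? (fromMb & upToMe) : (fromMb | upToMe);
}

// RLWIMI rA, rS, SH, MB, ME: insert the rotated source under the mask, keep the rest of rA.
inline uint32_t rlwimi(uint32_t ra, uint32_t rs, uint32_t sh, uint32_t mb, uint32_t me)
{
    uint32_t mask = ppcMask(mb, me);
    return (rotl32(rs, sh) & mask) | (ra & ~mask);
}
```

Since SH, MB and ME are immediates, the mask is a compile-time constant for the recompiler, so the IR only needs a rotate, two ANDs and an OR.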

I know a lot of these are pretty abstract, so in the future I might add a few before-vs-after code examples to this text.

Q&A

Will this PR add ARM support?

No. But it will make adding a new target architecture a lot easier and if I am motivated enough I'll look into adding an aarch64 backend after this is done.

Will this make Cemu faster?

Maybe? After everything is done the recompiler should output faster code, but CPU execution speed generally isn't a bottleneck in Cemu so it's hard to predict whether there will be an actual difference.

What about the proposed plan to use LLVM?

I did quite a bit of research on that. The biggest downside is that LLVM is still quite JIT-unfriendly and comes with significant bloat. I'm not saying that it wouldn't work, but the cons outweigh the pros in my opinion. Plus, we already have a pretty sophisticated recompiler and it would be a waste to throw it away.
On a personal note, I enjoy working on custom solutions more than plugging in libraries, so it's easier for me to stay motivated and make progress. In terms of total effort, both solutions are about the same.

@Wunkolo

Wunkolo commented Jan 30, 2023

What would be the scope of changing the x64 emitter over to something like xbyak?

With the current x64 emitter, adding a new instruction or class of instructions means implementing the encoding for those instructions (REX, VEX, EVEX, ModR/M, SIB, etc.) from scratch, then implementing the particular instruction, and also detecting the relevant CPUID flags. This redundant work can probably just be pushed onto a proven library.
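
For a sense of what hand-written encoding involves, here is a minimal sketch that emits a single register-to-register `add` with a REX prefix and a ModR/M byte. The function name `emitAddReg64` and the register numbering convention are assumptions for this example; this is neither Cemu's emitter nor Xbyak's API.

```cpp
#include <cstdint>
#include <vector>

// Encodes `add dst, src` for two 64-bit general-purpose registers (0 = rax ... 15 = r15).
// Uses opcode 0x01 (ADD r/m64, r64) with a register-direct ModR/M byte.
inline void emitAddReg64(std::vector<uint8_t>& out, unsigned dst, unsigned src)
{
    unsigned rex = 0x48;             // 0100W000 with W = 1 (64-bit operand size)
    rex |= ((src >> 3) & 1) << 2;    // REX.R extends ModR/M.reg (the source here)
    rex |= (dst >> 3) & 1;           // REX.B extends ModR/M.rm (the destination here)
    unsigned modrm = 0xC0 | ((src & 7) << 3) | (dst & 7); // mod = 11: register operand
    out.push_back(static_cast<uint8_t>(rex));
    out.push_back(0x01);
    out.push_back(static_cast<uint8_t>(modrm));
}
```

Memory operands, SIB bytes, displacements and VEX/EVEX prefixes each add further cases on top of this, which is the work a premade emitter would take over.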

@Exzap
Contributor Author

Exzap commented Jan 30, 2023

Thanks for pointing out Xbyak, I wasn't aware of it. The assemblers I looked at were always a bit overkill for our purposes, usually focusing on a human-friendly API rather than a simple interface for machine-generated code. We only need a very thin emitter, and Xbyak seems to be exactly that.

As part of this rework I also started a new "cleaner" x86-64 high-performance emitter which I auto-generate from encoding tables. The effort for this is relatively minimal, but using a premade emitter would certainly cut down the effort even further. I'll think about it.

@amayra

amayra commented May 16, 2023

Did you drop this project?

@Exzap
Contributor Author

Exzap commented May 17, 2023

Nah just busy with other stuff. I'll get back to this eventually

@jcrm1 jcrm1 mentioned this pull request Sep 26, 2023
@iMonZ

iMonZ commented Sep 26, 2023

> Nah just busy with other stuff. I'll get back to this eventually

Thanks! ARM64 Support would make the CEMU emulator finally done and future proof!

@Wunkolo

Wunkolo commented Sep 26, 2023

On ARM64: I've been using oaknut on other projects. It is structured very similarly to xbyak.

@Gabezin64

This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

@Exzap
Contributor Author

Exzap commented Oct 13, 2023

> This will finally fix the lens flare issue in The Wind Waker HD and Twilight Princess HD?

That's a graphical issue. It's unaffected by this CPU rework.

Intermediate commit while I'm still fixing things but I didn't want to pile on too many changes in a single commit.
New:
Reworked PPC->IML converter to first create a graph of basic blocks and then turn those into IML segment(s). This was mainly done to decouple IML design from having PPC-specific knowledge like branch target addresses. The previous design also couldn't preserve cycle counting properly in all cases since it was based on IML instruction counting.
The new solution supports functions with a non-contiguous body. A pretty common example of this is a function that ends with a trailing B instruction to some other place.
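
As a rough illustration of the graph-first approach, the sketch below discovers basic block start addresses by following branch targets from the function entry, so code reached through a trailing B is picked up even when it is not adjacent to the rest of the function. The `Kind`, `Instr` and `Code` types and the function `findBlockStarts` are hypothetical simplifications for this example, not the converter's actual data structures.

```cpp
#include <cstdint>
#include <map>
#include <set>
#include <vector>

// Hypothetical, heavily simplified view of decoded PPC instructions.
enum class Kind { Other, Branch, CondBranch, Return };
struct Instr { Kind kind; uint32_t target; };
using Code = std::map<uint32_t, Instr>; // address -> instruction

// Walks the code from `entry` with a worklist and returns every basic block start address.
std::set<uint32_t> findBlockStarts(const Code& code, uint32_t entry)
{
    std::set<uint32_t> starts{entry};
    std::set<uint32_t> visited;
    std::vector<uint32_t> work{entry};
    auto enqueue = [&](uint32_t addr) {
        if (code.count(addr)) { starts.insert(addr); work.push_back(addr); }
    };
    while (!work.empty()) {
        uint32_t addr = work.back();
        work.pop_back();
        while (true) {
            auto it = code.find(addr);
            if (it == code.end() || !visited.insert(addr).second)
                break; // outside the known code or already processed
            const Instr& ins = it->second;
            if (ins.kind == Kind::Other) { addr += 4; continue; }
            if (ins.kind == Kind::Branch || ins.kind == Kind::CondBranch)
                enqueue(ins.target); // taken path starts a block
            if (ins.kind == Kind::CondBranch)
                enqueue(addr + 4);   // fall-through path starts a block
            break;                   // any control-flow instruction ends the block
        }
    }
    return starts;
}
```

A real converter would additionally record the edges between blocks and handle calls and indirect branches; the sketch only shows why address adjacency is no longer required.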

Current limitations:
- BL inlining not implemented
- MFTB not implemented
- BCCTR and BCLR are only partially implemented
Instead of having fixed macros for BCCTR/BCCTRL/BCLR/BCLRL we now have only one single macro instruction that takes the jump destination as a register parameter.
This also allows us to reuse an already loaded LR register (by something like MTLR) instead of loading it again from memory.

As a necessary requirement for this: The register allocator now has support for read operations in suffix instructions
Also removed associatedPPCAddress field from IMLInstruction as it's no longer used
Storing the condition result in a register instead of imitating PPC CR lets us simplify the backend a lot. Only implemented as PoC for BDZ/BDNZ so far.
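
A minimal sketch of what this can look like for BDNZ: the decrement, the comparison into a bool register and the conditional jump become three plain IR operations. The `Op` enum, `IrInstr` struct and the tiny `execute` interpreter are hypothetical and exist only for this example; they are not the actual IML definitions.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical miniature IR, not Cemu's actual IML definitions.
enum class Op { SubImm, CmpNeImm, CondJump };
struct IrInstr { Op op; int dst; int src; uint32_t imm; };

// BDNZ: decrement CTR, then branch if CTR != 0.
std::vector<IrInstr> lowerBDNZ(int ctrReg, int boolReg, uint32_t targetBlock)
{
    return {
        {Op::SubImm,   ctrReg,  ctrReg,  1},           // ctr = ctr - 1
        {Op::CmpNeImm, boolReg, ctrReg,  0},           // cond = (ctr != 0)
        {Op::CondJump, -1,      boolReg, targetBlock}, // jump to targetBlock if cond
    };
}

// Tiny interpreter for the sketch: returns true if the conditional jump is taken.
bool execute(const std::vector<IrInstr>& code, std::vector<uint32_t>& regs)
{
    for (const IrInstr& ins : code) {
        switch (ins.op) {
        case Op::SubImm:   regs[ins.dst] = regs[ins.src] - ins.imm; break;
        case Op::CmpNeImm: regs[ins.dst] = (regs[ins.src] != ins.imm) ? 1 : 0; break;
        case Op::CondJump: return regs[ins.src] != 0;
        }
    }
    return false;
}
```

None of the three operations knows anything about PowerPC condition registers, so a backend only has to implement a subtract, a compare and a jump on a boolean.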
Carry bit is now resident in a register-allocated GPR instead of being baked directly into IML instructions

All the PowerPC carry ADD* and SUB* instructions as well as SRAW/SRAWI have been reworked to use more generalized IML instructions for handling carry
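
As an example of carry handling with generic operations, here is a sketch of SRAWI built from logical shifts, masks and a comparison. The `ShiftResult` struct and the `srawi` helper are hypothetical names for illustration; the carry rule itself (CA is set when the source is negative and any 1-bits are shifted out) follows the PowerPC definition.

```cpp
#include <cstdint>

struct ShiftResult { uint32_t value; uint32_t carry; };

// SRAWI: arithmetic shift right by sh (0..31).
// CA is set only if the source is negative and at least one 1-bit was shifted out.
inline ShiftResult srawi(uint32_t rs, uint32_t sh)
{
    uint32_t sign = rs >> 31;
    uint32_t signFill = sign ? ~(0xFFFFFFFFu >> sh) : 0u; // replicate the sign into the top sh bits
    uint32_t shiftedOut = rs & ~(0xFFFFFFFFu << sh);      // the low sh bits that get dropped
    return { (rs >> sh) | signFill, (sign && shiftedOut != 0) ? 1u : 0u };
}
```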

IML instructions now support two named output registers instead of only one (easily extendable to arbitrary count)
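
One way to picture an operation with two outputs is an add that returns both the 32-bit sum and the carry-out as separate values. The sketch below uses hypothetical names (`AddResult`, `addWithCarry`, `addc`, `adde`, `subfc`) and only models the arithmetic; it does not reflect the real IML instruction layout.

```cpp
#include <cstdint>

struct AddResult { uint32_t sum; uint32_t carry; };

// Generic add with carry-in that yields two outputs: the sum and the carry-out.
inline AddResult addWithCarry(uint32_t a, uint32_t b, uint32_t carryIn)
{
    uint64_t wide = static_cast<uint64_t>(a) + b + carryIn;
    return { static_cast<uint32_t>(wide), static_cast<uint32_t>(wide >> 32) };
}

// PowerPC carry instructions expressed through the generic operation.
inline AddResult addc(uint32_t a, uint32_t b)              { return addWithCarry(a, b, 0); }
inline AddResult adde(uint32_t a, uint32_t b, uint32_t ca) { return addWithCarry(a, b, ca); }
inline AddResult subfc(uint32_t a, uint32_t b)             { return addWithCarry(~a, b, 1); } // b - a
```

With the carry as an ordinary output, the register allocator can keep it in a host register between an ADDC and a following ADDE.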
It's better to do it in a lowering pass so that the backend code can be kept as simple as possible
CR bits are now resident in registers instead of being baked into the instruction definitions. Same for XER SO, and LWARX reservation EA and value.
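
With CR bits held as individual values, an instruction like MFCR reduces to shifts and ORs over those bits. The following is a minimal sketch assuming the 32 CR bits are available as an array of booleans; `assembleCR` is a hypothetical helper, and the real implementation may track only a subset of the bits in registers at any time.

```cpp
#include <array>
#include <cstdint>

// Packs 32 individual CR bits into the architectural CR value.
// PowerPC numbers bits from the most significant end, so CR bit 0 lands in bit 31 of the word.
inline uint32_t assembleCR(const std::array<bool, 32>& crBits)
{
    uint32_t cr = 0;
    for (uint32_t i = 0; i < 32; ++i)
        cr |= static_cast<uint32_t>(crBits[i]) << (31 - i);
    return cr;
}
```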

Reworked LWARX/STWCX, CRxx ops, compare and branch instructions. As well as RC bit handling. Not all CR-related instructions are reimplemented yet.

Introduced atomic_cmp_store operation to allow implementing STWCX in architecture agnostic IML
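
A sketch of the idea, assuming LWARX records the reservation address and the loaded value and STWCX then performs a compare-and-store against that value. The `Reservation` struct and the `lwarx`/`stwcx` helpers are hypothetical, use `std::atomic` in place of guest memory, and ignore guest endianness; they are not the IML operation itself.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical reservation state: the reserved address and the value seen by LWARX.
struct Reservation { const std::atomic<uint32_t>* ea = nullptr; uint32_t value = 0; };

// LWARX: load the word and remember where it came from and what it contained.
inline uint32_t lwarx(Reservation& res, std::atomic<uint32_t>& mem)
{
    res.ea = &mem;
    res.value = mem.load();
    return res.value;
}

// STWCX: store only if the reservation matches and memory still holds the reserved value.
// The return value corresponds to CR0.EQ (true = store performed).
inline bool stwcx(Reservation& res, std::atomic<uint32_t>& mem, uint32_t newValue)
{
    bool ok = false;
    if (res.ea == &mem) {
        uint32_t expected = res.value;
        ok = mem.compare_exchange_strong(expected, newValue);
    }
    res.ea = nullptr; // the reservation is consumed either way
    return ok;
}
```

A value-based comparison like this cannot detect a write that restores the original value in between (the ABA case), which real reservation hardware would catch; the sketch accepts that difference.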

Removed legacy CR-based compare and jump operations
Also implement PPC NAND instruction
Additionally there is no more range limit for virtual RegIDs, making the entire uint16 space available in theory