Optimization Thoughts

dynamic recompiler CPU pipeline stalling?

I have not seen any code in the dynamic recompiler that tries to keep the CPU pipeline from stalling on memory loads. As the R4300 has many more registers (32 + 32 FP) than the Pi's ARM core, I would expect the dynamic code to contain a lot more register loads/saves (including LDM/STM for stack use). If these are not carefully scheduled, the processor will keep stalling the pipeline and waste time.

Requirements: Dump the generated ARM assembler and analyze it. Modify the dynamic recompiler to avoid CPU stalls.
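As a rough idea of what the analysis step could look for, here is a minimal sketch that counts load-use hazards in a dumped instruction stream. The `insn_t` structure and `count_load_use_hazards` are hypothetical names invented for this note, not part of the recompiler, and the one-instruction window is an assumption: the real load latency depends on the ARM core.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of one emitted ARM instruction. */
typedef struct {
    int is_load;   /* non-zero for LDR-style instructions */
    int dest;      /* register written, or -1 if none */
    int src[2];    /* registers read, -1 for unused slots */
} insn_t;

/* Count instructions that read a register loaded by the instruction
 * immediately before them (assumed one-instruction load-use window). */
size_t count_load_use_hazards(const insn_t *code, size_t n)
{
    size_t hazards = 0;

    assert(code != NULL || n == 0);
    for (size_t i = 1; i < n; i++) {
        const insn_t *prev = &code[i - 1];

        if (!prev->is_load || prev->dest < 0)
            continue;
        if (code[i].src[0] == prev->dest || code[i].src[1] == prev->dest)
            hazards++;
    }
    return hazards;
}
```

A high count relative to the number of loads would suggest that reordering independent instructions between a load and its first use is worth trying.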

dynamic recompiled code or GCC

The dynamic recompiler can handle self-modifying code. However, if this functionality is never used, it would be interesting to dump the generated ARM assembler to a file and compile it into a shared library. This could help determine how optimized the dynamically recompiled code is.

Requirements: There is a function in dynarec.c to invalidate compiled blocks; a check needs to be done to see whether it is called after the initial recompile. The assembler needs to be dumped, together with some supporting code to build a shared library. The internal callback functions need to be made global.

Update: Placing branches into an assembler file is difficult, as labels need to be 'inserted' at the appropriate places to serve as branch targets. Using LDR to force a value into the PC would not help, because any optimization that changes the code layout would break the hard-coded addresses.
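One way to approach the label problem is a first pass that records which instructions are branch targets before anything is written out. The sketch below assumes branch targets are available as instruction indices within a block; `mark_branch_labels` and its parameters are hypothetical and do not correspond to anything in dynarec.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mark every instruction index that is the target of a branch, so a label
 * can be printed in front of it when the block is dumped.
 * targets[] holds instruction indices (assumed representation);
 * needs_label[] must have room for n_insns entries.
 * Returns the number of distinct labels required. */
size_t mark_branch_labels(const size_t *targets, size_t n_targets,
                          unsigned char *needs_label, size_t n_insns)
{
    size_t labels = 0;

    assert(needs_label != NULL || n_insns == 0);
    if (n_insns > 0)
        memset(needs_label, 0, n_insns);

    for (size_t i = 0; i < n_targets; i++) {
        size_t t = targets[i];

        if (t >= n_insns)
            continue;            /* target outside this block: not handled here */
        if (!needs_label[t]) {
            needs_label[t] = 1;
            labels++;
        }
    }
    return labels;
}
```

The dump pass would then emit a label line before each marked instruction. Branches whose target is only known at runtime are not covered by this and would still need separate handling.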

caching and Translation Lookaside Buffers (TLBs) for R4300

Caching and TLBs for the R4300 are emulated within mupen64plus. Is it possible to remove this part of the emulator to reduce the processing load, given that mupen64plus can read directly from the emulated memory? Caching is only a mechanism to speed up memory access, so unless the R4300 OS/kernel uses the cache miss exception to do something unrelated, removing it shouldn't matter.

Requirements: Understand how delay slots + cache misses work

Update: The dynamic recompiler does not perform caching emulation and has code to make this transparent to TLBs. The dynarec documentation states that most games don't use virtual addressing anyway.

video-plugin on GPU

The video plugin uses around 40-50% of the processing time. If the plugin can be pushed onto the GPU, there should be a big performance improvement, potentially without having to frame skip.

Requirements: A reference (pointer) to the N64 memory space needs to be passed to the GPU. This would need to use CMA (the Linux Contiguous Memory Allocator) to reserve a contiguous block of physical memory. Hardware DMA can get virtual+physical address blocks, but not large enough for RDRAM. A plugin would need to be written for the GPU. There are examples of assembler that run on the GPU.

rewrite of dynamic compiler for arm+fpu only

The dynamic recompiler has been written so that it is partially generic (to cater for x86 and ARM). However, few processor-specific optimizations are performed when the dynamic code is written. This speeds up compile time, and on the Pi a 4KB chunk is processed in just a few milliseconds. For the m64p_test_rom, page invalidations in compiled code (e.g. from self-modifying code) only occur during start-up, so spending some more processor time to optimize the compiled code would improve gameplay.

So far I have found no evidence of the use of LDM/STM, 'Operand2' and LDR/STR 'offset write-back' to reduce the number of instructions in the dynamic code. Using them would require careful analysis of all jumps/branches in the code to ensure execution never jumps into the middle of a combined sequence. As the address of a jump is not always known until runtime, a mechanism to check and correct this would be required.

The recompiler appears to use a block of memory (address in R11 acting as FP) for storing the current values of the N64 registers. If the registers to be loaded are in order, then it is possible to load/store multiple sets in one instruction.
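To illustrate the "in order" condition, here is a minimal sketch that finds how many pending loads could be merged into a single LDM. It relies on the fact that LDM transfers ascending register numbers from ascending consecutive word addresses. The `reg_load_t` structure, `ldm_run_length`, and the 4-byte slot spacing are assumptions made for this example; the real register block layout may differ (N64 general-purpose registers are 64-bit).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one pending load from the register block. */
typedef struct {
    int host_reg;  /* ARM register number receiving the value */
    int offset;    /* byte offset of the slot from the base register */
} reg_load_t;

/* Length of the run beginning at 'start' that could be merged into one LDM:
 * host register numbers must strictly increase and slots must be
 * consecutive 4-byte words (assumed slot size). */
size_t ldm_run_length(const reg_load_t *loads, size_t n, size_t start)
{
    size_t len = 1;

    assert(loads != NULL && start < n);
    while (start + len < n &&
           loads[start + len].host_reg > loads[start + len - 1].host_reg &&
           loads[start + len].offset == loads[start + len - 1].offset + 4)
        len++;
    return len;
}
```

A run length of 1 means the load stays a plain LDR. Anything longer is a candidate for LDM, subject to the branch-target check described above.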
