
Implement fast tier-1 JIT compiler #283

Open
jserv opened this issue Nov 30, 2023 · 11 comments

@jserv
Contributor

jserv commented Nov 30, 2023

To achieve the goal of integrating a portable JIT compiler (#81), we plan to implement a multi-tier compilation framework. We will start with the Fast JIT tier and then transition to LLVM JIT (ORC lazy JIT) to achieve both quick startup and improved performance.

Following an approach similar to V8's JIT tiering, we will employ backend threads (#239) with eager compilation. This means that once the profiling data indicates a shift from the interpreter to the JIT (#189), all RISC-V instructions will be compiled by the Fast JIT. We wait for that compilation to finish, then instantiate the runtime module and execute the RISC-V instructions natively. Concurrently, we initiate the LLVM JIT to compile the same instructions and progressively transition execution to the LLVM-compiled functions.
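
A minimal sketch of this tiering flow, under assumed names (block_t, T1_THRESHOLD, t1_compile, enqueue_for_t2_compile, and the tier states are illustrative, not the actual rv32emu API):

#include <stdint.h>

struct riscv_internal; /* rv32emu core state */
typedef void (*native_fn)(struct riscv_internal *);

/* Illustrative tier bookkeeping for one translated block. */
typedef enum { TIER_INTERP, TIER_T1, TIER_T2 } tier_t;

typedef struct block {
    uint32_t pc;         /* guest entry address                     */
    uint32_t exec_count; /* profiling counter (#189)                */
    tier_t tier;
    native_fn code;      /* native entry point, or NULL             */
} block_t;

#define T1_THRESHOLD 4096u /* hypothetical hotness threshold */

/* Hypothetical hooks provided elsewhere. */
extern native_fn t1_compile(block_t *);
extern void enqueue_for_t2_compile(block_t *); /* backend thread (#239) */
extern void interpret_block(struct riscv_internal *, block_t *);

/* Called each time execution enters a block. */
static void profile_and_dispatch(struct riscv_internal *rv, block_t *blk)
{
    if (blk->tier == TIER_INTERP && ++blk->exec_count >= T1_THRESHOLD) {
        blk->code = t1_compile(blk); /* Fast JIT is quick: wait for it */
        blk->tier = TIER_T1;
        enqueue_for_t2_compile(blk); /* LLVM ORC compiles concurrently;
                                      * blk->code is swapped on finish */
    }
    if (blk->code)
        blk->code(rv);
    else
        interpret_block(rv, blk);
}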

Design goals:

  • Short compilation time and quick startup.
  • Good overall performance.
  • Small memory footprint.
  • Ease of supporting additional targets: typically only the codegen needs to be re-implemented.
  • Long-term objective: Tier-up to LLVM JIT.

@jserv
Contributor Author

jserv commented Dec 4, 2023

Adopting an interpreter-first approach, where the system starts with an interpreter and switches to JIT compilation upon identifying performance bottlenecks, can influence the security landscape. Here are some points on how this approach may affect security:

  1. Reduced Attack Surface at Startup: Initially running code in an interpreter reduces the attack surface related to JIT compilation. Since JIT compilers often involve more complex operations and direct interaction with low-level machine code, starting with an interpreter can limit early exposure to these risks.
  2. Controlled Transition to JIT Compilation: By only transitioning to JIT compilation for performance-critical sections, we can more closely monitor and control the parts of code that are compiled. This allows for a more targeted approach to security, focusing on the smaller subset of code that is JIT-compiled.
  3. Delay in Exposure to JIT-related Vulnerabilities: Since JIT compilation is deferred, any vulnerabilities specific to the JIT process (like JIT spraying attacks) are also deferred. This delay could be advantageous in reducing the window of opportunity for certain types of attacks.
  4. Potential for New Vulnerabilities: The mechanisms used to determine when to switch from interpretation to JIT compilation might introduce new vulnerabilities. For instance, if an attacker can influence this decision-making process, they might trigger JIT compilation in a way that is advantageous for exploiting vulnerabilities.

Balancing performance, functionality, and security remains a key consideration in this approach. Here are some academic papers that emphasize interpreter-first approaches or related concepts in JIT compilation:

@jserv
Contributor Author

jserv commented Dec 16, 2023

After the code check-in for the tier-1 JIT compiler based on the x86-64 architecture (#289), the following action items have been identified to complete our goals:

  • Refine the newly added jit-x64.[ch] files to decouple the components common to future Arm64 backend support;
  • Implement jit-arm64.[ch] and revise rv32_template.c to provide AArch64 support;
  • Evaluate and implement basic optimizations, such as register allocation (#330) and peephole optimization, to improve the performance of the tier-1 JIT compiler.

@qwe661234
Collaborator

qwe661234 commented Dec 21, 2023

In the implementation of jit-arm64.[ch], our objective is to harmonize the Arm64 API with that of x64. This lets us reuse the existing x64 JIT template instead of creating a new one.

For example, here are the implementations of the 'addi' instruction for x64 and Arm64. As they demonstrate, once the API is harmonized, the only difference lies in the temporary register used: RAX on x64 versus R6 on Arm64.

  • x64
/* Load X[rs1] into a temporary, add the immediate, store to X[rd]. */
emit_load(state, S32, platform_parameter_registers[0], RAX, offsetof(struct riscv_internal, X) + 4 * ir->rs1);
emit_alu32_imm32(state, 0x81, 0, RAX, ir->imm);
emit_store(state, S32, RAX, platform_parameter_registers[0], offsetof(struct riscv_internal, X) + 4 * ir->rd);
  • Arm64
emit_load(state, S32, platform_parameter_registers[0], R6, offsetof(struct riscv_internal, X) + 4 * ir->rs1);
emit_alu32_imm32(state, 0x81, 0, R6, ir->imm);
emit_store(state, S32, R6, platform_parameter_registers[0], offsetof(struct riscv_internal, X) + 4 * ir->rd);

We can simply rewrite them to reuse the JIT template:

#if defined(__x86_64__)
#define TEMP_REG RAX
#else /* __aarch64__ */
#define TEMP_REG R6
#endif
emit_load(state, S32, platform_parameter_registers[0], TEMP_REG, offsetof(struct riscv_internal, X) + 4 * ir->rs1);
emit_alu32_imm32(state, 0x81, 0, TEMP_REG, ir->imm);
emit_store(state, S32, TEMP_REG, platform_parameter_registers[0], offsetof(struct riscv_internal, X) + 4 * ir->rd);

@jserv
Contributor Author

jserv commented Dec 21, 2023

We are essentially developing a basic tool akin to LLVM TableGen, which lets developers define the fundamental concepts once and realize them through custom backends tailored for JIT code generation.

@jserv
Contributor Author

jserv commented Dec 25, 2023

After the code check-in for the tier-1 JIT compiler based on the x86-64 architecture (#289), the following action items had been identified to complete our goals:

  • Refine the newly added jit-x64.[ch] files to decouple the components common to future Arm64 backend support;
  • Implement jit-arm64.[ch] and revise rv32_template.c to provide AArch64 support;

The JIT backend for Arm64 has now been implemented (#304), and the built-in domain-specific language (DSL) has undergone refinements. These enhancements enable straightforward descriptions for code generation, facilitating the on-the-fly creation of machine code for both x86-64 and Arm64 architectures.

@jserv
Contributor Author

jserv commented Dec 25, 2023

The tier-1 JIT compiler is essentially a single-pass compiler, focused on speed and simplicity of compilation rather than complex analysis. In a single pass, it does not construct a graph-based intermediate representation of the code. Instead, it generates code for a limited number of instructions at a time, utilizing minimal context from preceding instructions. Despite its apparent simplicity, this method of compilation presents its own set of challenges in terms of accurate and efficient implementation. Nevertheless, we are exploring basic optimizations, such as:

  • Register Allocation: By tracking register occupancy for each slot through abstract values, the compiler can often bypass code for local accesses, mostly updating the abstract state instead.
  • Constant-Folding: When abstract values track constants, the compiler can evaluate side-effect-free instructions at compile time, leading to the generation of more constants.
  • Branch-Folding: If abstract values track constants, branches with constant conditions can be eliminated or transformed into unconditional jumps.
  • Instruction Selection: Tracking register occupancy through abstract values enables the compiler to choose between memory and register addressing modes. Similarly, tracking constants allows the use of immediate-mode instructions like ADDI.
  • Peephole Optimization: By looking ahead at one or more instructions, the compiler can merge multiple instructions, such as combining a compare and a branch.

This approach to JIT compilation seeks a balance between the need for rapid code generation and the opportunity to apply foundational optimizations.
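
As a rough illustration of the abstract-value idea behind the register-allocation and constant-folding items above (types and helpers here are hypothetical, not the actual T1C code): each VM register slot carries an abstract state, and a side-effect-free instruction whose inputs are known constants updates only that state, emitting no machine code.

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical abstract value attached to each RISC-V register slot. */
typedef struct {
    bool is_const;   /* known at compile time?          */
    uint32_t konst;  /* the constant, when is_const     */
    int host_reg;    /* host register holding it, or -1 */
} absval_t;

static absval_t regs[32];

extern void emit_addi(int rd, int rs1, int32_t imm); /* hypothetical */

/* Compile "addi rd, rs1, imm": if rs1 holds a known constant, fold it
 * at compile time; only the abstract state changes. */
static void gen_addi(int rd, int rs1, int32_t imm)
{
    if (rd == 0)
        return; /* x0 is hardwired to zero */
    if (regs[rs1].is_const) {
        regs[rd].is_const = true;
        regs[rd].konst = regs[rs1].konst + (uint32_t) imm;
        regs[rd].host_reg = -1; /* materialize into a register lazily */
        return;                 /* folded: no machine code emitted    */
    }
    regs[rd].is_const = false;
    emit_addi(rd, rs1, imm); /* general case: emit load/add/store */
}

Branch-folding follows the same pattern: when a branch's operands are both constant in the abstract state, the branch is resolved at compile time instead of emitted.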

Meanwhile, I plan to discuss and confirm the feasibility of the aforementioned optimizations with @vacantron and @qwe661234 later.

Reference: Whose baseline compiler is it anyway?

@jserv
Contributor Author

jserv commented Dec 30, 2023

A hybrid JIT combines aspects of both a compiler and an interpreter. In such a system, a single method can be executed partly through interpretation and partly through compilation. This approach leverages the fact that most programs spend the majority of their time executing a small fraction of their code. Therefore, fully compiling every method isn't necessary for enhanced execution speed; focusing on the most frequently executed code segments is often enough.

Take, for example, a method where the bulk of execution time is spent in a short loop that runs many times. A hybrid JIT would compile just this loop while interpreting the rest of the method. This strategy is particularly beneficial in embedded systems for two reasons. First, it reduces the memory needed to store native code, since the entire method isn't compiled. Second, it shortens compilation time, bringing the system closer to soft-real-time behavior.

For instance, consider a method that initializes and then calculates the dot product of two vectors. In such a method, the two for-loops might be the only parts compiled into native code. If the complete method contains a large number of bytecode instructions, but each loop only has a few, this hybrid approach can significantly reduce the compilation workload and the size of the resulting native code. Crucially, it does this while still optimizing the most critical parts of the method.

This method of JIT compilation thus offers a balance between performance efficiency and resource utilization, making it especially suitable for systems where memory and processing power are at a premium.
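
A minimal sketch of this loop-granularity tiering, under assumed names (lookup_compiled, compile_loop, HOT_THRESHOLD, and the other hooks are hypothetical, not an existing API): the interpreter counts taken backward branches, hands only the hot loop body to the compiler, and the dispatch loop enters native code whenever the PC lands on a compiled region.

#include <stdbool.h>
#include <stdint.h>

struct vm;                                  /* guest machine state */
typedef uint32_t insn_t;
typedef uint32_t (*native_fn)(struct vm *); /* returns the exit PC */

#define HOT_THRESHOLD 64u /* hypothetical per-loop hotness threshold */

/* Hypothetical hooks provided elsewhere. */
extern native_fn lookup_compiled(uint32_t pc);
extern insn_t fetch(struct vm *, uint32_t pc);
extern bool is_backward_branch(insn_t);
extern bool branch_taken(struct vm *, insn_t);
extern uint32_t *loop_counter(uint32_t pc);
extern uint32_t branch_target(insn_t, uint32_t pc);
extern uint32_t get_pc(struct vm *);
extern void set_pc(struct vm *, uint32_t);
extern void compile_loop(struct vm *, uint32_t head, uint32_t tail);
extern void execute(struct vm *, insn_t);

void interp_run(struct vm *vm)
{
    for (;;) {
        uint32_t pc = get_pc(vm);
        native_fn fn = lookup_compiled(pc);
        if (fn) {               /* hot loop already compiled: run it */
            set_pc(vm, fn(vm)); /* native code returns the exit PC   */
            continue;
        }
        insn_t insn = fetch(vm, pc);
        if (is_backward_branch(insn) && branch_taken(vm, insn) &&
            ++*loop_counter(pc) >= HOT_THRESHOLD)
            compile_loop(vm, branch_target(insn, pc), pc); /* loop only */
        execute(vm, insn); /* everything else stays interpreted */
    }
}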
[Figure: A Sample Java Program]

Reference: A Small Hybrid JIT for Embedded Systems

@jserv
Contributor Author

jserv commented Dec 30, 2023

Blocked by #297, #308, #189, #330, #347, #365

@jserv
Contributor Author

jserv commented Dec 30, 2023

[Figure: JIT Dynamic Binary Translation Flow]

The instruction set simulator identifies certain code segments as hotspots and converts them into native machine code following a specific sequence of steps. Each block of code is mapped to a function identified by its address. During this process, every instruction is transformed into native code that semantically aligns with the original, thereby accurately reflecting the architectural state of the processor. This approach ensures that the compiled code maintains fidelity to the original instruction set while leveraging the efficiency of native execution.
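
The "block mapped to a function identified by its address" part could look like the following sketch (a hypothetical direct-mapped cache, not the emulator's actual data structure):

#include <stdint.h>
#include <stddef.h>

struct riscv_internal;
typedef uint32_t (*native_fn)(struct riscv_internal *);

/* Hypothetical direct-mapped block cache: guest PC -> native code. */
#define CACHE_SIZE 1024 /* power of two */

static struct {
    uint32_t pc;
    native_fn fn;
} cache[CACHE_SIZE];

static native_fn cache_lookup(uint32_t pc)
{
    /* RV32 instructions are at least 2-byte aligned; drop the low bit
     * before indexing. */
    uint32_t idx = (pc >> 1) & (CACHE_SIZE - 1);
    return (cache[idx].pc == pc) ? cache[idx].fn : NULL;
}

static void cache_insert(uint32_t pc, native_fn fn)
{
    uint32_t idx = (pc >> 1) & (CACHE_SIZE - 1);
    cache[idx].pc = pc; /* evict-on-collision keeps lookup O(1) */
    cache[idx].fn = fn;
}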

Source: Cycle-accurate performance modelling in an ultra-fast just-in-time dynamic binary translation instruction set simulator

@vacantron
Collaborator

There are some topics that we can delve into further:

  1. Register Allocation
    The current register allocator uses a simple mechanism: it picks a candidate host register and maps it to a VM register. To improve performance, we can use a linear-scan or graph-coloring algorithm to find better candidates and reduce register swapping.

  2. Constant Optimization / Instruction Selection
    x86-64 and AArch64 provide instructions that can move a constant directly to memory. With these, some RISC-V lui/auipc/addi/sw sequences can be eliminated, which reduces register pressure (see the sketch after this list). Better instruction selection can also exploit more of the host architecture's potential.

  3. Interprocedural Optimization (still under research)
    At the beginning/end of a block, we usually have to restore/store the register context from/to the stack. The overhead can be significant when there are many small blocks. Interprocedural (across-block) analysis and optimization could help us inline some blocks and reduce these stack operations.
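
For item 2, here is a minimal sketch of the intended folding (all names are hypothetical, not the current T1C API; on x86-64, emit_store_imm32 could assemble to MOV r/m32, imm32, opcode C7 /0): once the abstract state records that the source register holds a known constant, the lui/addi pair need not be emitted at all, and the sw becomes a single host instruction.

#include <stdbool.h>
#include <stdint.h>

struct jit_state;

/* Hypothetical abstract state (see the constant-folding sketch above). */
typedef struct {
    bool is_const;
    uint32_t konst;
} absval_t;

extern absval_t regs[32];

/* Hypothetical emitters. */
extern void emit_store_imm32(struct jit_state *, int base_host_reg,
                             int32_t off, uint32_t imm); /* x86-64: C7 /0 */
extern void emit_store_reg(struct jit_state *, int src_host_reg,
                           int base_host_reg, int32_t off);
extern int materialize(struct jit_state *, int vm_reg); /* -> host reg */

/* Compile "sw rs2, off(rs1)": a constant store needs no scratch
 * register at all on x86-64. */
static void gen_sw(struct jit_state *state, int rs1, int rs2, int32_t off)
{
    int base = materialize(state, rs1);
    if (regs[rs2].is_const)
        emit_store_imm32(state, base, off, regs[rs2].konst);
    else
        emit_store_reg(state, materialize(state, rs2), base, off);
}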

@qwe661234
Collaborator

After investigating the difference in dynamic instruction count (DIC) between riscv32-unknown-elf-gcc and native x64 gcc, we observed that the most significant gap is caused by the powerful Intel instruction rep movsb. This observation inspires us to find an approach for tracing that pattern in the tier-1 JIT compiler and replacing the generated host instructions with this more powerful one (see the sketch below the table).

Metric        RISC-V DIC      x64 DIC        Ratio (RISC-V / x64)
string sort   61,459,066,578  3,755,814,502  16.36
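
A rough sketch of what such pattern replacement might look like in the T1C (the matcher and emit helpers are hypothetical): if a decoded block matches a canonical byte-copy loop, the backend emits a single rep movsb (encoding F3 A4) with RSI/RDI/RCX holding source, destination, and length, instead of translating the loop instruction by instruction.

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct jit_state;
struct block;

/* Hypothetical matcher output: the VM registers acting as source
 * pointer, destination pointer, and byte count in the copy loop. */
struct copy_loop {
    int src_vm_reg, dst_vm_reg, len_vm_reg;
};

extern bool match_byte_copy_loop(const struct block *, struct copy_loop *);
extern void load_into_host(struct jit_state *, int host_reg, int vm_reg);
extern void emit_bytes(struct jit_state *, const uint8_t *, size_t);

enum { RCX = 1, RSI = 6, RDI = 7 }; /* x86-64 register encodings */

/* If the block is a canonical byte-copy loop (lb/sb/addi/bne), emit a
 * single "rep movsb" instead of JITing the loop body. */
static bool try_emit_memcpy(struct jit_state *state, const struct block *blk)
{
    struct copy_loop cl;
    if (!match_byte_copy_loop(blk, &cl))
        return false;
    load_into_host(state, RSI, cl.src_vm_reg); /* source      */
    load_into_host(state, RDI, cl.dst_vm_reg); /* destination */
    load_into_host(state, RCX, cl.len_vm_reg); /* byte count  */
    emit_bytes(state, (const uint8_t *) "\xf3\xa4", 2); /* rep movsb */
    return true;
}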
