

# FULL-SYSTEM SIMULATION OF JAVA WORKLOADS WITH RISC-V AND THE JIKES RESEARCH VIRTUAL MACHINE

Martin Maas Krste Asanovic John Kubitowicz

1<sup>st</sup> CARRV Workshop, October 14, 2017



Berkeley  
Architecture  
Research



# Managed Languages

Java, PHP, C#,  
Python, Scala



JavaScript,  
WebAssembly



Java, Swift,  
Objective-C



Servers

Web Browser

Mobile

# Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century

CACM  
08/2008

By Stephen M. Blackburn, Kathryn S. McKinley, Robin Garner, Chris Hoffmann, Asjad M. Khan, Rotem Bentzur, Amer Diwan, Daniel Feinberg, Daniel Frampton, Samuel Z. Guyer, Martin Hirzel, Antony Hosking, Maria Jump, Han Lee, J. Eliot B. Moss, Aashish Phansalkar, Darko Stefanović, Thomas VanDrunen, Daniel von Dincklage, and Ben Wiedermann

## Abstract

Evaluation methodology underpins all innovation in experimental computer science. It requires relevant *workloads*, appropriate *experimental design*, and rigorous *analysis*.

Unfortunately, methodology is not keeping pace with the changes in our field. The rise of managed languages such as Java, C#, and Ruby in the past decade and the imminent rise of commodity multicore architectures for the next decade pose new methodological challenges that are not yet widely understood. This paper explores the consequences of our collective inattention to methodology on innovation,

Many developers today choose managed languages, which provide: (1) memory and type safety, (2) automatic memory management, (3) dynamic code execution, and (4) well-defined boundaries between type-safe and unsafe code (e.g., JNI and PInvoke). Many such languages are also object-oriented. Managed languages include Java, C#, Python, and Ruby. C and C++ are not managed languages; they are compiled-ahead-of-time, not garbage collected, and unsafe. Unfortunately, managed languages add at least three new degrees of freedom to experimental evaluation: (1) a *space-time trade-off* due to garbage collection, in which heap size is a control vari-

# ISCA-2017 Proceedings

**54** papers in total



**Evaluated** using  
managed-language  
workloads

**Not evaluated**  
using managed-  
language workloads

# Managed Language Challenges



Long-Running  
on Many Cores



Concurrent  
Tasks (GC, JIT)



Fine-grained  
Interactions

# Limitations of Simulators



Qemu

## High-performance Emulation

Cannot account for fine-grained details  
(e.g., barrier delays of ~10 cycles)



## Cycle-accurate Simulation

Too slow to run large-scale Java workloads

Realism

# Limitations of Simulators



Realism



Industry Adoption





Run managed workloads on real RISC-V  
hardware in FPGA-based simulation to  
enable modifying the entire stack

# Two Pieces of Infrastructure



?

Easy-to-modify  
Hardware



?

Easy-to-modify  
Runtime System

# Two Pieces of Infrastructure



Rocket Chip  
Ecosystem



Easy-to-modify  
Runtime System

# Two Pieces of Infrastructure



Rocket Chip  
Ecosystem

Jikes Research  
Virtual Machine

# Talk Outline

## 1. Porting JikesRVM to RISC-V

Lessons learnt porting JikesRVM to RISC-V

## 2. Running JikesRVM on RocketChip

Demo of JikesRVM running on real RISC-V hardware

## 3. Managed-Language Use Cases

New research that is enabled by this infrastructure

# PART I

## 1. Porting JikesRVM to RISC-V

Lessons learnt porting JikesRVM to RISC-V

## 2. Running JikesRVM on RocketChip

Demo of JikesRVM running on real RISC-V hardware

## 3. Managed-Language Use Cases

New research that is enabled by this infrastructure

# JikesRVM on RISC-V

- 15,000 lines of code in 86 files to port the non-optimizing baseline compiler
- Runs full JDK6 applications, including the Dacapo benchmark suite (no JDK7)
- Passes JikesRVM core test suite

# The Jikes Research VM



# The Jikes Research VM



# The Jikes Research VM



# Porting The Jikes Research VM

# The Environment: Yocto

```
http://yoctoproject.org/documentation
```

For more information about OpenEmbedded see their website:  
<http://www.openembedded.org/>

You had no conf/bblayers.conf file. The configuration file has been created for you with some default values. To add additional metadata layers into your configuration please add entries to this file.

The Yocto Project has extensive documentation about OE including a reference manual which can be found at:  
<http://yoctoproject.org/documentation>

For more information about OpenEmbedded see their website:  
<http://www.openembedded.org/>

```
### Shell environment set up for builds. ###
```

You can now run 'bitbake <target>'

```
maas@a6:/scratch/maas/poky/demo/riscv-poky/build$ bitbake  
Parsing recipes:  29% !#####
```

Install the resulting  
SDK image

Clone riscv-poky ([GitHub: riscv/riscv-poky](#))  
and run: **bitbake meta-toolchain**

To add missing dependencies, edit: *meta-riscv/recipes-core/images/core-image-riscv.bb*

```
martin@virtualbox:~/src/riscv/toolchain$ ./poky-riscv-glibc-x86_64-meta-toolchain-riscv64-toolchain-2.0+snapshot.sh  
Poky (Yocto Project Reference Distro) SDK installer version 2.0+snapshot  
=====  
Enter target directory for SDK (default: /opt/poky-riscv/2.0+snapshot): ~/src/riscv/toolchain/sdk  
You are about to install the SDK to "/home/martin/src/riscv/toolchain/sdk". Proceed[Y/n]? Y  
Extracting SDK.....done  
Setting it up...done  
SDK has been successfully set up and is ready to be used.  
Each time you wish to use the SDK in a new shell session, you need to source the environment setup script e.g.  
$ . /home/martin/src/riscv/toolchain/sdk/environment-setup-riscv64-poky-linux  
martin@virtualbox:~/src/riscv/toolchain$ . /home/martin/src/riscv/toolchain/sdk/environment-setup-riscv64-poky-linux
```

# Assembler Port

Don't write the assembler by hand, use **riscv/riscv-opcodes** repository and generate it!

```
def make_jikesrv():
    for name in namelist:
        args = arguments[name]

    funargs = []
    substargs = []

    register_order = ['rd', 'rs1', 'rs2', 'rs3']

    # loads/stores have different order of registers
    if name in ['lb', 'lh', 'lw', 'ld', 'flw', 'fld', 'sb', 'sh', 'sw', 'sd', 'fsw', 'fsd']:
        register_order = ['rd', 'rs2', 'rs1']

    fcvt_float2int = []
    fcvt_int2float = []
    fcmp = []

    for f in ['d', 's']:
        for i in ['l', 'w', 'lu', 'wu']:
            fcvt_float2int.append('fcvt.%s.%s' % (i,f))
            fcvt_int2float.append('fcvt.%s.%s' % (f,i))

    for instr in ['fle', 'fit', 'feq']:
        fcmp.append('%s.%s' % (instr, f))

    for r in register_order:
        if r in args:
            if name.startswith('f'):
                if name in ['flw', 'fld', 'fsw', 'fsd'] and r == 'rs1':
                    funargs.append('GPR ' + r)
                elif name in fcvt_float2int and r == 'rd':
                    funargs.append('GPR ' + r)
                elif name in fcvt_int2float and r == 'rs1':
                    funargs.append('GPR ' + r)
```



```
public final void emitLH(GPR rd, GPR rs1, int imm12) {
    int mi = 0x1003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}

public final void emitLW(GPR rd, GPR rs1, int imm12) {
    int mi = 0x2003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}

public final void emitLD(GPR rd, GPR rs1, int imm12) {
    int mi = 0x3003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}

public final void emitLBU(GPR rd, GPR rs1, int imm12) {
    int mi = 0x4003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}

public final void emitLHU(GPR rd, GPR rs1, int imm12) {
    int mi = 0x5003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}

public final void emitLWU(GPR rd, GPR rs1, int imm12) {
    int mi = 0x6003 | rd.value() << 7 | rs1.value() << 15 | make_imm12(imm12);
    mIP++;
    mc.addInstruction(mi);
}
```

# Baseline JIT Compiler

Call into assembler

Complex bytecodes  
are implemented in  
software

```
@Override
protected final void emit_multianewarray(TypeReference typeRef, int dimensions) {
    asm.emitLDtoc(T0, ArchEntrypoints.newArrayArrayMethod.getOffset());
    asm.emitLVAL(A0, method.getId());
    asm.emitLVAL(A1, dimensions);
    asm.emitLVAL(A2, typeRef.getId());
    asm.emitADDI(A3, FP, spTopOffset);           // offset from FP to expression stack top
    asm.emitSLLI(T1, A1, LOG_BYTES_IN_ADDRESS); // number of bytes of array dimension args
    asm.emitADD(A3, T1, A3);                   // offset from FP to expression stack top
    asm.emitJALR(RA, T0, 0);
    discardSlots(dimensions);
    pushAddr(A0);
}

@Override
protected final void emit_arraylength() {
    popAddr(T0);
    asm.emitLWoffset(T1, T0, ObjectModel.getArrayLengthOffset());
    pushInt(T1);
}

@Override
protected final void emit_athrow() {
    asm.emitLDtoc(T0, Entrypoints.athrowMethod.getOffset());
    peekAddr(A0, 0);
    asm.emitJALR(RA, T0, 0);
}
```

# Debugging

```
---- 0x3541ff7c (Opcode: getstatic, Stack: 72)
|-> 0x32fcecd0 [72] -> 0x0 0x0 0x32f7dda8 0x200002e9a8 0x32fc当地 0x3541d7e8 0xb5b 0x
0x3541ff8c    lui      t0, 0x1
0x3541ff90    xori      t0, t0, 368
0x3541ff94    add      t0, t0, gp
0x3541ff98    ld       t0, 0(t0)
0x3541ff9c    sd       t0, 64(sp)
---- 0x3541ffa0 (Opcode: iconst_0, Stack: 64)
|-> 0x32fcecd0 [64] -> 0x330046c8 0x0 0x0 0x32f7dda8 0x200002e9a8 0x32fc当地 0x3541d7
0x3541ffb0    li       t0, 0
0x3541ffb4    sw       t0, 60(sp)
---- 0x3541ffb8 (Opcode: iconst_1, Stack: 56)
|-> 0x32fcecd0 [56] -> 0x0 0x330046c8 0x0 0x0 0x32f7dda8 0x200002e9a8 0x32fc当地 0x3541d7
0x3541ffc8    li       t0, 1
0x3541ffcc    sw       t0, 52(sp)
---- 0x3541ffd0 (Opcode: iastore, Stack: 48)
|-> 0x32fcecd0 [48] -> 0x1000000000 0x0 0x330046c8 0x0 0x0 0x32f7dda8 0x200002e9a8 0x3
0x3541ffe0    lw       t2, 52(sp)
0x3541ffe4    lw       t1, 60(sp)
0x3541ffe8    ld       t0, 64(sp)
0x3541ffec    lw       t3, 4088(t0)
0x3541ffff0   bltu   t1, t3, pc + 8
0x3541ffff4   lb       zero, 1(zero)
0x3541ffff8   slli   t1, t1, 2
0x3541ffffc   add      t1, t1, t0
0x354120000  sw       t2, 0(t1)
```

**At the start of every Bytecode, emit a text sequence:**

LD X0, 1024(X0) # SEGFAULT  
(Number of instructions)  
(Opcode)  
(Stack Offset)

**Print out instructions as DASM(0x12345678) and pipe the output through spike-dasm**

# Foreign-Function Calls

- Two mechanisms: JNI and syscalls
- JNI is bidirectional, requires yield point, keeping track of references, supports varargs, exceptions
- Syscalls are minimal Java-to-C calls

# Much More...

- Exception delivery and bounds checks (across C and Java stack frames)
- Dynamic linker (trampolines, bridges)
- Yield points, root scanning for GC, Interface Method Tables, etc.

# PART II

## 1. Porting JikesRVM to RISC-V

Lessons learnt porting JikesRVM to RISC-V

## 2. Running JikesRVM on RocketChip

Demo of JikesRVM running on real RISC-V hardware

## 3. Managed-Language Use Cases

New research that is enabled by this infrastructure

```
running class initializer for java.util.logging.Logger
invoking method < BootstrapCL, Ljava/util/logging/Logger; >.<clinit> ()V
Initializing runtime compiler
Late stage processing of command line
[VM booted]
Extracting name of class to execute
Initializing Application Class Loader
Turning back on security checks. Letting people see the ApplicationClassLoader.
running class initializer for java.lang.ClassLoader$StaticData
invoking method < BootstrapCL, Ljava/lang/ClassLoader$StaticData; >.<clinit> ()V
RVMClassLoader.getApplicationClassLoader(): Initializing Application ClassLoader, with repositories: `.'...
RVMClassLoader.getApplicationClassLoader(): ...initialized Application classloader, to SystemAppCL
Creating main thread
Constructing mainThread
Starting main thread
Boot sequence completed; finishing boot thread
```



bash-4.4#

# MIDAS Performance Results

| Benchmarks | Instructions (B) | Simulated Time (s) |
|------------|------------------|--------------------|
| avrora     | 118.0            | 311.8              |
| luindex    | 47.4             | 103.5              |
| lusearch   | 263.5            | 597.2              |
| pmd        | 158.5            | 346.8              |
| sunflow    | 504.8            | 1,352.9            |
| xalan      | 190.8            | 466.4              |

Default input sizes, >1 trillion instructions

# PART III

## 1. Porting JikesRVM to RISC-V

Lessons learnt porting JikesRVM to RISC-V

## 2. Running JikesRVM on RocketChip

Demo of JikesRVM running on real RISC-V hardware

## 3. Managed-Language Use Cases

New research that is enabled by this infrastructure

# Hardware-Software Co-Design

Grail Quest: A New Proposal for Hardware-assisted Garbage Collection

Martin Maas Krste Asanović John Kubiatowicz  
University of California, Berkeley

## ABSTRACT

Many important systems are written in garbage-collected languages and GC has a substantial impact on throughput, responsiveness and predictability of these systems. However, despite decades of research, there is still no "Holy Grail" of GC: a collector with no memory overhead, no performance overhead, and no latency overhead that needs to achieve freedom from pauses, high GC throughput and good responsiveness. This is because GC needs to move many threads or use substantial amounts of compute resources.

In this paper, we propose a step towards this elusive goal by revisiting the basic idea of GC. We argue that the time is right for the trends that make it the perfect time to revisit this approach and present our proposal for a hardware-assisted GC system to reconcile the conflicting goals. Our system is work in progress and we discuss design choices, trade-offs and open questions.

## 1. INTRODUCTION

A substantial portion of big data frameworks – and large-scale distributed workloads in general – are written in languages with Garbage Collection (GC), such as Java, Scala, Python or R. It is important for a wide range of workloads. Garbage Collection has had tremendous research efforts for over 30 years. Yet, we arguably still don't have what has been called the "Holy Grail" of GC [1]: a pause-free collector that achieves high throughput and high GC throughput (i.e., sustaining high allocation rates) preferentially without a large resource cost for the application.

Many recent GC innovations have focused on the first three goals. Most modern GCs can be made effectively pause-free at the cost of slower down times and threads and using a substantial amount of resources. Moreover, these approaches oftentimes ignore another factor that is very important in workplace-scale compute: energy efficiency. Previous work [2] has shown that GC can account for up to 25% of energy and 40% of execution time in common workloads (10% on average). Worse, as big data systems are processing ever larger heaps, these numbers will likely increase.

We believe that we can reverse these timelines and energy efficiency by revisiting the old idea of moving GC into hardware. Our goal is to build a GC that simultaneously achieves high GC throughput, good memory utilization, pause times, and low hardware costs (e.g., LLC misses and energy efficiency). We build an algorithm that performs well on the first three criteria but is resource-intensive [3]. Our key insight is that this algorithm can be very efficient by moving it into hardware, combined with some hardware support for GC [3–7]. However, none of these schemes has been widely adopted. We believe that there are three reasons

that have prevented this from happening. First, overheads of concurrent GCs are often high and slowdowns spread throughout the execution of the program. We move the culprits (primarily barriers) into hardware to alleviate their impact and allow out-of-order cores to speculate over them. Second, the most resource-intensive phases of a GC (memory reclamation) are a poor fit for general-purpose cores. We move them into accelerators closer to DRAM, and power and area.

## 2. BACKGROUND

There is a large body of work that has been published on GC. Jones and Lins [8] provide a general introduction.

There are two fundamental GC strategies: tracing and reference counting. Tracing collectors start from a set of roots (such as static or stack variables), perform a

# Grail Quest: A New Proposal for Hardware-Assisted Garbage Collection

## 6th Workshop on Architectures and Systems for Big Data (ASBD '16), Seoul, Korea, June 2016

# Motivating Example

We found that the distortion introduced [by the method] unacceptably large and erratic. For example, with the GenMS collector, the [benchmark] reports a 12% to 33% increase in runtime versus running [without].

Modifiable hardware enables fine-grained measurement and injection of language-level data without disturbing the application performance

## Quantifying the Performance of Garbage Collection vs. Explicit Memory Management

Matthew Hertz<sup>\*</sup>  
Computer Science Department  
Canisius College  
Buffalo, NY 14208  
matthew.hertz@canisius.edu

Emery D. Berger  
Dept. of Computer Science  
University of Massachusetts Amherst  
Amherst, MA 01003  
emery@cs.umass.edu

Garbage collection yields numerous software engineering benefits, but its quantitative impact on performance remains elusive. One can compare the performance of garbage collectors with explicit memory management in C+++ programs by looking at an appropriate collector. This kind of direct comparison is not possible for languages like Java or C# because their memory management programs in these languages naturally do not contain calls to `free`. Thus, we propose a gap analysis and an experimental methodology of explicit memory management and non-copying garbage collection remains unknown.

We propose a new experimental methodology that lets us quantify the performance of precise garbage collection versus explicit memory management. We use a C++ program that contains many programs as it they used explicit memory management by relying on `malloc` to insert calls to `free`. These oracles are generated from a real-world application in an architectural simulation. By executing inside an architecturally-detailed simulator, this "oracle" language provides a precise and detailed model of memory management while measuring the cost of calling `malloc` and `free`. We evaluate different trade-offs between standard oracle that aggressively frees objects immediately after their creation and a conservative-based oracle that conservatively frees objects just after they are last used. We also compare the range of possible placement of explicit deallocation calls.

Our results show that both standard and non-copying garbage collectors across a range of benchmarks using the Oracle memory manager, and present (real and simulated) runtimes for each. Our results show that the performance cost of the time-space tradeoff of garbage collection, with five times as much memory as the standard oracle, is negligible. The standard non-copying memory space matches the performance of readability-based explicit memory management. With only twice as much memory, garbage collection degrades performance by nearly 70%. When garbage collection degrades performance by nearly 70%. When

\*Work performed at the University of Massachusetts Amherst.

Permission to make digital or hard copies of all or part of this work for personal research and educational purposes is granted by the author(s) and editor(s). To copy otherwise, or to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. OOPSLA '05, October 16–20, 2005, San Diego, California, USA. Copyright 2005 ACM 1-59593-031-0/05/0010 \$5.00.

It is possible to measure the garbage collection activity of a program [19, 32, 56]. However, this is not possible to subtract garbage collection's effect on runtime performance. It degrades performance by degrading locality and reorganizing memory. It also degrades locality, especially when physical memory is scarce [61]. Subtracting the costs of garbage collection from the costs of explicit memory management, garbage collectors can provide by immediately recycling freed memory [53, 55, 57, 58]. For all these reasons, the costs of precise,

# Memory Allocation Latency



Sampling Rate: 1 KHz

# Memory Allocation Latency



Sampling Rate: 1 KHz

# Memory Allocation Latency



Sampling Rate: 1 KHz



Every Allocation

# Logging Memory Allocations



All memory allocations in a program  
(color indicates allocation class size)

# DRAM Row Misses



Dacapo Java Benchmarks on FPGA RISC-V core, FCFS open-page memory access scheduler. 800 Billion cycles @ 30MHz

# DRAM Row Misses



# Conclusion

# Conclusion



Combining **RISC-V** and **JikesRVM** enables new hardware-software research that **modifies the hardware, operating system and managed runtime**



# Announcement



The RISC-V Foundation is launching a new **J Extension Work Group** to add managed-language support to RISC-V!

If you would like to get involved, talk to me or David Chisnall ([david.chisnall@cl.cam.ac.uk](mailto:david.chisnall@cl.cam.ac.uk))

# Thank you! Any Questions?



**Martin Maas**

**Krste Asanovic**

**John Kubiatowicz**

+Thanks to David Biancolin, Christopher Celio, Sagar Karandikar, Donggyu Kim and all of UCB-BAR

**Acknowledgements:** Research was partially funded by DARPA Award Number HR0011-12-2-0016, DOE grant #DE-AC02-05CH11231, the STARnet Center for Future Architecture Research (C-FAR), and ASPIRE Lab sponsors and affiliates Intel, Google, HPE, Huawei, LGE, NVIDIA, Oracle, and Samsung.