

# **Virtualization So Light, it *FLOATS!***

## **Accelerating Floating Point Virtualization**

**Nick Wanninger**, Nadharm Dhiantravan, Peter Dinda

Northwestern | Pab

# There are several alternatives to Floating Point

- AI Model quantization: float8, bfloat16, etc.
- Posit/Unum, rationals, arbitrary precision floating point, Bfloats, logarithmic arithmetic, ...
- **A whole conference dedicated to this**



32<sup>nd</sup> IEEE International Symposium on Computer Arithmetic



El Paso, TX, USA. May 4-7, 2025.

<https://www.arith2025.org/>

# Changing number systems *will* change results.
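As a small illustration of this point, here is a sketch in C where the same accumulation loop is run in two number formats. The function names are invented for this example; the only thing that differs between them is the arithmetic used.

```c
#include <stddef.h>

/* Add 0.1 to an accumulator n times using 32-bit floats. */
float sum_tenths_float(size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++) {
        acc += 0.1f;
    }
    return acc;
}

/* The same loop, carried out in 64-bit doubles. */
double sum_tenths_double(size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++) {
        acc += 0.1;
    }
    return acc;
}
```

With `n = 1000000`, the double version should land very close to 100000, while the float version drifts visibly away from it, even though the source code is otherwise identical.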



# Switching to these systems is nontrivial

```
double op(float a, float b, float c) {
    return a * b + c;
}

void mpfr_op(mpfr_t result, mpfr_t a, mpfr_t b, mpfr_t c) {
    mpfr_mul(result, a, b, MPFR_RNDN); // result = a * b
    mpfr_add(result, result, c, MPFR_RNDN); // result += c
}
```

# The entire code structure needs to change!

Manually manage  
memory lifetimes of  
your numbers!

*Imagine needing to worry about  
this in something like CESM!*

# We want scientists to be able to experiment with these things



**We want to *write* applications with the semantics of hardware floating point**

**But have it *execute* using some alternative arithmetic!**

# Floating Point Virtualization

- Have the program *think* it is using hardware floating point
- But swap it out, transparently through virtualization

(HPDC'22)

[nickw.io/papers/hpdc22.pdf](http://nickw.io/papers/hpdc22.pdf)

## FPVM: Towards a Floating Point Virtual Machine

Peter Dinda  
Northwestern University

Nick Wanninger  
Northwestern University

Jiacheng Ma  
Northwestern University

Alex Bernat  
Northwestern University

Charles Bernat  
Northwestern University

Souradip Ghosh  
Northwestern University

Christopher Kraemer  
Northwestern University

Yehya Elmarsi  
Northwestern University

### Abstract

Alternatives to IEEE floating point arithmetic have become all the rage. Some extract more representational power out of the available bits. Others offer the potential for lower or higher precision than is available in IEEE-compatible hardware. Even an interface that is available in hardware has received some problems. Using such alternatives in scientific and engineering systems, and in other significant codebases, is a major challenge, however. We explore how to address this challenge through virtualizing the IEEE floating point hardware, specifically on x64. The goal of the floating point virtual machine (FPVM) is to allow an existing IEEE floating point binary to be seamlessly extended to support the desired alternative arithmetic system with overheads determined by that system and not the virtualization mechanism. We describe the prospects, issues, and tradeoffs for four different approaches for building FPVMs: user-space, kernel-space, hardware-assisted, and hardware transformation. We then describe the design and implementation of our current design, which combines static binary analysis/translation and trap-and-emulate execution. We evaluate our FPVM implementation on several benchmarks, virtualizing them to use posits and MPFR. Finally, we comment on kernel- and hardware-level innovations that could further reduce overheads for floating point virtualization.

### CCS Concepts

• Software engineering → Operating systems; Virtual machines; Correctness; Software reliability; Operational analysis; Mathematics of computing → Numerical analysis; Arbitrary-precision arithmetic.

### Keywords

floating point arithmetic, virtualization, software development, IEEE 754

This project was supported by the United States National Science Foundation via grants CNS-1767833, CCF-2028811, and CCF-2119099.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee for those that copy the article and the full citation on the first page. Copyright for components of this work owned by others than the author(s) must be honored. All other rights reserved. Authorization to copy items for internal or personal use, or the internal or personal use of specific clients, is granted by the copyright owner for users registered with the Copyright Clearance Center (CCC) Transactional Reporting Service, provided that the base fee of \$10.00 plus 10¢ per page per article is paid directly to CCC, 27 Congress Street, Salem, MA 01970, USA. HPDC '22, June 27-July 1, 2022, Minneapolis, MN, USA © 2022 Copyright held by the owner/author(s). Publication rights licensed to ACM. ACM ISBN 978-1-4503-9900-0/22/06...\$15.00  
<https://doi.org/10.1145/3502181.3551469>

ACM Reference Format:  
Peter Dinda, Nick Wanninger, Jiacheng Ma, Alex Bernat, Charles Bernat, Souradip Ghosh, Christopher Kraemer, and Yehya Elmarsi. 2022. Towards a Floating Point Virtual Machine. In Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC '22), June 27-July 1, 2022, Minneapolis, MN, USA. 14 pages. <https://doi.org/10.1145/3502181.3551469>

### 1 Introduction

Virtually all applications in scientific and engineering domains, as well as applications built on machine learning techniques, make extensive use of IEEE 754 floating point arithmetic [32, 33] through its numerous implementations. Floating point has proven to be extremely effective at enabling high performance while providing behavior that is sensible to a knowledgeable developer.

Many applications are being developed on floating point hardware implementations, as well as being challenged along these fronts. First, alternatives such as unums/posits [26, 32], bfloats [38], logarithmic arithmetic [5], and others [29, 43] potentially extract more useful representational value out of the same number of bits, or use range-preserving methods that are amenable for modern workflows such as machine learning. The second front involves using these representations, as well as IEEE floating point arithmetic (for example in GNU MPFR [23] or libBF [2]), at arbitrary precisions, including much higher precision than the hardware supports. This is being done by either translating between floating point and related representations altogether in favor of an API to the real numbers [11]. Such an API would allow programmers to reason about their code using the rules of standard arithmetic and achieve reasonable performance in many cases. This approach (or higher precision) might also mitigate the effects of misunderstanding how numbers have about various aspects of IEEE floating point [18, 20].

**Limitations of state-of-the-art approaches:** Despite their benefits, using alternative arithmetic systems within an existing ecosystem is challenging. One common challenge is that the floating point API is often not designed to be used with floating point numbers. A scenario is having to rewrite the application using a new API. A more pleasant scenario is when the programming language supports pluggable number representations, such as Fortran 90's kind parameter for type specifications, or the recent VFPfloat [35, 36] extension to C/C++ that allows one to define their own floating point type. However, these APIs are not always available in compilers, much less source code, but they still must deal with cross-language compatibility (if ever possible) and update and rebuild any libraries their codebase uses. Of course, these become daunting tasks for a large application. Additionally, any freshly rebuilt application may need

A user can execute their “*blessed binary*” under FPVM simply:

```
$ fpvm run ./solve_climate_change input.csv
```

*Without recompiling*

# FPVM is a Virtual Machine

- No **hardware support** for virtualized floating point
  - So we simulate it using **software**
- Configure the hardware to **trap** when rounding, overflow, etc., occur.
  - **Emulate** the instruction in software with a different arithmetic system
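To make "an instruction that rounds" concrete, here is a minimal sketch for x86-64 that asks the hardware whether a division was inexact. It only reads the sticky status flag in the MXCSR register; FPVM instead configures the hardware so the event becomes a trap, which this sketch deliberately avoids so that no signal handler is needed. The function name is made up for this example.

```c
#include <xmmintrin.h>  /* _MM_SET_EXCEPTION_STATE, _MM_GET_EXCEPTION_STATE */

/* Returns 1 if a / b had to be rounded, 0 if the quotient is exact. */
int division_rounds(double a, double b) {
    /* volatile keeps the compiler from folding the divide away */
    volatile double x = a, y = b, q;

    _MM_SET_EXCEPTION_STATE(0);  /* clear the sticky MXCSR status flags */
    q = x / y;                   /* the hardware sets the inexact flag if it rounds */
    (void)q;

    return (_MM_GET_EXCEPTION_STATE() & _MM_EXCEPT_INEXACT) != 0;
}
```

For instance, `division_rounds(1.0, 3.0)` should report rounding, while `division_rounds(1.0, 4.0)` should not, since 0.25 is exactly representable.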

# Let's say we have an instruction which rounds

```
add    %rax,%r14  
add    %r15,%rax  
mulsd %xmm4,%xmm0  
addsd (%r14),%xmm0  
movsd %xmm0,(%r14)
```

# The hardware catches this and tells the kernel

```
add    %rax,%r14  
add    %r15,%rax  
mulsd %xmm4,%xmm0  
addsd (%r14),%xmm0  
movsd %xmm0,(%r14)
```

Instruction  
"faults"

Kernel  
receives the  
trap

# ... which delegates the fault to FPVM with SIGFPE



# FPVM then emulates this instruction at a higher precision (e.g., 200 bit MPFR)



# There's one problem with this...



# Solution: NaN boxing



We put a **pointer** into the register.

(Disguised as a NaN)

This gives us a big benefit!

**Future accesses to  
this value will also  
trap into FPVM!**
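A minimal sketch of the idea in C follows. It assumes a 64-bit platform whose user-space pointers fit in 48 bits, and it assumes the box is encoded as a signaling NaN (quiet bit clear) so that hardware arithmetic on it raises an exception. The exact bit layout FPVM uses is not shown in these slides, so the constants and function names here are illustrative only.

```c
#include <stdint.h>
#include <string.h>

#define EXP_ALL_ONES UINT64_C(0x7FF0000000000000) /* exponent field of NaN/Inf */
#define QUIET_BIT    UINT64_C(0x0008000000000000) /* clear => signaling NaN    */
#define PAYLOAD_MASK UINT64_C(0x0000FFFFFFFFFFFF) /* low 48 bits hold pointer  */

/* Hide a non-NULL pointer in the payload of a signaling NaN. */
double box_pointer(const void *p) {
    uint64_t bits = EXP_ALL_ONES | ((uint64_t)(uintptr_t)p & PAYLOAD_MASK);
    double d;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* True if d is a signaling NaN carrying a non-zero 48-bit payload. */
int is_boxed(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits & (EXP_ALL_ONES | QUIET_BIT)) == EXP_ALL_ONES
        && (bits & PAYLOAD_MASK) != 0;
}

/* Recover the pointer stored by box_pointer. */
void *unbox_pointer(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (void *)(uintptr_t)(bits & PAYLOAD_MASK);
}
```

The program sees an ordinary `double`, while the runtime can follow the hidden pointer to a value of any size, such as a 200-bit MPFR number.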

# Solution: NaN boxing



This indirection also means FPVM has to include a garbage collector, though...

# FPVM Supports four alternative arithmetic systems

## Vanilla

Evaluate using IEEE  
Floating point  
hardware

## Boxed

Vanilla, but with  
*NaN*-boxed values

## MPFR

Use arbitrary  
precision floats  
from the MPFR  
library

## Posits

Experimental  
bindings to the  
posits alternative  
arithmetic system

# These are broken down into two groups

*Correctness Validation*: **Vanilla** and **Boxed**

*Real alternatives to IEEE floating point*: **MPFR** and **Posits**

# We'll focus on **Boxed** in this talk

**Boxed is a minimal system that  
amplifies virtualization overhead**

# **Unfortunately,**

**x86 is not fully floating point virtualizable.**

We aren't going to get traps for **all** of the operations that need them to maintain correctness.

```
double x = ...;  
long   y = *(long*)&x;
```

Treating floats as ints  
won't act right with NaNs

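```
double x = ...;
double z = -x;

# The negation compiles to a bitwise XOR of the sign bit, so no FP exception fires:
movsd ..., %xmm0
xorpd sign_mask(%rip), %xmm0   # sign_mask: hypothetical label for the constant 1 << 63
```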

The evil compiler  
thinks it's *clever...*
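Both cases can be reproduced in plain C. The sketch below (helper names are made up) reads a double's bits as an integer and negates a double by flipping bit 63. Neither operation goes through the floating point unit's arithmetic, so neither can raise a floating point exception, and a NaN-boxed value would pass through unnoticed.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a double as an integer: the well-defined
 * way to write *(long*)&x. */
uint64_t bits_of(double d) {
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Negate by flipping the sign bit, as a compiler does with xorpd. */
double negate_by_xor(double d) {
    uint64_t bits = bits_of(d) ^ (UINT64_C(1) << 63);
    double out;
    memcpy(&out, &bits, sizeof out);
    return out;
}
```

On an ordinary value this is harmless: `negate_by_xor(2.5)` is `-2.5`. On a boxed value it merely sets the top bit of the NaN, so the number the box points to is never actually negated, and `bits_of` returns the box encoding instead of the number's bits.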

# Binary code analysis to the rescue!

| C source              | Compiled x86-64 (Intel syntax) |
|-----------------------|--------------------------------|
| extern double fp;     | foo:                           |
| int foo (double fp) { | push rbp                       |
| return *(int*) &fp;   | mov rbp, rsp                   |
| }                     | movsd QWORD PTR [rbp-8], xmm0  |
|                       | lea rax, [rbp-8]               |
|                       | mov eax, DWORD PTR [rax]       |
|                       | pop rbp                        |
|                       | ret                            |

**FPVM featured a binary analysis to *find these situations***

# It then inserts “correctness traps”



A trap to FPVM would be inserted here to “*demote*” eax back to a float

This work:

# **Virtualization So Light, it *FLOATS!***

## **Accelerating Floating Point Virtualization**

**Nick Wanninger**, Nadharm Dhiantravan, Peter Dinda

Northwestern | Pab

# **FPVM's performance has left room for improvement.**

It enabled transparent swapping of arithmetic systems

But... some applications had **6,000x slowdown**

# Our baseline performance overheads



# Breaking down the virtualization overhead

For each emulated instruction, the majority of the overhead comes from **signal delivery** and **returning to the next instruction**



Ideally alternative math would be the *only* overhead



# Everything else is virtualization overhead



# FPVM was between 10 and 20x slower than our goal of zero-cost virtualization



**The goal of this paper is to get the *cost of virtualization* down to zero.**

# We do this with three techniques

Trap Short  
Circuiting

Sequence  
Emulation

Profiler based  
correctness traps

# **Trap short circuiting first**

**Trap Short  
Circuiting**

Sequence  
Emulation

Profiler based  
correctness traps

# Let's take a closer look at the overheads



**This is a non-trivial, large, multi-physics hydrodynamic astrophysical application**

<https://enzo-project.org/>

# We have a few intrinsic overheads



# This test uses the minimum overhead **altmath**



The “worst case”  
system for us: Boxed

# But a few of these are solvable software problems



# In this work, we'll focus on the signal overheads



# **Let's attack the problem head on**

- The FPVM runtime needs to be notified of floating point exceptions
- Existing signal mechanisms are designed to be general purpose, and relatively rare
- ... and as a result, are not as fast as they could be.

**So let's just replace signals!**

# Regular signal delivery is expensive



# Sigreturn is also slow!



# Trap Short Circuiting bypasses the signals



# Trap short circuiting reduces overheads *substantially*

- Kernel time is reduced by over 10x
- It's now basically free to return from FPVM
- Overall overheads drop by ~6x



# This improvement is consistent



**There's more we can do, though.**

Trap Short  
Circuiting

**Sequence  
Emulation**

Profiler based  
correctness traps

# FPVM emulation tends to cascade

|       |              |
|-------|--------------|
| addsd | %xmm0, %xmm1 |
| mulsd | %xmm0, %xmm0 |
| divsd | %xmm0, %xmm2 |

If this instruction traps

# FPVM emulation tends to cascade

|       |              |
|-------|--------------|
| addsd | %xmm0, %xmm1 |
| mulsd | %xmm0, %xmm0 |
| divsd | %xmm0, %xmm2 |

So will this one

# Sequence emulation amortizes overheads across instructions

|       |              |       |
|-------|--------------|-------|
| addsd | %xmm0, %xmm1 | Trap! |
| mulsd | %xmm0, %xmm0 |       |
| divsd | %xmm0, %xmm2 |       |

# Sequence emulation amortizes overheads across basic blocks

|       |              |
|-------|--------------|
| addsd | %xmm0, %xmm1 |
| mulsd | %xmm0, %xmm0 |
| divsd | %xmm0, %xmm2 |

⋮

We emulate all of these!

# **Sequence emulation amortizes overheads across instructions**

```
addsd    %xmm0, %xmm1  
mulsd    %xmm0, %xmm0  
divsd    %xmm0, %xmm2
```

⋮

**So we only pay exception handling once!**

# We have to be careful though!

|       |              |
|-------|--------------|
| addsd | %xmm0, %xmm1 |
| mulsd | %xmm0, %xmm0 |
| divsd | %xmm0, %xmm2 |
| movsd | (...), %xmm2 |
| addsd | %xmm0, %xmm2 |

# We have to be careful though!

```
addsd    %xmm0, %xmm1  
mulsd    %xmm0, %xmm0  
divsd    %xmm0, %xmm2  
movsd    (...), %xmm2  
addsd    %xmm0, %xmm2
```

Most FP sequences are broken up by  
a few **NON-FP** instructions!

# We extended FPVM to emulate these instructions

```
addsd    %xmm0, %xmm1  
mulsd    %xmm0, %xmm0  
divsd    %xmm0, %xmm2  
movsd    (...), %xmm2  
addsd    %xmm0, %xmm2
```
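The amortization can be sketched with a toy trap counter. This is a simplified model, not FPVM's emulator: it assumes every floating point arithmetic instruction in the sequence would trap (the cascade case), and the type and function names are invented for the example.

```c
#include <stddef.h>

enum kind { FP_ARITH, NON_FP };   /* e.g. addsd vs. the movsd in the slide */
enum mode {
    SINGLE_STEP,      /* emulate only the instruction that trapped     */
    FP_SEQUENCES,     /* keep going while the next instruction is FP   */
    FP_AND_NON_FP     /* also emulate the non-FP instructions between  */
};

static int can_emulate(enum kind k, enum mode m) {
    return k == FP_ARITH || m == FP_AND_NON_FP;
}

/* Count how many traps are taken while running prog[0..n). */
size_t count_traps(const enum kind *prog, size_t n, enum mode m) {
    size_t traps = 0, i = 0;
    while (i < n) {
        if (prog[i] != FP_ARITH) {   /* runs natively, no trap */
            i++;
            continue;
        }
        traps++;                     /* this instruction faults into the emulator */
        i++;                         /* ...which emulates it */
        if (m == SINGLE_STEP) {
            continue;
        }
        while (i < n && can_emulate(prog[i], m)) {
            i++;                     /* emulate followers without returning */
        }
    }
    return traps;
}
```

For the five-instruction sequence above (three arithmetic instructions, a `movsd`, then another arithmetic instruction), this model takes 4 traps in `SINGLE_STEP` mode, 2 in `FP_SEQUENCES` mode, and 1 in `FP_AND_NON_FP` mode.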

# Combining these solutions nearly eliminates kernel overhead



# **Very quickly, our last technique...**

Trap Short  
Circuiting

Sequence  
Emulation

**Profiler based  
correctness traps**

# This technique attacks the *User Experience*

The previous technique to insert correctness traps could take **weeks** to complete.

This is because it attempts to solve an ***unsolvable problem***  
(alias analysis)



# We replaced this analysis with a *profiler*

- Run your program *once* through a profiler
- “Representative workload”
- Analysis times down from **weeks** to **minutes**
- FPVM can now run many more programs!



# **Results**

# Altmath now dominates across the board



# Using *boxed math*, overheads reduce by up to ~10x



# Virtualization overheads are also reduced



# We are *much* closer to zero-cost virtualization



# The overhead can get *even lower* with a more expensive **altmath** like MPFR



# Conclusion

- We bypass signals with *trap short circuiting*
- We emulate more instructions with *sequence emulation*
- We reduce the time to do correctness analysis from **weeks** to **minutes**
- All of which reduces the overhead of virtualization *around* the *alternative math* library down to as low as **1.35x** with MPFR

Sequence Emulation

Trap Short Circuiting

Profiler based correctness traps

Thanks!



Download our  
paper!

# **BACKUP SLIDES**





## Traditional Traps



## Magic Traps



Magic Traps bypass the kernel

## Application Slowdown



## Slowdown from lower bound





## Instruction Rank Popularity



## CDF of Instruction Sequence Length











MPFR altmath overheads