

# Hardware–Software Co-Design: Not Just a Cliché

Adrian Sampson

James Bornholt

Luis Ceze

University of Washington

SNAPL 2015






Timeline (not to scale): time immemorial, 2005, 2015

**free lunch**

exponential single-threaded performance scaling!





we'll scale the  
number of cores  
instead

The multicore transition  
was a **stopgap**,  
not a panacea.



**Application**

**Language**

**hardware–software abstraction boundary**

parallelism

data  
movement

**Architecture**

guard  
bands

energy  
costs

**Circuits**

lessons learned from

# **Approximate Computing**

---

## **New Opportunities**

for hardware–software co-design

**Application** / **Language**: type systems, debuggers, probabilistic guarantees, auto-tuning

**new abstractions for incorrectness**

**Architecture** / **Circuits**: flaky functional units, lossy cache compression, neural acceleration, drowsy SRAMs
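One way to picture a language-level abstraction for incorrectness is a type qualifier that separates approximate data from precise data. The sketch below is only an illustration and is not taken from the talk: `Approx`, `endorse`, and `precise_index` are hypothetical names, and the check runs dynamically in Python, whereas a real type system would reject the bad flow at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Approx:
    """Hypothetical wrapper: a value that unreliable hardware may have perturbed."""
    value: float

    def __add__(self, other):
        # Approximation is contagious: approx + anything stays approx.
        return Approx(self.value + _raw(other))

    __radd__ = __add__

    def __mul__(self, other):
        return Approx(self.value * _raw(other))

    __rmul__ = __mul__


def _raw(x):
    """Unwrap an Approx for arithmetic; precise numbers pass through."""
    return x.value if isinstance(x, Approx) else x


def endorse(x: Approx) -> float:
    """Explicit, auditable escape hatch from approximate to precise."""
    return x.value


def precise_index(i) -> int:
    """A precise-only sink, such as an array index or a branch condition."""
    if isinstance(i, Approx):
        raise TypeError("approximate value used where a precise one is required")
    return int(i)


pixel = Approx(0.5) * 2 + 1            # the result is still approximate
index = precise_index(endorse(pixel))  # allowed only after an explicit endorse
```

Passing `pixel` to `precise_index` without `endorse` raises `TypeError`, so every place where imprecision may leak into control flow or addressing is visible in the program text.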

# The von Neumann curse



other crud  
we don't care about  
and can't fix

# Hardware design costs sanity & well-being



Thierry Moreau,  
FPGA design champion

[Moreau et al.; HPCA 2015]

# Trust your compiler

approximate cache



line state bits?

## New Opportunities

for hardware–software co-design

# More hardware flexibility that humans can actually program



**FPGA**

- explicit data movement
- explicit memory blocks
- explicit physical routing
- explicit clock frequency
- explicit ILP
- explicit numeric bit width
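
The last bullet, explicit numeric bit width, is easy to illustrate in software. The sketch below is a simplified model and not FPGA tooling: `quantize` is a hypothetical helper that rounds a value to signed fixed point with a chosen total width and number of fractional bits, saturating on overflow.

```python
def quantize(x: float, frac_bits: int, total_bits: int) -> float:
    """Round x to signed fixed point (total_bits wide, frac_bits after the point).

    Values outside the representable range saturate instead of wrapping.
    """
    scale = 1 << frac_bits                 # 2**frac_bits steps per unit
    lo = -(1 << (total_bits - 1))          # most negative raw integer
    hi = (1 << (total_bits - 1)) - 1       # most positive raw integer
    raw = round(x * scale)                 # nearest representable step
    raw = max(lo, min(hi, raw))            # saturate on overflow
    return raw / scale


narrow = quantize(0.3, frac_bits=4, total_bits=8)     # coarse: steps of 1/16
wide = quantize(0.3, frac_bits=12, total_bits=16)     # finer: steps of 1/4096
clipped = quantize(100.0, frac_bits=4, total_bits=8)  # out of range, saturates
```

Narrower widths typically cost less hardware but lose precision and range, which is the kind of trade-off the designer has to spell out by hand.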

## A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services

Andrew Putnam   Adrian M. Caulfield   Eric S. Chung   Derek Chiou<sup>1</sup>  
Kypros Constantinides<sup>2</sup>   John Demme<sup>3</sup>   Hadi Esmaeilzadeh<sup>4</sup>   Jeremy Fowers  
Gopi Prashanth Gopal   Jan Gray   Michael Haselman   Scott Hauck<sup>5</sup>   Stephen Heil  
Amir Hormati<sup>6</sup>   Joo-Young Kim   Sitaram Lanka   James Larus<sup>7</sup>   Eric Peterson  
Simon Pope   Aaron Smith   Jason Thong   Phillip Yi Xiao   Doug Burger

Microsoft

### Abstract

*Datacenter workloads demand high computational capabilities, flexibility, power efficiency, and low cost. It is challenging to improve all of these factors simultaneously. To advance datacenter capabilities beyond what commodity server designs can provide, we have designed and built a composable, reconfigurable fabric to accelerate portions of large-scale software services. Each instantiation of the fabric consists of a 6x8 2-D torus of high-end Stratix V FPGAs embedded into a half-rack of 48 machines. One FPGA is placed into each server, accessible through PCIe, and wired directly to other FPGAs with pairs of 10 Gb SAS cables.*

*In this paper, we describe a medium-scale deployment of this fabric on a bed of 1,632 servers, and measure its efficacy in accelerating the Bing web search engine. We describe the requirements and architecture of the system, detail the*

desirable to reduce management issues and to provide a consistent platform that applications can rely on. Second, datacenter services evolve extremely rapidly, making non-programmable hardware features impractical. Thus, datacenter providers are faced with a conundrum: they need continued improvements in performance and efficiency, but cannot obtain those improvements from general-purpose systems.

Reconfigurable chips, such as Field Programmable Gate Arrays (FPGAs), offer the potential for flexible acceleration of many workloads. However, as of this writing, FPGAs have not been widely deployed as compute accelerators in either datacenter infrastructure or in client devices. One challenge traditionally associated with FPGAs is the need to fit the accelerated function into the available reconfigurable area. One could virtualize the FPGA by reconfiguring it at run-time to support more functions than could fit into a single device. However, current reconfiguration times for standard FPGAs

**23 authors!**

# Trust, but formally verify



e.g., [Hunt and Larus; OSR April 2007]

# Hardware beyond core computation



- power supply & battery
- new memory technologies
- mobile display & backlight

**free lunch**

**multicore era**

**the era  
of language  
co-design?**



