

# Demystifying the Characteristics of 3D-Stacked Memories: A Case Study for the Hybrid Memory Cube (HMC)

Ramyad Hadidi, Bahar Asgari, Burhan Ahmad Mudassar,  
Saibal Mukhopadhyay, Sudhakar Yalamanchili, and Hyesoon Kim

IISWC'17 Talk






# Memory Evolution






# 3D-Stacking Technology

Provides opportunities & novel features

3D-DRAMs:

- ▶ Provide higher bandwidth and density
- ▶ Enable lower power consumption
- ▶ Motivate processing-in-memory

HMC is an example of such memories.



# New Considerations


New **internal organization**

New **thermal** behavior

New **latency** and **bandwidth** hierarchy

New packet-switched **interface**





# Contributions


We evaluate a real system with HMC 1.1 to:

- Study new memory organization
- Present bandwidth, power, and temperature relationships
- Investigate required cooling power
- Explore contributing factors to latency



The goal is to understand the full-system impact of 3D-stacked memories, and of HMC in particular.



# Hybrid Memory Cube (HMC)


HMC 1.1 (Gen2): 4GB size



Figure labels: logic layer, vault controller, DRAM layer








# HMC Communication I


Follows a serialized **packet-switched** protocol

Packets are partitioned into 16-byte *flits*

Each transfer incurs 1 flit of overhead

| Type       | Read Request | Read Response | Write Request | Write Response |
|------------|--------------|---------------|---------------|----------------|
| Data Size  | Empty        | 1~8 Flits     | 1~8 Flits     | Empty          |
| Overhead   | 1 Flit       | 1 Flit        | 1 Flit        | 1 Flit         |
| Total Size | 1 Flit       | 2~9 Flits     | 2~9 Flits     | 1 Flit         |
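The packet sizes in the table follow from two numbers on the slide: 16-byte flits and one flit of overhead per packet. A minimal Python sketch of that arithmetic is given below; the function names are illustrative and not part of any HMC tooling.

```python
FLIT_BYTES = 16      # flit size stated on the slide
OVERHEAD_FLITS = 1   # per-packet overhead stated on the slide
MAX_DATA_FLITS = 8   # largest data payload in the table (128 B)


def packet_flits(payload_bytes: int) -> int:
    """Total flits in a packet that carries `payload_bytes` of data."""
    if payload_bytes % FLIT_BYTES != 0:
        raise ValueError("payload must be a multiple of 16 B")
    if not 0 <= payload_bytes <= MAX_DATA_FLITS * FLIT_BYTES:
        raise ValueError("payload must be between 0 and 128 B")
    return payload_bytes // FLIT_BYTES + OVERHEAD_FLITS


def read_cost(payload_bytes: int) -> tuple:
    """(request flits, response flits) for a read: empty request, data in response."""
    return packet_flits(0), packet_flits(payload_bytes)


def write_cost(payload_bytes: int) -> tuple:
    """(request flits, response flits) for a write: data in request, empty response."""
    return packet_flits(payload_bytes), packet_flits(0)


def data_fraction(payload_bytes: int) -> float:
    """Share of the data-carrying packet that is payload rather than overhead."""
    return (payload_bytes // FLIT_BYTES) / packet_flits(payload_bytes)


if __name__ == "__main__":
    for size in (16, 32, 64, 128):
        print(size, read_cost(size), write_cost(size), f"{data_fraction(size):.2f}")
```

Under these numbers a 16 B payload fills half of its packet, while a 128 B payload fills eight of nine flits, which is why larger requests use the links more efficiently.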







# HMC Communication II

Two or four full-duplex external links:

- Width of 8 or 16 lanes
- Configurable speeds of 10, 12.5, and 15 Gbps

Our evaluated system: 2 external links, 8 lanes each



# Experimental Setup I


- Pico SC6 Mini
- EX700 Backplane
- AC510 Module
- 4GB HMC 1.1



Setup photo labels: DC power supply (fan speed control), power measurement, 15 W fan; markings at 45 cm, 90 cm, and 45°



A second view of the same setup marks the 15 W fan at 45 cm, 90 cm, and 135 cm.

# Experimental Setup II




FPGA frequency: 187.5 MHz

Modified GUPS (giga updates per second) benchmark

Apply different masks to addresses
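Masking random addresses is a simple way to confine traffic to a chosen subset of vaults and banks. The sketch below illustrates the idea in Python; it is not the authors' FPGA implementation, and the bit positions of the vault and bank fields (`VAULT_SHIFT`, `BANK_SHIFT`) are assumptions made for illustration, since the slide does not give them.

```python
import random

# ASSUMED address layout (not stated on the slide): 7 low-order offset bits,
# then a 4-bit vault field, then a 4-bit bank field.
VAULT_SHIFT, VAULT_BITS = 7, 4
BANK_SHIFT, BANK_BITS = 11, 4
ADDR_BITS = 32  # enough to address 4 GB


def field_mask(shift: int, bits: int) -> int:
    """Bit mask covering `bits` bits starting at bit position `shift`."""
    return ((1 << bits) - 1) << shift


def restrict(addr: int, zero_mask: int) -> int:
    """Force every address bit that is set in `zero_mask` to 0."""
    return addr & ~zero_mask & ((1 << ADDR_BITS) - 1)


def vault_of(addr: int) -> int:
    return (addr >> VAULT_SHIFT) & ((1 << VAULT_BITS) - 1)


def bank_of(addr: int) -> int:
    return (addr >> BANK_SHIFT) & ((1 << BANK_BITS) - 1)


def random_addresses(n: int, zero_mask: int, seed: int = 0) -> list:
    """Generate `n` random addresses with the masked bits cleared."""
    rng = random.Random(seed)
    return [restrict(rng.getrandbits(ADDR_BITS), zero_mask) for _ in range(n)]


def footprint(addrs) -> set:
    """Distinct (vault, bank) pairs touched by a list of addresses."""
    return {(vault_of(a), bank_of(a)) for a in addrs}


if __name__ == "__main__":
    # Clear all vault bits and the two upper bank bits: one vault, four banks.
    mask = field_mask(VAULT_SHIFT, VAULT_BITS) | field_mask(BANK_SHIFT + 2, 2)
    print(sorted(footprint(random_addresses(1000, mask))))
```

Widening or narrowing the mask changes how many vaults and banks the random stream can reach, which is the knob the access-pattern experiments rely on.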



# Access Patterns




Accessing Fewer Banks






Georgia  
Tech

comparch








# Bandwidth




Access Pattern





Accessing 4 banks saturates the bandwidth of 1 vault.  
External bandwidth saturates at 4 vaults.






# Thermal/Power Experiments



| Configuration Name | DC Power Supply: Voltage | DC Power Supply: Current | 15 W Fan Distance | Average HMC Idle Temperature |
|--------------------|--------------------------|--------------------------|-------------------|------------------------------|
| Cfg1               | 12 V                     | 0.36 A                   | 45 cm             | 43.1°C                       |
| Cfg2               | 10 V                     | 0.29 A                   | 90 cm             | 51.7°C                       |
| Cfg3               | 6.5 V                    | 0.14 A                   | 90 cm             | 62.3°C                       |
| Cfg4               | 6.0 V                    | 0.13 A                   | 135 cm            | 71.6°C                       |




# Temperature (read only)











# Temperature & Bandwidth










# Device Power Consumption (read only)






# Device Power & Bandwidth











# Cooling Power Consumption (read only)

Required cooling power to keep the temperature at ($^{\circ}\text{C}$): ◆ 50, ○ 55, ● 60, ▲ 65, ■ 70











# Closed-Page Policy






Payload size: 128B, 112B, 96B, 80B, 64B, 48B, 32B, 16B





# Achieving High Bandwidth


- ▶ Promote bank-level parallelism
- ▶ Remap data to avoid internal organization bottlenecks
- ▶ Concatenate requests to use bandwidth effectively




# Latency Deconstruction











# Latency Deconstruction Summary

TX Path:

| Stage                               | Latency    |
|-------------------------------------|------------|
| Conversion to flits & buffering     | 10 cycles  |
| Round-robin arbitration among ports | 2-9 cycles |
| Add packet fields & flow control    | 10 cycles  |
| Serialization                       | 10 cycles  |
| Transmission (128B)                 | 15 cycles  |

Freq.: 187.5 MHz

Cycle: 5.3 ns
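To put the per-stage cycle counts into time units, the short Python sketch below sums the TX-path stages from the table and converts them with the 187.5 MHz FPGA clock. The dictionary and function names are illustrative only, and the sketch assumes the stages occur back to back with no overlap.

```python
FPGA_FREQ_HZ = 187.5e6           # FPGA frequency from the slide
CYCLE_NS = 1e9 / FPGA_FREQ_HZ    # about 5.3 ns per cycle

# (minimum cycles, maximum cycles) for each TX-path stage in the table
TX_STAGES = {
    "conversion to flits & buffering": (10, 10),
    "round-robin arbitration among ports": (2, 9),
    "add packet fields & flow control": (10, 10),
    "serialization": (10, 10),
    "transmission (128B)": (15, 15),
}


def tx_cycles() -> tuple:
    """Best-case and worst-case TX-path cycles, assuming the stages do not overlap."""
    low = sum(lo for lo, _ in TX_STAGES.values())
    high = sum(hi for _, hi in TX_STAGES.values())
    return low, high


def to_ns(cycles: int) -> float:
    """Convert FPGA cycles to nanoseconds."""
    return cycles * CYCLE_NS


if __name__ == "__main__":
    low, high = tx_cycles()
    print(f"TX path: {low}-{high} cycles, {to_ns(low):.1f}-{to_ns(high):.1f} ns")
```

With the table's values this gives 47 to 54 cycles, or roughly 251 to 288 ns, for the TX path of a 128B request.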





# Low-Load Latency











# High-Load Latency





# Latency-Bandwidth











# Conclusions


- ▶ Mixing read and write requests and using large request sizes lead to effective use of the bi-directional bandwidth.
- ▶ Distributing accesses prevents internal bottlenecks and exploits bank-level parallelism.
- ▶ Controlling the request rate avoids high latency.
- ▶ Employing fault-tolerant mechanisms and using proper cooling solutions enable temperature-sensitive operations to reach a higher bandwidth.
- ▶ Reducing the latency overhead of the infrastructure will greatly reduce overall latency.




---

# Backup Slides

---




# HMC Memory Addressing

Closed-page policy

Page Size = 256 B

Low-order-interleaving address mapping policy

34-bit address field:
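As a rough illustration of low-order interleaving, the Python sketch below decodes an address so that consecutive blocks fall in consecutive vaults, and only then advance to the next bank. The 128 B block size, the 16 vaults, and the 16 banks per vault are assumptions made for this sketch; the exact bit layout of the address field is not reproduced on the slide.

```python
BLOCK_BYTES = 128   # ASSUMED block size
NUM_VAULTS = 16     # ASSUMED number of vaults
NUM_BANKS = 16      # ASSUMED number of banks per vault


def decode(addr: int) -> tuple:
    """Return (vault, bank) for a byte address under low-order interleaving.

    The lowest block-index bits select the vault and the next bits select
    the bank, so sequential blocks are spread over vaults first.
    """
    block = addr // BLOCK_BYTES
    vault = block % NUM_VAULTS
    bank = (block // NUM_VAULTS) % NUM_BANKS
    return vault, bank


if __name__ == "__main__":
    # Walk sequentially through memory one block at a time.
    for i in range(18):
        print(hex(i * BLOCK_BYTES), decode(i * BLOCK_BYTES))
```

Under these assumptions a sequential stream visits all 16 vaults before returning to the first one, so streaming accesses naturally spread across vaults and banks.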





# Experimental Setup III



|              | Full-scale GUPS                                        | Small-scale GUPS         | Stream GUPS                         |
|--------------|--------------------------------------------------------|--------------------------|-------------------------------------|
| Addresses    | Random Configurable Mask                               | Random Configurable Mask | Defined by User                     |
| Request Rate | Maximum                                                | Configurable             | Minimum                             |
| Experiment   | Bandwidth<br>Power<br>Temperature<br>High-Load Latency | Latency-Bandwidth        | Integrity Check<br>Low-Load Latency |



# Thermal Configurations



| Configuration Name | DC Power Supply: Voltage | DC Power Supply: Current | 15 W Fan Distance | Average HMC Idle Temperature |
|--------------------|--------------------------|--------------------------|-------------------|------------------------------|
| Cfg1               | 12 V                     | 0.36 A                   | 45 cm             | 43.1 °C                      |
| Cfg2               | 10 V                     | 0.29 A                   | 90 cm             | 51.7 °C                      |
| Cfg3               | 6.5 V                    | 0.14 A                   | 90 cm             | 62.3 °C                      |
| Cfg4               | 6.0 V                    | 0.13 A                   | 135 cm            | 71.6 °C                      |




# Cooling Power



| Configuration | Cooling Power |
|---------------|---------------|
| Cfg1          | 19.32 W       |
| Cfg2          | 15.90 W       |
| Cfg3          | 13.90 W       |
| Cfg4          | 10.78 W       |
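The cooling-power figures can be compared against the DC supply settings listed in the thermal configurations table. The Python sketch below computes the electrical power drawn from the DC supply (voltage times current) and the remainder of each cooling-power value. How the authors derived the table is not stated on the slides, so the remainder is only an arithmetic observation, and attributing it to the 15 W fan at the given distance would be an assumption.

```python
# (supply voltage in V, supply current in A, reported cooling power in W)
CONFIGS = {
    "Cfg1": (12.0, 0.36, 19.32),
    "Cfg2": (10.0, 0.29, 15.90),
    "Cfg3": (6.5, 0.14, 13.90),
    "Cfg4": (6.0, 0.13, 10.78),
}


def supply_power(voltage: float, current: float) -> float:
    """Electrical power drawn from the DC supply, P = V * I."""
    return voltage * current


def breakdown(name: str) -> tuple:
    """(DC supply power, remainder of the reported cooling power), in watts."""
    voltage, current, total = CONFIGS[name]
    p_supply = supply_power(voltage, current)
    return round(p_supply, 2), round(total - p_supply, 2)


if __name__ == "__main__":
    for name in CONFIGS:
        p_supply, rest = breakdown(name)
        print(f"{name}: supply {p_supply:.2f} W, remainder {rest:.2f} W")
```

The remainders come out at about 15 W, 13 W, 13 W, and 10 W for Cfg1 through Cfg4.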



# HMC Communication II

Two or four full-duplex external links:

- Width of 8 or 16 lanes
- Configurable speeds of 10, 12.5, and 15 Gbps

Peak bandwidth of the evaluated system (2 links, 8 lanes each, 15 Gbps per lane):

$$\begin{aligned} \text{BW}_{\text{peak}} &= 2 \text{ links} \times 8 \text{ lanes/link} \times 15 \text{ Gbps/lane} \times 2 \text{ (full duplex)} \\ &= 480 \text{ Gbps} = 60 \text{ GB/s}. \end{aligned}$$
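The same calculation can be repeated for any of the link configurations listed above. The following Python sketch is only a convenience for that arithmetic; the function names are illustrative.

```python
def peak_bandwidth_gbps(links: int, lanes_per_link: int, lane_gbps: float,
                        full_duplex: bool = True) -> float:
    """Aggregate raw link bandwidth in Gbps, counting both directions if full duplex."""
    directions = 2 if full_duplex else 1
    return links * lanes_per_link * lane_gbps * directions


def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8


if __name__ == "__main__":
    evaluated = peak_bandwidth_gbps(links=2, lanes_per_link=8, lane_gbps=15)
    print(evaluated, "Gbps =", gbps_to_gbytes_per_s(evaluated), "GB/s")

    largest = peak_bandwidth_gbps(links=4, lanes_per_link=16, lane_gbps=15)
    print(largest, "Gbps =", gbps_to_gbytes_per_s(largest), "GB/s")
```

These are raw link rates; since every packet carries one flit of overhead, the data bandwidth an application can observe is lower.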



# Address Mapping






# Bandwidth II







# Latency-Bandwidth II






# Latency-Bandwidth III

