



# 3D-MAPS: 3D Massively Parallel Processor With Stacked Memory

IEEE ISSCC 2012 Presentation

---

Dae Hyun Kim<sup>1</sup>, Krit Athikulwongse<sup>1</sup>, Michael B. Healy<sup>1</sup>, Mohammad M. Hossain<sup>1</sup>, Moongon Jung<sup>1</sup>, Ilya Khorosh<sup>1</sup>, Gokul Kumar<sup>1</sup>, Young-Joon Lee<sup>1</sup>, Dean L. Lewis<sup>1</sup>, Tzu-Wei Lin<sup>1</sup>, Chang Liu<sup>1</sup>, Shreepad Panth<sup>1</sup>, Mohit Pathak<sup>1</sup>, Minzhen Ren<sup>1</sup>, Guanhao Shen<sup>1</sup>, Taigon Song<sup>1</sup>, Dong Hyuk Woo<sup>1</sup>, Xin Zhao<sup>1</sup>, Joung Ho Kim<sup>2</sup>, Ho Choi<sup>3</sup>, Gabriel H. Loh<sup>1</sup>, Hsien-Hsin S. Lee<sup>1</sup>, and Sung Kyu Lim<sup>1</sup>

<sup>1</sup> Georgia Institute of Technology, Atlanta, USA

<sup>2</sup> Korea Advanced Institute of Science and Technology, Daejeon, Korea

<sup>3</sup> Amkor Technology, Seoul, Korea



# Agenda


---

- Objective and Overview
- TSV and Stacking Technology
- Design
  - Architecture, layouts, and design analysis
- Testing
  - Die photos, package, board, and testing infrastructure
- Measurement Results
- Ongoing Work
- Conclusions

# Objective


- Papers on TSV modeling and manufacturing: **many**
- Papers on CAD tools: **some**
- Papers on architecture and application: **few**
- Papers on test chips: **few**
  - Neuromorphic vision chip, Tohoku Univ [ISSCC'01]
  - Inductive coupling, Keio Univ [ISSCC'08]
  - DDR3 DRAM, Samsung [ISSCC'09]
  - Design-for-Reliability, IMEC [ISSCC'10]
  - Wide-I/O DRAM, Samsung [ISSCC'11]
- **Objective: build the first general-purpose many-core 3D processor**

# 3D-MAPS: An Overview




- 3D **MA**ssively **P**arallel processor with **S**tacked memory
- 130nm GLOBALFOUNDRIES + Tezzaron F2F bonding
- 64 cores, 5-stage/2-way VLIW architecture
- 256KB SRAM, 1-cycle access
- 5mm X 5mm, 230 IO cells
- 277MHz Fmax, 1.5V Vdd
- **64GB/s memory BW @ 4W**

- TSV: 50K used for IO & dummy
- TSV: 1.2um diameter, 5um pitch
- F2F: 50K used for memory access
- F2F: 3.4um diameter, 5um pitch



# Tezzaron 3D Stack-up


- 2 logic tiers, face-to-face bonded
  - Top die thinned to 12um, bottom die is 765um
  - GLOBALFOUNDRIES 130nm technology + Artisan library/IP



# 3D-MAPS Core Architecture


- 2-issue (memory/ALU), 5-stage VLIW
  - Single-cycle memory access, issued every cycle



# V1 Full-die Layouts




64 cores + 235 IO cells (on periphery)

core-to-core wires



64 SRAM memory tiles (64 x 4KB)

# Face-to-face Via Usage


- Spec: 3.4um diameter, 5um pitch, negligible RC
  - Usage: 64 for signal, 684 for P/G per core
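
As a rough cross-check of the per-core figures against the chip-level F2F count quoted in the overview (about 50K), the arithmetic can be sketched as below. The function name `f2f_total` is hypothetical, and the sketch assumes every F2F via belongs to one of the 64 cores.

```c
#include <assert.h>

/* Total face-to-face vias: per-core signal and P/G vias times the core count.
 * Assumption: all F2F vias are accounted for per core. */
long f2f_total(long cores, long signal_per_core, long pg_per_core)
{
    assert(cores > 0 && signal_per_core >= 0 && pg_per_core >= 0);
    return cores * (signal_per_core + pg_per_core);
}
```

With the slide's numbers, `f2f_total(64, 64, 684)` gives 47,872, in line with the ~50K F2F connections.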



# Through-Silicon-Via Usage


- Spec: **1.2um diameter, 5um pitch, R = 0.6ohm, C = 3fF**
  - Usage: mainly in IO cells
  - 204 redundant TSVs in each IO cell
  - 53 dummy TSVs per core
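
A back-of-the-envelope tally of these per-cell and per-core counts can be sketched as follows. The function name `tsv_total` is hypothetical, and the sketch assumes that IO-cell TSVs and per-core dummy TSVs are the only contributors. The IO-cell count is left as a parameter because the slides quote both 230 and 235.

```c
#include <assert.h>

/* Total TSVs: redundant TSVs in the IO cells plus dummy TSVs in the cores.
 * Assumption: no other TSV usage contributes to the total. */
long tsv_total(long io_cells, long tsv_per_io_cell, long cores, long dummy_per_core)
{
    assert(io_cells >= 0 && tsv_per_io_cell >= 0);
    assert(cores >= 0 && dummy_per_core >= 0);
    return io_cells * tsv_per_io_cell + cores * dummy_per_core;
}
```

`tsv_total(230, 204, 64, 53)` gives 50,312 and `tsv_total(235, 204, 64, 53)` gives 51,332; both are close to the ~50K TSVs quoted in the overview.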



IO cells along the periphery



$12 \times 17 = 204$ P/G TSV array

IO cell (zoom-in)

# Timing Closure and Power Delivery




buffers and gates in between cores



P/G rings for the cores



P/G rings for SRAM tiles



decap cells attached to P/G rings

# 3D CAD Tools and Methodologies


- Commercial 3D tools are **NOT** available
- We started with 2D Tools and added scripts & plug-ins
  - 3D layout construction: Encounter
  - 3D timing optimization: Encounter + PrimeTime
  - 3D timing and SI analysis: CeltIC + PrimeTime
  - 3D power analysis: ModelSim + Encounter
  - 3D clock analysis: Encounter + SPICE
  - 3D IR-drop analysis: VoltageStorm
  - 3D thermal analysis: ANSYS + Fluent
  - 3D DRC/LVS: Calibre
- Used to design V1, V2, and more

# 3D Static Timing Analysis with SI




# 3D Timing Analysis


- Our worst-case path has 3.6ns delay, so Fmax = 277MHz
  - RF-to-memory write path: stage 2/3 FF – MUX – ADD – MUX – DMEM\_ADDR
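
The step from critical-path delay to Fmax is the reciprocal of the delay. A minimal sketch, with the hypothetical helper name `fmax_mhz`:

```c
#include <assert.h>

/* Maximum clock frequency in MHz for a critical-path delay given in ns. */
double fmax_mhz(double critical_path_ns)
{
    assert(critical_path_ns > 0.0);
    return 1000.0 / critical_path_ns;  /* 1/ns is GHz; x1000 converts to MHz */
}
```

`fmax_mhz(3.6)` evaluates to about 277.8, which the slide reports as 277MHz.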



# 3D Signal Integrity Analysis


- We analyze both 2D and 3D nets
  - Noise on all nets is below 500mV: the 5um F2F pitch was enough



# 3D IR-drop Analysis


- Can handle di/dt noise as well



# 3D IR-drop Analysis


- Single-core: clock buffers are power hungry (60mV)
- 64-core: cores in the middle experience more IR-drop (78mV)
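
To put these drops in proportion to the 1.5V supply, a small sketch (the helper name `ir_drop_percent` is hypothetical):

```c
#include <assert.h>

/* IR drop expressed as a percentage of the supply voltage. */
double ir_drop_percent(double drop_mv, double vdd_v)
{
    assert(vdd_v > 0.0);
    return drop_mv / (vdd_v * 1000.0) * 100.0;
}
```

At Vdd = 1.5V, the 60mV single-core drop is 4.0% of the supply and the 78mV 64-core drop is 5.2%.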



# DFT Infrastructure


- 64 cores split into 4 sectors, tested independently
  - Scan IO pins located on one side
  - Testing circuitry sitting in between the cores



# 3D-MAPS Die Photos




die photo, backside of core (= thinned) die



whole die, dummy TSVs, IO cells on periphery

# SEM Images




single IO cell, TSVs, BEOL of core die



single TSV and its landing pads

# IR Images




I/O cells and cores, IR image w/ 6um depth



I/O cells ESD circuit, IR image w/ 6um depth

# Amkor Packaging


15mm X 15mm



# Testing Infrastructure




Xilinx ML605



Agilent 16804A

# Sample Bit Stream: 3D Interface Test


- Data memory R/W works: **TSVs and F2Fs work**



# Programming Environment


- No OS/compiler yet

```
// histogram 64-core version
#include <stdio.h>

int main(int argc, char *argv[])
{
    if ((argc != 2) && (argc != 3)) {
        printf("Usage: %s <input>\n", argv[0]);
        return 0;
    }

    // one bin per possible byte value
    int histogram[256], i;
    for (i = 0; i < 256; i++)
        histogram[i] = 0;

    FILE *input;
    if ((input = fopen(argv[1], "r")) == NULL) {
        printf("%s does not exist\n", argv[1]);
        return 0;
    }

    // read one byte at a time and count it
    int c;
    while ((c = fgetc(input)) != EOF)
        histogram[c]++;
    fclose(input);
    return 0;
}
// .....
```

```

movi $r21, WEST
movi $r1, 0
movi $r2, 512

FORWARD_COUNTER_LEFT:
    beq $r1, $r2, DONE
    BARRIER

    LW_I $r7, $r1, 0
    movi $r18, 0

    CASCADE_LEFT:
        beq $r18, $r29, DONE_CASCADE_LEFT
        SW_BUF $r7, $r21
        LW_BUF $r6, $r20
        LW_I $r5, $r1, 0
        add $r7, $r5, $r6
        addi $r18, $r18, 1
        jmp CASCADE_LEFT

    DONE_CASCADE_LEFT:
        bne $r31, $r0, AVOID_MEM_UP
        SW_I $r7, $r1, 0

    AVOID_MEM_UP:
        addi $r1, $r1, 4
        jmp FORWARD_COUNTER_LEFT

```
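
The assembly excerpt passes values between neighbouring cores (`SW_BUF`/`LW_BUF`) and adds them to local data, which reads like a reduction of per-core partial counts. The sketch below illustrates that general pattern in plain C: private histograms per core, then a merge. It is an assumption about the structure of the kernel, not the authors' code. It runs the "cores" sequentially and does not model the 3D-MAPS ISA or its core-to-core buffers. The names `local_histogram` and `parallel_histogram` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BINS 256

/* Count byte values in one core's slice of the input into a private histogram. */
void local_histogram(const unsigned char *data, size_t len,
                     unsigned int bins[NUM_BINS])
{
    for (size_t b = 0; b < NUM_BINS; b++)
        bins[b] = 0;
    for (size_t i = 0; i < len; i++)
        bins[data[i]]++;
}

/* Split the input across `cores` workers (simulated one after another),
 * then merge each private histogram into the shared result. */
void parallel_histogram(const unsigned char *data, size_t len, size_t cores,
                        unsigned int out[NUM_BINS])
{
    assert(cores > 0);
    for (size_t b = 0; b < NUM_BINS; b++)
        out[b] = 0;

    size_t chunk = (len + cores - 1) / cores;  /* ceiling division */
    for (size_t c = 0; c < cores; c++) {
        size_t start = c * chunk;
        if (start >= len)
            break;                             /* no data left for this core */
        size_t n = (len - start < chunk) ? len - start : chunk;

        unsigned int partial[NUM_BINS];
        local_histogram(data + start, n, partial);
        for (size_t b = 0; b < NUM_BINS; b++)  /* merge step */
            out[b] += partial[b];
    }
}
```

A 256-bin histogram of 4-byte counters occupies 1KB, so a private copy fits in one 4KB SRAM tile.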

# BW and Power Measurement


- 64-core version of apps written in assembly
  - 3D-MAPS V1 supports **42 integer instructions**
  - Max achievable BW is 277 MHz X 64 ch X 4 Bytes = **70.9 GB/s**
  - **Modern CPU + DDR3 BW: 25 to 30GB/s**

| Benchmark          | Memory bandwidth | Measured power |
|--------------------|------------------|----------------|
| AES encryption     | 49.5 GB/s        | 4.032 W        |
| Edge detection     | 15.6 GB/s        | 3.768 W        |
| Histogram          | 30.3 GB/s        | 3.588 W        |
| K-means clustering | 40.6 GB/s        | 4.014 W        |
| Matrix multiply    | 13.8 GB/s        | 3.789 W        |
| Median filter      | **63.8 GB/s**    | **4.007 W**    |
| Motion estimation  | 24.1 GB/s        | 3.830 W        |
| String search      | 8.9 GB/s         | 3.876 W        |
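
The peak-bandwidth figure and the fraction of it that each benchmark reaches can be reproduced with a short sketch. The helper names `peak_bw_gbs` and `bw_utilization` are hypothetical. GB is taken as 10^9 bytes, which matches the 70.9 GB/s on the slide.

```c
#include <assert.h>

/* Peak memory bandwidth in GB/s, assuming one access per channel per cycle. */
double peak_bw_gbs(double freq_mhz, int channels, int bytes_per_access)
{
    assert(freq_mhz > 0.0 && channels > 0 && bytes_per_access > 0);
    return freq_mhz * 1e6 * channels * bytes_per_access / 1e9;
}

/* Fraction of the peak bandwidth that a measured bandwidth reaches. */
double bw_utilization(double measured_gbs, double peak_gbs)
{
    assert(peak_gbs > 0.0);
    return measured_gbs / peak_gbs;
}
```

`peak_bw_gbs(277.0, 64, 4)` gives about 70.9 GB/s. Against that peak, the median filter (63.8 GB/s) reaches roughly 90% and string search (8.9 GB/s) roughly 13%.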

# Frequency and Voltage Sweep


- Frequency vs power (voltage = 1.5V)
- Voltage vs power (frequency = 250MHz)



# Ongoing Work: 3D-MAPS V2


- Through MOSIS/Tezzaron 3D IC MPW (taped out: Oct 2011)

|                 | 3D-MAPS V1          | 3D-MAPS V2                                       |
|-----------------|---------------------|--------------------------------------------------|
| # of tiers      | 2 (1 logic, 1 SRAM) | 5 (2 logic, 3 DRAM)                              |
| # of cores      | 64                  | 128                                              |
| Memory capacity | 256KB SRAM          | 256MB DRAM & 512KB SRAM                          |
| Logic footprint | 5mm X 5mm           | 10mm X 10mm                                      |
| DRAM footprint  | -                   | 20mm X 12mm                                      |
| Bonding style   | F2F                 | F2F and F2B                                      |
| TSV/F2F usage   | ~ 50K / ~50K        | ~ 150K / ~185K                                   |
| Memory access*  | 2048 bit/cycle SRAM | 4096 bit/cycle SRAM<br>2048 bit/cycle DRAM (DDR) |
| freq / power    | 277MHz / 4.0W       | 175MHz / 10.4W                                   |

\* Current wide-I/O allows 512 bit/cycle DRAM access
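
The SRAM access widths in the table are consistent with one 4-byte access per core per cycle. A minimal sketch follows; the helper name `bits_per_cycle` is hypothetical, and the 4-byte access width for V2 is an assumption carried over from V1.

```c
#include <assert.h>

/* Aggregate memory width in bits per cycle when every core
 * makes one access of `bytes_per_access` bytes each cycle. */
long bits_per_cycle(long cores, long bytes_per_access)
{
    assert(cores > 0 && bytes_per_access > 0);
    return cores * bytes_per_access * 8;
}
```

`bits_per_cycle(64, 4)` gives 2048 for V1 and `bits_per_cycle(128, 4)` gives 4096 for V2.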

# Stack Up Comparison


- TSV usage
  - 3D-MAPS V1: for I/O (204 redundant TSVs per IO cell)
  - 3D-MAPS V2: for I/O (204 redundant TSVs per IO cell) and DRAM access (9 redundant TSVs per landing pad)



# V2 Layouts


Top logic die



Bottom logic die



Single core and its scratchpad SRAM (4KB)

# Wide-I/O DRAM Interface




Each M1 landing pad has 9 redundant TSVs



# Conclusions


---

- 3D-MAPS V1
  - 64 general-purpose cores + stacked SRAM
  - Ran 8 parallel applications successfully
  - Achieved 64GB/s memory bandwidth @ 4W power
  - Developed CAD tools and methodologies
  - TSV used for I/O
- 3D-MAPS V2 (ongoing work)
  - 128 cores + stacked SRAM & DRAM cube
  - TSV used for I/O and DRAM access