



# 3D Integration



**Prof. Hsien-Hsin S. Lee**  
**Electrical and Computer Engineering**  
**Georgia Tech**

Sponsors:



**Georgia Institute  
of Technology**



# Moore's Law (a.k.a. Intel's Roadmap)



Source: Intel Corp.

[www.3D.gatech.edu](http://www.3D.gatech.edu)



# Wire Delay

Performance dominator: Interconnect





# 3D Integration 101



NetBurst (P4) Microarchitecture (Willamette)





# State of the Art vs. Emerging 3D ICs

Wire,  
wire,  
wire

- Length (latency and power)
- Density
- I/O bond (power)



| State of the art (wire-bonded) | Emerging 3D ICs                              |
|--------------------------------|----------------------------------------------|
| Overhang: 1 to 2 mm            | F2F vias: ~5-10µm (< 1 FO4)<br>TSV: ~20-50µm |
| Wirebond Pitch: 40-60 µm       | TSV Pitch: 1 to 10 µm                        |
| Wires around boundary          | TSVs on the entire surface                   |
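
The pitch and placement contrast above can be turned into a rough connection count. This is a minimal sketch, assuming a hypothetical square 5 mm die, a single row of bond pads along the perimeter, and TSVs on a uniform grid; the function names and the die size are illustrative and not taken from the slide.

```python
def wirebond_count(die_side_um: int, pitch_um: int) -> int:
    """Bond pads in one row along all four die edges (assumed layout)."""
    return (4 * die_side_um) // pitch_um


def tsv_count(die_side_um: int, pitch_um: int) -> int:
    """TSVs on a uniform grid covering the whole die surface (assumed layout)."""
    per_side = die_side_um // pitch_um
    return per_side * per_side


DIE_SIDE_UM = 5000  # hypothetical 5 mm x 5 mm die

for pitch in (40, 60):  # wirebond pitch range from the table
    print(f"wirebond pitch {pitch} um: {wirebond_count(DIE_SIDE_UM, pitch)} pads")
for pitch in (1, 10):   # TSV pitch range from the table
    print(f"TSV pitch {pitch} um: {tsv_count(DIE_SIDE_UM, pitch)} vias")
```

Under these assumptions the perimeter-limited wirebond count stays in the hundreds, while the area-limited TSV count reaches hundreds of thousands or more.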



# 3D Integrated Circuits

- More than Moore

Toshiba image  
8-tier 16Gb NAND flash (CSCM)



Source: Paul Sakuma/Associated Press



- What is brought to the table?
  - One more degree of freedom
  - Simplified Integration
  - Wire length
  - Via Density



- Opportunities from architects' perspective
  - Latency (e.g., shortened global wires)
  - Power consumption
  - Bandwidth
  - Flexibility
  - Heterogeneity





# Bonding Orientation (Die Stacking)



Face-to-Face (limited to 2 die tiers)



Face-to-Back (Scalable)



# 3D Integration Process (Face-to-Back)





# Application Trend of 3D Silicon Integration



# 3D Die-Stacked Architecture





# 3D Partitioning

- Application Level
  - RF on logic
  - Accelerators
- Component and Architectural Level
  - Memory-on-memory
  - Memory-on-logic
  - Optical interconnect on processor
- Micro-architectural and Circuits Level
  - Array folding
  - Bit slicing
- Technology Level
  - At CMOS devices





# Application Level Heterogeneous Stacking

- Heterogeneous Stacking (Type A)
  - Microprocessor
  - ICN / NoC
  - Analog
  - Power Regulator
  - DSP
  - Memory
  - RF IC
  - (Optical) I/O
- Heterogeneous Stacking (Type B)
  - 45nm
  - 90nm
- Smaller SoC form factor
- Economy of Scale





# Application Level 3D Partitioning



# POD (Parallel-On-Demand) — a 3D Acceleration Layer





# Application Level: Stacking Perf-Enhanced Tiers



## Stack

- Reservation Station
- Load/Store Queue
- L1 caches
- Reorder Buffer
- Instruction Fetch Queue
- Branch Target Buffer
- Branch Predictor
  - Two in a box
- Accelerators
- Co-processors

A solution for keeping NRE low when designing multiple 2D/3D processors



# Stacked Communication-Partitioned LSQ



- Make Die 2 a “Phantom” by faking signals




# Stacked Reservation Station



Two Issues:  
(1) Bus Length  
(2) Capacitive loading of comparators



**Segmented broadcast bus**

- Capacitive loading reduced
- Bus length still limits the clock frequency





# Stackable SRAM Structures





# Component Level 3D Partitioning



# Homogeneous Component-Level Stacking

- Stacked Memory Tiers
  - DRAM, SRAM, Flash
- Capacity-driven
- Cost reduction



Samsung WSP  
8-tier 16Gb NAND flash



# Component Level: DRAM-on-Processor

This interface only needs ~100+ bits,  
but die-stacking provides many 1000s of connections!



Conventional DRAM layer stacking



# Modified 3D-Stacked DRAM





# System Memory Stacking on Many-Core

Drop it in from above!



Hard enough to keep  
one core fed...



How do you keep  
1000 cores fed?

Courtesy: Gabe Loh's penguins



## 3D Offers

- Inter-die vias with
  - Very high density
  - Very short latency (< 1 FO4)
- 1 to 10 TB/sec or more bandwidth
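
To get a feel for where a TB/sec figure could come from, the sketch below relates via count, clock rate, and bandwidth. It assumes each via carries one bit per cycle and uses a hypothetical 1 GHz signalling clock; neither number comes from the slide.

```python
def tsv_bandwidth_bytes_per_s(num_vias: int, clock_hz: int, bits_per_via: int = 1) -> float:
    """Aggregate bandwidth if every via moves bits_per_via bits each cycle (assumption)."""
    return num_vias * bits_per_via * clock_hz / 8


def vias_needed(target_bytes_per_s: int, clock_hz: int) -> int:
    """Smallest via count that reaches the target at one bit per via per cycle."""
    target_bits = target_bytes_per_s * 8
    return (target_bits + clock_hz - 1) // clock_hz  # integer ceiling division


CLOCK_HZ = 10**9  # hypothetical 1 GHz inter-die clock

for tb in (1, 10):
    n = vias_needed(tb * 10**12, CLOCK_HZ)
    print(f"{tb} TB/s at 1 GHz needs about {n} data vias")
```

With these assumed numbers, the required via counts are in the thousands to tens of thousands, which a dense inter-die via array can plausibly supply.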



# 3D Network-On-Chip





## 3D NOC Improved Neighborhood Access Latency

- CPU
- Nodes within 1 hop
- Nodes within 2 hops
- Nodes within 3 hops
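
The neighborhood improvement can be illustrated by counting how many nodes lie within a given hop count of one node. This is a minimal sketch, assuming idealized, unbounded 2D and 3D meshes with one hop per link in every dimension; a real stack has only a few tiers, so the 3D numbers are an upper bound. The helper name is illustrative.

```python
from itertools import product


def nodes_within(hops: int, dims: int) -> int:
    """Count mesh nodes at Manhattan distance 1..hops from a centre node."""
    offsets = range(-hops, hops + 1)
    return sum(
        1
        for point in product(offsets, repeat=dims)
        if 0 < sum(abs(c) for c in point) <= hops
    )


for hops in (1, 2, 3):
    print(f"within {hops} hop(s): 2D mesh {nodes_within(hops, 2)}, "
          f"3D mesh {nodes_within(hops, 3)}")
```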



Courtesy of Yuan Xie at Penn State



# Microarchitectural Level 3D Partitioning



# Microarchitectural Level: Same Structure Stacking

- Structure Folding: SRAM Array Splitting (Wordline Split)

wordline



Repeating may be  
needed for long wires
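
One way to see why splitting the wordline helps: for an unrepeated wire, delay grows with the square of its length. The sketch below uses the common 0.38·R·C approximation for a distributed RC line; the per-micron resistance and capacitance and the wordline length are hypothetical placeholders, and gate loading and the inter-die via are ignored.

```python
def rc_wire_delay_s(length_um: float, r_ohm_per_um: float, c_f_per_um: float) -> float:
    """Approximate 50% delay of an unrepeated distributed RC wire."""
    total_r = r_ohm_per_um * length_um
    total_c = c_f_per_um * length_um
    return 0.38 * total_r * total_c


R_PER_UM = 1.0      # hypothetical wire resistance, ohm per micron
C_PER_UM = 0.2e-15  # hypothetical wire capacitance, farad per micron
FULL_LEN_UM = 1000  # hypothetical 2D wordline length

full = rc_wire_delay_s(FULL_LEN_UM, R_PER_UM, C_PER_UM)
half = rc_wire_delay_s(FULL_LEN_UM / 2, R_PER_UM, C_PER_UM)
print(f"2D wordline: {full * 1e12:.1f} ps, split wordline: {half * 1e12:.1f} ps")
print(f"ratio: {half / full:.2f}")
```

Halving the length cuts the wire term to roughly a quarter in this model, which is also why long wires are usually broken up with repeaters.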



6T SRAM cell

column select

sense amplifier





# Issues in Wordline Split



Courtesy of Kiran Puttaswamy (Intel) and Gabriel Loh




# Crossbar Folding



Connect(15,15)




- Column-split 3D crossbar
- Row-split and column-split can be applied simultaneously in a 4-layer stack

- Strictly non-blocking
- Lots of switches
- Very long critical path

- Row-split 3D crossbar
  - Add mux to route layers
  - Critical path reduced by 25%



# Perfect Shuffle MIN Folding





# Butterfly Shuffle MIN Folding



- Use “Bit-split” partition





# 3D ICN Comparison



16x16 perfect shuffle



16x16 Butterfly shuffle





# Circuits Level 3D Partitioning



# Bit-Sliced Partitioning: Thermal Herding

ADD 89902+65539

(This end closer to heatsink)

Most of the time, only the one layer closest to the heatsink is active

More active

These three layers active
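
The idea can be sketched by counting how many bit slices an addition actually touches. This is an illustration only: it assumes a datapath cut into 8-bit slices, one slice per layer, with the least significant slice on the layer closest to the heatsink. The slide does not state the slice width, so `slice_bits` is a hypothetical parameter.

```python
def active_layers(a: int, b: int, slice_bits: int = 8) -> int:
    """Number of bit-slice layers needed to hold both operands and their sum."""
    widest = max(a.bit_length(), b.bit_length(), (a + b).bit_length(), 1)
    return (widest + slice_bits - 1) // slice_bits  # ceiling division


# Narrow values stay on the layer nearest the heatsink.
print(active_layers(12, 30))

# The wider operands from the slide's ADD example need more layers.
print(active_layers(89902, 65539))
```

Under the assumed 8-bit slicing, small operands activate a single layer, while the slide's ADD operands spill into upper layers.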



Courtesy of Gabriel Loh at Georgia Tech



# Circuits-Level: Memory Port Splitting

4-port memory cell



Two-tier 4-port memory cell



Bottom

Top



$6.24k \mu\text{m}^2$



$6.24k \mu\text{m}^2$

~40% area reduction

- (1) Wire dominated
- (2) Area can roughly quadruple each time the port count doubles
- (3) Worsened latency and energy consumption

- (1) Reduced wire length
- (2) Reduced gate loading on wordline



# Circuits-Level: Memory Port Splitting



|                               | Operation | <b>2D RF</b> | <b>3D RF</b> | <b>3D/2D (%)</b> |
|-------------------------------|-----------|--------------|--------------|------------------|
| Area ( $\mu\text{m}^2$ )      |           | 20.3k        | 12.5k        | 61%              |
| Footprint ( $\mu\text{m}^2$ ) |           | 20.3k        | 6.24k        | 31%              |
| Delay (ps)                    | Read '0'  | 1401         | 1043         |                  |
|                               | Read '1'  | 1407         | 1050         |                  |
|                               | Write '0' | 520          | 308          |                  |
|                               | Write '1' | 1381         | 735          |                  |
| Energy (pJ)                   | Read '0'  | 0.149        | 0.126        |                  |
|                               | Read '1'  | 0.149        | 0.127        |                  |
|                               | Write '0' | 2.342        | 1.704        |                  |
|                               | Write '1' | 2.342        | 1.710        |                  |
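
The 3D-to-2D ratios for every row of the table can be recomputed from the listed values. The sketch below is plain arithmetic on the table entries; the dictionary layout and function name are illustrative, and the results may differ from the slide's percentages in the last digit because of rounding.

```python
def ratio_percent(two_d: float, three_d: float) -> float:
    """3D value expressed as a percentage of the 2D value."""
    return 100.0 * three_d / two_d


# (2D RF, 3D RF) pairs copied from the table
register_file = {
    "area (um^2)": (20.3e3, 12.5e3),
    "footprint (um^2)": (20.3e3, 6.24e3),
    "delay read '0' (ps)": (1401, 1043),
    "delay read '1' (ps)": (1407, 1050),
    "delay write '0' (ps)": (520, 308),
    "delay write '1' (ps)": (1381, 735),
    "energy read '0' (pJ)": (0.149, 0.126),
    "energy read '1' (pJ)": (0.149, 0.127),
    "energy write '0' (pJ)": (2.342, 1.704),
    "energy write '1' (pJ)": (2.342, 1.710),
}

for name, (two_d, three_d) in register_file.items():
    print(f"{name:24s} 3D/2D = {ratio_percent(two_d, three_d):5.1f}%")
```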



- (1) Reduced wire length
- (2) Reduced gate loading on wordline



# Architecture Revisit for 3D Integration



# Revisiting Prior Architectural “False Truth”

- Common wisdom

*Bandwidth problems can be cured with money, latency problems are harder because the speed of light is fixed; you can't bribe God.*

- We challenge this: use bandwidth to fix "latency"
- TSVs
  - Are fast (< 1 FO4)
  - Are high density
  - Can eliminate trailing edge problems



# Cache Line Size Impact (MPKI)





# SMART-3D: Optimized Memory Architecture



TSVs could deliver 100s to 1000s of GB/sec of bandwidth.

TSVs are not fully utilized by single-threaded apps.

Redesign the memory hierarchy to use TSVs for latency hiding.

Key insight: fetch data into the cache at page granularity (e.g., 4KB).

Use TSV bandwidth to ameliorate memory latency issues.
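
A rough sense of why page-granularity fetches become practical over TSVs comes from counting bus cycles per transfer. This is a minimal sketch: the 64-bit conventional bus and the 8192-bit TSV bus are hypothetical widths chosen for illustration and are not SMART-3D parameters.

```python
def transfer_cycles(num_bytes: int, bus_bits: int) -> int:
    """Bus cycles needed to move num_bytes over a bus that is bus_bits wide."""
    total_bits = num_bytes * 8
    return (total_bits + bus_bits - 1) // bus_bits  # ceiling division


PAGE_BYTES = 4096        # page granularity from the slide (4KB)
NARROW_BUS_BITS = 64     # hypothetical off-chip data bus width
TSV_BUS_BITS = 8192      # hypothetical wide TSV bus width

print("4KB page over narrow bus:", transfer_cycles(PAGE_BYTES, NARROW_BUS_BITS), "cycles")
print("4KB page over TSV bus:   ", transfer_cycles(PAGE_BYTES, TSV_BUS_BITS), "cycles")
```

With a bus this wide, moving a whole page costs only a handful of cycles, so the extra data acts as a prefetch rather than a bandwidth burden.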





# 3D Many-Core Prototyping



# 3D Many-Core Architecture Prototyping





# Objectives of Custom Design



## 3D-MAPS Specifications

Footprint = 5mm x 5mm  
 Process technology: 130nm Chartered  
 TSV and 3D stacking: Tezzaron  
 Number of cores = 64 (5-stage, 2-way VLIW, in-order)  
 Clock frequency: 455MHz  
 Supply voltage: 1.5V  
 Core tier power consumption: 1.5W to 3W  
 Core-to-core communication: 2D-mesh

Memory model: dedicated 4KB SRAM per core  
 Memory access: 4 or 1 Byte/cycle  
 Core-to-memory communication: F2F-via/wires

TSV dimension: 1.2µm width, 6µm height  
 TSV density: 1000 x 1000 with 5µm pitch  
 TSV used: 50 x 260 for IO cells  
 F2F-via dimension: 3.4µm width, 1.8µm height  
 F2F-via density: 1000 x 1000 with 5µm pitch  
 F2F-via used: 116 x 64
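
The via numbers above imply that only a small share of the available sites is used. The sketch below does that arithmetic; reading "50 x 260" and "116 x 64" as products (and "1000 x 1000" as the number of available sites) is an assumption about the slide's notation.

```python
AVAILABLE_SITES = 1000 * 1000   # 1000 x 1000 grid at 5 um pitch

tsv_used = 50 * 260             # assumed to mean 50 x 260 TSVs for IO cells
f2f_used = 116 * 64             # assumed to mean 116 vias per core x 64 cores


def utilization_percent(used: int, available: int) -> float:
    """Share of available via sites that are actually used."""
    return 100.0 * used / available


print(f"TSVs used: {tsv_used} ({utilization_percent(tsv_used, AVAILABLE_SITES):.2f}%)")
print(f"F2F vias used: {f2f_used} ({utilization_percent(f2f_used, AVAILABLE_SITES):.2f}%)")
```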

## Data Bandwidth (DBW)

| Benchmark             | DBW    | DBW(a)  | Speed-up |
|-----------------------|--------|---------|-------|
| string search         | 1.60Nf | 51GB/s  | 10.5x |
| matrix multiplication | 2.00Nf | 64GB/s  | 13.2x |
| AES standard          | 2.92Nf | 94GB/s  | 19.3x |
| histogram             | 4.00Nf | 128GB/s | 26.3x |
| sobel detector        | 2.00Nf | 64GB/s  | 13.2x |
| k-means               | 2.66Nf | 85GB/s  | 17.5x |
| median filter         | 2.18Nf | 70GB/s  | 14.4x |
| motion estimation     | 0.67Nf | 22GB/s  | 4.5x  |

DBW based on 500MHz, N=64

Speed-up based on 3D-MAPS @ 130nm vs Core i7 @ 45nm
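
The DBW(a) column appears to follow from the DBW column by substituting N = 64 and f = 500MHz. The sketch below checks this, assuming the DBW coefficient is in bytes per cycle per core (an interpretation, not stated on the slide). Small mismatches with the slide are expected wherever the printed coefficient was rounded.

```python
N_CORES = 64
FREQ_HZ = 500e6


def dbw_gb_per_s(coeff_bytes_per_cycle: float) -> float:
    """Aggregate data bandwidth in GB/s for a per-core, per-cycle coefficient."""
    return coeff_bytes_per_cycle * N_CORES * FREQ_HZ / 1e9


# (DBW coefficient, DBW(a) in GB/s as printed on the slide)
benchmarks = {
    "string search": (1.60, 51),
    "matrix multiplication": (2.00, 64),
    "AES standard": (2.92, 94),
    "histogram": (4.00, 128),
    "sobel detector": (2.00, 64),
    "k-means": (2.66, 85),
    "median filter": (2.18, 70),
    "motion estimation": (0.67, 22),
}

for name, (coeff, slide_value) in benchmarks.items():
    print(f"{name:22s} computed {dbw_gb_per_s(coeff):6.1f} GB/s, slide {slide_value} GB/s")
```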



# Single Core Planning

- 116 F2F vias
  - 32-bit data in
  - 32-bit data out
  - 10-bit address
  - Other controls
  - Centrally located
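
The per-core via budget can be cross-checked with a little arithmetic. The sketch assumes the 10-bit address selects 32-bit words, which is an interpretation; under that assumption the addressable range matches the 4KB of SRAM per core given in the specifications.

```python
TOTAL_F2F_VIAS = 116
DATA_IN_BITS = 32
DATA_OUT_BITS = 32
ADDRESS_BITS = 10

# Vias left over for "other controls" after data and address signals.
control_vias = TOTAL_F2F_VIAS - (DATA_IN_BITS + DATA_OUT_BITS + ADDRESS_BITS)

# Assumption: the address is a word address and each word is 32 bits (4 bytes).
BYTES_PER_WORD = DATA_OUT_BITS // 8
addressable_bytes = (2 ** ADDRESS_BITS) * BYTES_PER_WORD

print(f"control vias: {control_vias}")
print(f"addressable data memory: {addressable_bytes} bytes")
```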



(Not drawn to scale)





# Single Core Flow

## Single Core

1. Area estimation
2. P/G routing
3. Placement
  - F2F/dummy vias
  - Gates
  - I-memory
  - Register file
4. Pre-CTS timing
  - Buffer insertion
  - Gate sizing
5. Clock tree routing
6. Post-CTS timing
7. Routing
  - Gate2gate
  - Gate2macros
  - Gate2F2Fvias
8. Post-route timing

## D-Mem tile

1. Layout
2. Characterization



## 3D sign-off analysis

1. Merge GDSII
2. Parasitic extraction
3. Timing analysis
4. Power analysis
5. Clock skew/slew analysis
6. IR-drop analysis
7. Coupling noise analysis
8. Thermal analysis
9. LVS and DRC





# Many-Core Flow

## Many-core Tier

1. Place I/O cells
2. Core characterization
3. P/G routing
4. Placement
  - Many cores
  - Gates
5. Pre-CTS timing
  - Buffer insertion
  - Gate sizing
6. Clock tree routing
7. Post-CTS timing
8. Routing
  - Core2core
  - Core2IOcells
  - Core2gate
9. Post-route timing

## D-Mem tile

1. Placement of tiles
2. P/G routing
3. Characterization



## 3D sign-off analysis

1. Merge GDSII
2. Parasitic extraction
3. Timing analysis
4. Power analysis
5. Clock skew/slew analysis
6. IR-drop analysis
7. Coupling noise analysis
8. Thermal analysis
9. LVS and DRC





# Multi-Functional Interconnect

- TSVs are not free; microfluidic channels are expensive
- Route Thermal, Power, and Signal Wires





# Typical Dimension of Routing Resources

Technology and setting parameters used in our experiments.

| Item                                           | Value        |
|------------------------------------------------|--------------|
| Number of dies                                 | 4            |
| Bonding type                                   | face-to-back |
| Die thickness ( $\mu m$ )                      | 150          |
| Bonding layer thickness ( $\mu m$ )            | 10           |
| TSV aspect ratio                               | 15:1         |
| Routing grid size ( $\mu m$ )                  | 50           |
| Signal TSV diameter ( $\mu m$ )                | 10           |
| Signal TSV minimum pitch ( $\mu m$ )           | 20           |
| P/G TSV diameter ( $\mu m$ )                   | 40           |
| P/G TSV pitch ( $\mu m$ )                      | 400          |
| P/G grid size ( $\mu m$ )                      | 200          |
| Microfluidic channel width ( $\mu m$ )         | 100          |
| Microfluidic channel pitch ( $\mu m$ )         | 200          |
| Microfluidic channel depth ( $\mu m$ )         | 100          |
| Microfluidic channel occupancy ratio ( $MFO$ ) | 0.5          |
| Target cell occupancy ratio ( $tCO$ )          | 0.25         |
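
Several entries in the table are related to each other, which a short calculation makes visible. The sketch below is a consistency check under stated assumptions: that the 15:1 aspect ratio refers to the signal TSV (die thickness over diameter), that $MFO$ is channel width over channel pitch, and that a 4-die stack has 3 bonding layers. These definitions are not given on the slide.

```python
params_um = {
    "num_dies": 4,
    "die_thickness": 150,
    "bonding_layer_thickness": 10,
    "signal_tsv_diameter": 10,
    "microfluidic_channel_width": 100,
    "microfluidic_channel_pitch": 200,
}

# Assumed definition: TSV depth equals die thickness.
signal_tsv_aspect = params_um["die_thickness"] / params_um["signal_tsv_diameter"]

# Assumed definition: occupancy is the fraction of each pitch taken by a channel.
mfo = params_um["microfluidic_channel_width"] / params_um["microfluidic_channel_pitch"]

# Assumed stack-up: one bonding layer between each pair of adjacent dies.
stack_height_um = (
    params_um["num_dies"] * params_um["die_thickness"]
    + (params_um["num_dies"] - 1) * params_um["bonding_layer_thickness"]
)

print(f"signal TSV aspect ratio: {signal_tsv_aspect:.0f}:1")
print(f"microfluidic channel occupancy: {mfo}")
print(f"total stack height: {stack_height_um} um")
```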





# 3D-IC Challenges and State-of-the-Art





# 3D Yield – Known Good Die (KGD) Issues

