

# **EE277: Data-Centric Computer Architecture**

Hung-Wei Tseng

**Briefly introduce who you are and  
one thing you would like to share**

# **Learning eXperience**

# Traditional lectures



Me

# Peer Instruction

You



# What kind of show is ours?



# Peer Instruction



Before the lecture

During the lecture

# Our lectures



Before the lecture

During the lecture

**How are “data” generated and  
processed in computer systems?**



# The “data path” in modern heterogeneous computers



# The processing power of “state-of-the-art” hardware



How fast can these  
processing units  
“compute” matrix  
multiplications? (say  
2K by 2K)

# The processing power of “state-of-the-art” hardware



**Input:**  
 $2 \times (2\text{K} \times 2\text{K}) \text{ matrices} = 2 \times 4\text{M} \text{ numbers}$  
 $= 2 \times 16\,\text{MB} = 32\,\text{MB}$ (at 4 bytes per number)
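The input-size arithmetic can be checked with a short sketch. It assumes 4-byte elements (for example, single-precision floats), which is what the 16 MB per matrix figure implies, and treats K and M as binary multiples. The function names are illustrative, and the roughly 2n³ operation count is the usual count for the textbook multiplication algorithm, not a number taken from the slide.

```python
def matrix_bytes(n: int, bytes_per_element: int = 4) -> int:
    """Size in bytes of one dense n-by-n matrix (4-byte elements assumed)."""
    return n * n * bytes_per_element


def matmul_flops(n: int) -> int:
    """Approximate floating-point operations for a textbook n-by-n matmul:
    n^3 multiplications plus about n^3 additions."""
    return 2 * n ** 3


N = 2 * 1024                       # "2K" taken as 2048
input_bytes = 2 * matrix_bytes(N)  # two input matrices

print(f"input size: {input_bytes / 2**20:.0f} MiB")
print(f"work:       {matmul_flops(N) / 1e9:.1f} GFLOP")
```

Dividing the input size by a measured runtime gives the data processing throughput of a device in bytes per second, which is the quantity the slide compares across processing units.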

**The data processing “throughput”:**



# The “data path” in modern heterogeneous computers



# How does the interconnect limitation affect performance?



# The evolution of technologies



# The total number of lanes from PCIe Root Complex is limited



<https://ark.intel.com/content/www/us/en/ark/products/134597/intel-core-i912900-processor-30m-cache-up-to-5-10-ghz.html>

**Figure 2. Example conventional storage server architecture with multiple NVMe SSDs.**



**What do you think is the most  
important next step to make data  
processing more efficient?**



**Figure 2. Example conventional storage server architecture with multiple NVMe SSDs.**



Jaeyoung Do, Sudipta Sengupta, and Steven Swanson. 2019. Programmable solid-state storage in future cloud datacenters. Commun. ACM 62, 6 (June 2019), 54–62.

# The mismatching internal/external bandwidth

## Flashtec™ NVMe2032 and NVMe2016 Controllers

32- and 16-Channel PCIe Flash Controller Pro

### Summary

The Flashtec™ 2nd generation NVMe Controller Family enables the world's first 32 and 16 channel data centers to realize the highest performance SSDs utilizing next-generation NVMe and PCIe Gen 3 technologies. Combining world-class capacity and flexibility, the Flashtec NVMe2032 and NVMe2016 controllers are the most reliable choice. The Flashtec NVMe2032 and NVMe2016 controllers support the PCIe Gen 3 x8 or dual independent PCIe Gen 3 x4 (active, active/standby) host interface and are optimized for high-performance operations, performing all Raed management operations on-chip and utilizing processing and memory resources.

### Features

- Flashtec NVMe2032 controller can achieve up to 1 million random read IOPS on 4 KB operations
- Up to 20 TB Flash capacity using 256 GB Flash
- SLC, MLC, Enterprise MLC, and TLC Flash with toggle and ONFI interface
- PCIe Gen 3 x8 or dual independent PCIe Gen 3 x4 (active, active/standby) host interface
- 16 and 32 independent Flash channels, each supporting up to 8 CE

Bottleneck!

Internal: ~500 MB/sec per channel × 16 channels = 8 GB/sec

External: PCIe Gen 3 x4 host interface = 4 GB/sec
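The bottleneck annotation can be reproduced with a small sketch. The per-channel rate (~500 MB/sec) and the x4 host-link rate (4 GB/sec) are the round numbers from the slide. The 32-channel case is an extrapolation using the same per-channel rate, and the function names and the 1 GB = 1000 MB convention are assumptions made for illustration.

```python
# Values read off the slide annotations; treated here as round estimates.
PER_CHANNEL_MB_S = 500   # "~500MB/sec" per flash channel
HOST_X4_GB_S = 4.0       # "x4 4GB/sec" host link


def internal_bandwidth_gb_s(channels: int) -> float:
    """Aggregate flash-channel bandwidth in GB/s (1 GB = 1000 MB assumed)."""
    return channels * PER_CHANNEL_MB_S / 1000


def oversubscription(channels: int, host_gb_s: float = HOST_X4_GB_S) -> float:
    """How many times the internal bandwidth exceeds the host link."""
    return internal_bandwidth_gb_s(channels) / host_gb_s


for channels in (16, 32):
    print(f"{channels} channels: "
          f"{internal_bandwidth_gb_s(channels):.0f} GB/s internal, "
          f"{oversubscription(channels):.0f}x the x4 host link")
```

Under these assumptions, at least half of the bandwidth the flash channels can deliver never reaches the host, which is the mismatch the slide highlights.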



- Organization
  - Page size x8: 18,592 bytes (16,384 + 2208 bytes)
  - Block size: 2304 pages (36,864K + 4968K bytes)
  - Plane size: 4 planes x 504 blocks
  - Device size: 512Gb: 2016 blocks; 1Tb: 4032 blocks; 2Tb: 8064 blocks; 4Tb: 16,128 blocks; 8Tb: 32,256 blocks
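The organization figures are internally consistent, which a few lines of arithmetic can confirm. This is a minimal sketch: the constants are copied from the list above, the variable names are illustrative, and "K" is assumed to mean 1024 bytes.

```python
# Geometry copied from the datasheet excerpt; names are illustrative.
PAGE_DATA_BYTES = 16_384
PAGE_SPARE_BYTES = 2_208
PAGES_PER_BLOCK = 2_304
PLANES = 4
BLOCKS_PER_PLANE = 504

page_total_bytes = PAGE_DATA_BYTES + PAGE_SPARE_BYTES
block_data_kib = PAGES_PER_BLOCK * PAGE_DATA_BYTES // 1024
block_spare_kib = PAGES_PER_BLOCK * PAGE_SPARE_BYTES // 1024
blocks_512gb = PLANES * BLOCKS_PER_PLANE

# Larger densities scale the block count linearly from the 512Gb device.
blocks_by_density_gb = {d: blocks_512gb * (d // 512)
                        for d in (512, 1024, 2048, 4096, 8192)}

print(f"page:  {page_total_bytes} bytes")
print(f"block: {block_data_kib}K + {block_spare_kib}K bytes")
print(f"blocks per device: {blocks_by_density_gb}")
```

The smallest unit a read returns is a 16 KB page, while a block spans 36 MB of data, which is useful to keep in mind when reading the timing figures.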

### TLC 512Gb-8Tb NAND - Array Characteristics

| Parameter                                      | Symbol      | Typ   | Max | Unit    | Notes    |
|------------------------------------------------|-------------|-------|-----|---------|----------|
| is suspended or ongoing                        | $t_R$       | 57/41 | 60  | $\mu s$ | 2, 12    |
| READ PAGE operation time without/with $V_{PP}$ | $t_R$       | 57/41 | 60  | $\mu s$ | 2, 12    |
| SNAP READ operation time                       | $t_{RSNAP}$ | 27    | 37  | $\mu s$ |          |
| Cache read busy time                           | $t_{RCBSY}$ | 11    | 60  | $\mu s$ | 2, 8, 12 |
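As a rough illustration of what these timings imply, the sketch below converts a page read time into an array read rate. It assumes one 16,384-byte data page is read per $t_R$ from a single plane, takes 57/41 as the values without/with $V_{PP}$, and ignores the time to transfer the page over the flash channel. The function name is hypothetical, and real devices overlap operations across planes and dies, so this is a simplified estimate rather than a datasheet figure.

```python
PAGE_DATA_BYTES = 16_384  # data portion of one page, from the organization list


def page_read_mb_per_s(t_r_us: float, page_bytes: int = PAGE_DATA_BYTES) -> float:
    """Array read rate in MB/s if one page is sensed every t_r_us microseconds.

    Bytes per microsecond is numerically equal to MB/s (1 MB = 10^6 bytes).
    """
    return page_bytes / t_r_us


for label, t_r in (("without V_PP", 57), ("with V_PP", 41), ("listed maximum", 60)):
    print(f"t_R {label}: {t_r} us -> {page_read_mb_per_s(t_r):.0f} MB/s per plane")
```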

# Overview of the course content

## Heterogeneous Computer Architecture

- Why heterogeneous architectures
- What are heterogeneous computing resources
- Identifying and modeling the bottlenecks

## In/Near-Storage Processing

- Why does in-storage processing (ISP) make sense?
- Hardware architectures
- Software optimizations
- Limitations

## In/Near-Memory Processing

- Why In-memory processing
- Hardware architectures
- Programming models
- Limitations

## Alternative Architectures/Models

- Dataflow architectures/programming models?
- Automata?

# Your tasks

- Discuss — participate in the discussion questions during the lectures
- Project or paper presentation
  - In the last 3 weeks, we will schedule presentations on papers or projects
  - Each paper/project presentation will be 40 minutes, including discussion and Q&A
  - You may present your current research project if it's relevant to this class
  - We can start discussing project ideas now! Send me an e-mail to set up a time!
    - Creating an “emulated” in-storage processing platform
    - Evaluating the trade-offs of applications in near-data processing
    - Accelerating applications through hardware/software co-design
    - Anything related to what we discussed in this class!

# Course resources

- Course webpage
  - <https://www.escalab.org/classes/eecs277-2022sp/>
- Discussion/Live
  - Discord — <https://discord.com/channels/956215712593092648>

# Instructor — Hung-Wei Tseng

- Website:  
<https://intra.engr.ucr.edu/~htseng/>
- Office hours by appointment
- E-mail: htseng@ucr.edu
- BS/MS in **Computer Science**,  
National Taiwan University
- PhD in **Computer Science**,  
University of California, San Diego
- Research Interests
  - Accelerating applications using AI/ML accelerators
  - Intelligent storage devices
  - Non-volatile memory based systems
  - Anything that could accelerate applications



# **Thank you!**