

# **Parallel Architectures & Programming on Parallel Architectures (2): Too many things**

Hung-Wei Tseng

# Recap: Wider-issue processors won't give you much more

| <b>Program</b> | <b>IPC</b> | <b>BP Rate %</b> | <b>I cache %MPCI</b> | <b>D cache %MPCI</b> | <b>L2 cache %MPCI</b> |
|----------------|------------|------------------|----------------------|----------------------|-----------------------|
| compress       | 0.9        | 85.9             | 0.0                  | 3.5                  | 1.0                   |
| eqntott        | 1.3        | 79.8             | 0.0                  | 0.8                  | 0.7                   |
| m88ksim        | 1.4        | 91.7             | 2.2                  | 0.4                  | 0.0                   |
| MPsim          | 0.8        | 78.7             | 5.1                  | 2.3                  | 2.3                   |
| applu          | 0.9        | 79.2             | 0.0                  | 2.0                  | 1.7                   |
| apsi           | 0.6        | 95.1             | 1.0                  | 4.1                  | 2.1                   |
| swim           | 0.9        | 99.7             | 0.0                  | 1.2                  | 1.2                   |
| tomcatv        | 0.8        | 99.6             | 0.0                  | 7.7                  | 2.2                   |
| pmake          | 1.0        | 86.2             | 2.3                  | 2.1                  | 0.4                   |

| <b>Program</b> | <b>IPC</b> | <b>BP Rate %</b> | <b>I cache %MPCI</b> | <b>D cache %MPCI</b> | <b>L2 cache %MPCI</b> |
|----------------|------------|------------------|----------------------|----------------------|-----------------------|
| compress       | 1.2        | 86.4             | 0.0                  | 3.9                  | 1.1                   |
| eqntott        | 1.8        | 80.0             | 0.0                  | 1.1                  | 1.1                   |
| m88ksim        | 2.3        | 92.6             | 0.1                  | 0.0                  | 0.0                   |
| MPsim          | 1.2        | 81.6             | 3.4                  | 1.7                  | 2.3                   |
| applu          | 1.7        | 79.7             | 0.0                  | 2.8                  | 2.8                   |
| apsi           | 1.2        | 95.6             | 0.2                  | 3.1                  | 2.6                   |
| swim           | 2.2        | 99.8             | 0.0                  | 2.3                  | 2.5                   |
| tomcatv        | 1.3        | 99.7             | 0.0                  | 4.2                  | 4.3                   |
| pmake          | 1.4        | 82.7             | 0.7                  | 1.0                  | 0.6                   |

Table 5 (first table above). Performance of a single 2-issue superscalar processor.

Table 6 (second table above). Performance of the 6-issue superscalar processor.

# Recap: Simultaneous Multithreading

① movq 8(%rdi), %rdi  
② addl \$1, %eax  
③ testq %rdi, %rdi  
④ jne .L3  
⑤ movq 8(%rdi), %rdi  
⑥ addl \$1, %eax  
⑦ testq %rdi, %rdi  
⑧ jne .L3  
⑨ movq 8(%rdi), %rdi  
⑩ addl \$1, %eax  
⑪ testq %rdi, %rdi  
⑫ jne .L3  
⑬ movq 8(%rdi), %rdi  
⑭ addl \$1, %eax  
⑮ testq %rdi, %rdi  
⑯ jne .L3  
⑰ movl (%rdi), %ecx

By scheduling another running program instance (thread), the processor has 0 wasted issue slots!



- ① movl (%rdi), %ecx
- ② addq \$4, %rdi
- ③ addl %ecx, %eax
- ④ cmpq %rdx, %rdi
- ⑤ jne .L3
- ⑥ movl (%rdi), %ecx
- ⑦ addq \$4, %rdi
- ⑧ addl %ecx, %eax
- ⑨ cmpq %rdx, %rdi
- ⑩ jne .L3
- ⑪ movl (%rdi), %ecx
- ⑫ addq \$4, %rdi
- ⑬ addl %ecx, %eax
- ⑭ cmpq %rdx, %rdi
- ⑮ jne .L3
- ⑯ movl (%rdi), %ecx

## Recap: Wide-issue SS processor v.s. multiple narrower-issue SS processors



# Recap: Intel Sandy Bridge



# Recap: AMD Ryzen




# Recap: What software thinks about “multiprogramming” hardware



# Recap: Coherency & Consistency

- Coherency — Guarantees all processors see the same value for a variable/memory address in the system when the processors need the value at the same time
  - What value should be seen
- Consistency — All threads see the change of data in the same order
  - When the memory operation should be done

# Recap: parallel programming is hard (1)

`volatile` prevents the compiler from keeping the variable `loop` in a register, so every iteration of the `while` loop rereads it from memory.

| thread 1 | thread 2 |
|----------|----------|
| <pre>volatile int loop;<br><br>int main()<br>{<br>    pthread_t thread;<br>    loop = 1;<br><br>    pthread_create(&amp;thread, NULL, modifyloop, NULL);<br>    while(loop == 1)<br>    {<br>        continue;<br>    }<br>    pthread_join(thread, NULL);<br>    fprintf(stderr,"User input: %d\n", loop);<br>    return 0;<br>}</pre> | <pre>void* modifyloop(void *x)<br>{<br>    sleep(1);<br>    printf("Please input a number:\n");<br>    scanf("%d",&amp;loop);<br>    return NULL;<br>}</pre> |
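As a hedged alternative to `volatile`, the same busy-wait can be sketched with a C11 atomic flag, which also stops the compiler from keeping the flag in a register and makes the cross-thread access well defined. The names `loop_flag`, `set_flag`, and `wait_for_flag` are illustrative, and the helper thread stores a fixed value instead of calling `scanf()` so the sketch runs without input.

```c
#include <pthread.h>
#include <stdatomic.h>

/* Shared flag. atomic_int (C11) makes every access an atomic load/store,
 * so the compiler cannot cache it in a register. */
static atomic_int loop_flag;

/* Hypothetical stand-in for modifyloop(): stores a fixed value instead of
 * reading one with scanf(). */
static void *set_flag(void *arg)
{
    int new_value = *(int *)arg;
    atomic_store(&loop_flag, new_value);
    return NULL;
}

/* Spins until the other thread changes the flag, then returns the new value.
 * new_value must differ from 1, otherwise the loop never ends. */
int wait_for_flag(int new_value)
{
    pthread_t thread;
    atomic_store(&loop_flag, 1);
    if (pthread_create(&thread, NULL, set_flag, &new_value) != 0)
        return -1;
    while (atomic_load(&loop_flag) == 1)
        continue;                     /* busy-wait, as in the original main() */
    pthread_join(thread, NULL);
    return atomic_load(&loop_flag);
}
```

Calling `wait_for_flag(7)` from `main()` should return 7 once the helper thread has run.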

# Recap: Cache coherency

- Assuming that we are running the following code on a CMP with a cache coherency protocol, how many of the following outputs are possible? (`a` is initialized to 0; assume the program prints more than 10 numbers.)

| thread 1                      | thread 2         |
|-------------------------------|------------------|
| while(1)<br>printf("%d ", a); | while(1)<br>a++; |

- Ⓐ 0123456789
- Ⓑ ~~1259368101213~~
- Ⓒ 1111111164100
- Ⓓ 1111111111100

A. 0  
B. 1  
C. 2  
D. 3  
E. 4

# Outline

- Parallel programming — is not easy
- The era of dark silicon

# Cache coherency



# Performance comparison

- Compare the two implementations of `threaded_vadd`, L and R. Which one performs better, and why?

## Version L

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid;i<ARRAY_SIZE;i+=NUM_OF_THREADS)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```

## Version R

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid*(ARRAY_SIZE/NUM_OF_THREADS);i<(tid+1)*(ARRAY_SIZE/NUM_OF_THREADS);i++)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```

- A. L is better, because the cache miss rate is lower
- B. R is better, because the cache miss rate is lower
- C. L is better, because the instruction count is lower
- D. R is better, because the instruction count is lower
- E. Both are about the same

### FalseSharing

## Main thread

```
for(i = 0 ; i < NUM_OF_THREADS ; i++)
{
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]); // each thread gets the address of its own id
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    pthread_join(thread[i], NULL);
```

# L v.s. R

## Version L

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid;i<ARRAY_SIZE;i+=NUM_OF_THREADS)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```



## Version R

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid*(ARRAY_SIZE/NUM_OF_THREADS);i<(tid+1)*(ARRAY_SIZE/NUM_OF_THREADS);i++)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```



# 4Cs of cache misses

- 3Cs:
  - Compulsory, Conflict, Capacity
- Coherency miss:
  - A miss on a “block” that was invalidated because of sharing among processors.

# False sharing

- True sharing
  - Processor A modifies X, and processor B also wants to access X.
- False sharing
  - Processor A modifies X, and processor B wants to access Y. However, B's copy of Y is invalidated because X and Y are in the same block!
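A minimal harness for experimenting with false sharing, using the two `threaded_vadd` variants from this lecture. `ARRAY_SIZE`, `NUM_OF_THREADS`, the input values, and the names `vadd_interleaved`, `vadd_blocked`, and `run_vadd` are assumptions made for this sketch. In the interleaved version, neighbouring elements of `c[]` are written by different threads; assuming 64-byte blocks and 4-byte `int`s, 16 consecutive elements share one block.

```c
#include <pthread.h>

#define ARRAY_SIZE     (1 << 20)  /* assumed size, a multiple of NUM_OF_THREADS */
#define NUM_OF_THREADS 4          /* assumed thread count */

static int a[ARRAY_SIZE], b[ARRAY_SIZE], c[ARRAY_SIZE];

/* Version L: thread t handles elements t, t+N, t+2N, ... so neighbouring
 * elements of c[] (same cache block) are written by different threads. */
static void *vadd_interleaved(void *thread_id)
{
    int tid = *(int *)thread_id;
    for (int i = tid; i < ARRAY_SIZE; i += NUM_OF_THREADS)
        c[i] = a[i] + b[i];
    return NULL;
}

/* Version R: thread t owns one contiguous chunk, so threads share a block
 * only at chunk boundaries. */
static void *vadd_blocked(void *thread_id)
{
    int tid = *(int *)thread_id;
    int chunk = ARRAY_SIZE / NUM_OF_THREADS;
    for (int i = tid * chunk; i < (tid + 1) * chunk; i++)
        c[i] = a[i] + b[i];
    return NULL;
}

/* Runs one version on all threads; returns the number of wrong elements. */
int run_vadd(int interleaved)
{
    pthread_t thread[NUM_OF_THREADS];
    int tids[NUM_OF_THREADS];
    int errors = 0;

    for (int i = 0; i < ARRAY_SIZE; i++) {
        a[i] = i;
        b[i] = 2 * i;
        c[i] = -1;
    }
    for (int i = 0; i < NUM_OF_THREADS; i++) {
        tids[i] = i;
        /* pass &tids[i] so that each thread reads its own id */
        pthread_create(&thread[i], NULL,
                       interleaved ? vadd_interleaved : vadd_blocked, &tids[i]);
    }
    for (int i = 0; i < NUM_OF_THREADS; i++)
        pthread_join(thread[i], NULL);

    for (int i = 0; i < ARRAY_SIZE; i++)
        if (c[i] != a[i] + b[i])
            errors++;
    return errors;
}
```

Both `run_vadd(1)` and `run_vadd(0)` should return 0. Timing the two calls separately (for example with `clock_gettime`) is one way to observe the effect of coherency misses on a given machine.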

# Performance comparison

- Compare the two implementations of `threaded_vadd`, L and R. Which one performs better, and why?

## Version L

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid;i<ARRAY_SIZE;i+=NUM_OF_THREADS)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```

## Version R

```
void *threaded_vadd(void *thread_id)
{
    int tid = *(int *)thread_id;
    int i;
    for(i=tid*(ARRAY_SIZE/NUM_OF_THREADS);i<(tid+1)*(ARRAY_SIZE/NUM_OF_THREADS);i++)
    {
        c[i] = a[i] + b[i];
    }
    return NULL;
}
```

- A. L is better, because the cache miss rate is lower
- B. R is better, because the cache miss rate is lower
- C. L is better, because the instruction count is lower
- D. R is better, because the instruction count is lower
- E. Both are about the same

## Main thread

```
for(i = 0 ; i < NUM_OF_THREADS ; i++)
{
    tids[i] = i;
    pthread_create(&thread[i], NULL, threaded_vadd, &tids[i]); // each thread gets the address of its own id
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    pthread_join(thread[i], NULL);
```

# Again — how many values are possible?

- Consider the given program. You can safely assume the caches are coherent. How many of the following outputs can you see?
  - ① (0, 0)
  - ② (0, 1)
  - ③ (1, 0)
  - ④ (1, 1)

A. 0  
B. 1  
C. 2  
D. 3  
E. 4

```
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

volatile int a,b;
volatile int x,y;
volatile int f;
void* modifya(void *z) {
    a=1;
    x=b;
    return NULL;
}
void* modifyb(void *z) {
    b=1;
    y=a;
    return NULL;
}
```

```
int main() {
    int i;
    pthread_t thread[2];
    pthread_create(&thread[0], NULL, modifya, NULL);
    pthread_create(&thread[1], NULL, modifyb, NULL);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    fprintf(stderr, "(%d, %d)\n", x, y);
    return 0;
}
```

Consistency

# Possible scenarios

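| Thread 1 order | Thread 2 order | How it happens | Output (x, y) |
|----------------|----------------|----------------|---------------|
| `a=1;` `x=b;` | `b=1;` `y=a;` | Both stores finish before either load | (1,1) |
| `a=1;` `x=b;` | `b=1;` `y=a;` | Thread 1 finishes before thread 2 starts | (0,1) |
| `a=1;` `x=b;` | `b=1;` `y=a;` | Thread 2 finishes before thread 1 starts | (1,0) |
| `x=b;` `a=1;` | `y=a;` `b=1;` | OoO scheduling! Each load is performed before the store that precedes it in program order | (0,0) |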

# Why (0,0)?

- The processor or compiler may reorder your memory operations/instructions
  - A coherence protocol only orders updates to the same memory address
  - A processor may serve memory requests that hit in the cache ahead of earlier requests that miss
  - A compiler may keep values in registers and perform the memory operations later
- Processor cores may not all run at the same speed (cache misses, branch mispredictions, I/O, voltage scaling, etc.)
- A thread may not be scheduled right after it is spawned

# Again — how many values are possible?

- Consider the given program. You can safely assume the caches are coherent. How many of the following outputs can you see?
  - ① (0, 0)
  - ② (0, 1)
  - ③ (1, 0)
  - ④ (1, 1)
- A. 0
- B. 1
- C. 2
- D. 3
- E. 4

```
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <unistd.h>

volatile int a,b;
volatile int x,y;
volatile int f;
void* modifya(void *z) {
    a=1;
    x=b;
    return NULL;
}
void* modifyb(void *z) {
    b=1;
    y=a;
    return NULL;
}
```

```
int main() {
    int i;
    pthread_t thread[2];
    pthread_create(&thread[0], NULL, modifya, NULL);
    pthread_create(&thread[1], NULL, modifyb, NULL);
    pthread_join(thread[0], NULL);
    pthread_join(thread[1], NULL);
    fprintf(stderr, "(%d, %d)\n", x, y);
    return 0;
}
```

# fence instructions

- x86 provides an “mfence” instruction to prevent reordering across the fence instruction
  - All memory operations before `mfence` must complete before any memory operation after it can proceed
- x86 only supports a “relaxed consistency” model, not sequential consistency. You still have to be careful and verify that your code behaves as you expect
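A hedged sketch of how a fence rules out the (0, 0) outcome in the earlier two-thread example. It uses the portable C11 `atomic_thread_fence` rather than inline `mfence` assembly; on x86, compilers typically implement a sequentially consistent fence with `mfence` or an equivalent locked instruction. The helper `count_zero_zero` is an illustrative name, not part of the original program.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int a, b;  /* shared flags, as in the example program */
static int x, y;         /* each written by one thread, read after join */

static void *modifya(void *z)
{
    (void)z;
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* store to a ordered before load of b */
    x = atomic_load_explicit(&b, memory_order_relaxed);
    return NULL;
}

static void *modifyb(void *z)
{
    (void)z;
    atomic_store_explicit(&b, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* store to b ordered before load of a */
    y = atomic_load_explicit(&a, memory_order_relaxed);
    return NULL;
}

/* Runs the two threads `trials` times; returns how many runs ended in (0, 0). */
int count_zero_zero(int trials)
{
    int seen = 0;
    for (int t = 0; t < trials; t++) {
        pthread_t thread[2];
        atomic_store(&a, 0);
        atomic_store(&b, 0);
        pthread_create(&thread[0], NULL, modifya, NULL);
        pthread_create(&thread[1], NULL, modifyb, NULL);
        pthread_join(thread[0], NULL);
        pthread_join(thread[1], NULL);
        if (x == 0 && y == 0)
            seen++;
    }
    return seen;
}
```

With both fences in place, at least one thread must observe the other's store, so `count_zero_zero(1000)` should return 0. Removing the fences makes (0, 0) possible again, although it may still be rare in practice.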



# Take-aways of parallel programming

- Processor behaviors are non-deterministic
  - You cannot predict which processor runs faster
  - You cannot predict when the OS is going to schedule your thread
- Cache coherency only guarantees that everyone would eventually have a coherent view of data, but not when
- Cache consistency is hard to support

# **Power and Energy**

# Power & Energy

- Regarding power and energy, how many of the following statements are correct?
  - ① Lowering the power consumption helps reduce heat generation
  - ② Lowering the energy consumption helps reduce the electricity bill
  - ③ Lowering the power consumption helps extend the battery life
  - ④ A CPU with 10% utilization can still consume 33% of the peak power

A. 0  
B. 1  
C. 2  
D. 3  
E. 4



# Power v.s. Energy

- Power is the direct contributor of “heat”
  - Packaging of the chip
  - Heat dissipation cost
  - $\text{Power} = P_{dynamic} + P_{static}$
- $\text{Energy} = P \times ET$, where $ET$ is the execution time
  - The electricity bill and battery life are related to energy!
  - Lower power does not necessarily mean better battery life if the processor slows down the application too much

# **Dynamic Power**

# Dynamic/Active Power

- The power consumption due to the switching of transistor states
- Dynamic power of a chip with $N$ transistors

$$P_{dynamic} \sim \alpha \times C \times V^2 \times f \times N$$

- $\alpha$ : average switches per cycle
- $C$ : capacitance
- $V$ : voltage
- $f$ : frequency, usually linear with  $V$
- $N$ : the number of transistors

# Double Clock Rate or Double the # of Processors?

- Assume 60% of the application can be fully parallelized on 2 cores, or that the application speeds up linearly with clock rate. Should we **double the clock rate** or **duplicate a core**?

$$P_{dynamic} \sim \alpha \times C \times V^2 \times f \times N$$

$$Speedup_{parallel}(f_{parallelizable}, n) = \frac{1}{(1 - f_{parallelizable}) + \frac{f_{parallelizable}}{n}}$$

$$Speedup_{parallel}(60\%, 2) = \frac{1}{(1 - 60\%) + \frac{60\%}{2}} = 1.43$$

$$Power_{2-core} = 2 \times P_{baseline}$$

$$Energy_{2-core} = 2 \times P_{baseline} \times ET_{baseline} \times \frac{1}{1.43} = 1.40 \times Energy_{baseline}$$

$$Speedup_{2\times clock} = 2$$

$$Power_{2\times clock} = 2^3 \times P_{baseline} = 8 \times P_{baseline}$$

$$Energy_{2\times clock} = 2^3 \times P_{baseline} \times ET_{baseline} \times \frac{1}{2} = 4 \times P_{baseline} \times ET_{baseline} = 4 \times Energy_{baseline}$$
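The comparison above can be reproduced with a few lines of C. This is only a sketch of the slide's model: it assumes dynamic power dominates, voltage scales linearly with frequency, and a second core doubles power. The function names are illustrative.

```c
/* Amdahl's law: speedup when a fraction f of the work is spread over n cores. */
double speedup_parallel(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

/* Dynamic power relative to the baseline when the clock is scaled by k and
 * the voltage scales linearly with it: P ~ V^2 * f, so the factor is k^3. */
double relative_power_clock(double k)
{
    return k * k * k;
}

/* Energy relative to the baseline: relative power times relative execution
 * time, where relative execution time is 1 / speedup. */
double relative_energy(double relative_power, double speedup)
{
    return relative_power / speedup;
}
```

For example, `relative_energy(2.0, speedup_parallel(0.6, 2))` gives about 1.4 for the 2-core option, while `relative_energy(relative_power_clock(2.0), 2.0)` gives 4 for the doubled clock.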

# Dynamic voltage/frequency scaling

- Dynamically trade performance for lower power
  - Change the voltage and frequency at runtime
  - Under control of operating system — that's why updating iOS may slow down an old iPhone
- Recall: $P_{dynamic} \sim \alpha \times C \times V^2 \times f \times N$
  - Because $f \propto V$...
  - $P_{dynamic} \propto V^3$
- Reduce both  $V$  and  $f$  linearly
  - Cubic decrease in dynamic power
  - Linear decrease in performance (actually sub-linear)
    - Thus, only about quadratic in energy
  - Linear decrease in static power
    - Thus, only modest static energy improvement
  - Newer chips can do this on a per-core basis
    - `cat /proc/cpuinfo` in Linux shows the current clock rate of each core
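A minimal sketch of the scaling bullets above, assuming performance scales exactly linearly with $f$ (the slide notes it is actually sub-linear) and static power scales linearly with $V$. The struct and function names are made up for this illustration.

```c
/* Relative power, runtime, and energy when V and f are both scaled by s
 * (s = 0.5 means half the voltage and half the clock rate). */
struct dvfs_estimate {
    double dynamic_power;   /* ~ s^3 */
    double static_power;    /* ~ s   */
    double runtime;         /* ~ 1/s, assuming performance is linear in f */
    double dynamic_energy;  /* dynamic_power * runtime ~ s^2 */
    double static_energy;   /* static_power * runtime ~ 1 under this model */
};

struct dvfs_estimate dvfs_scale(double s)
{
    struct dvfs_estimate e;
    e.dynamic_power = s * s * s;
    e.static_power = s;
    e.runtime = 1.0 / s;
    e.dynamic_energy = e.dynamic_power * e.runtime;
    e.static_energy = e.static_power * e.runtime;
    return e;
}
```

With `s = 0.5`, dynamic power drops to 0.125 of the baseline and dynamic energy to 0.25, while static energy stays at about 1 in this simplified model, which is why the static energy improvement is modest at best.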

# Demo — changing the max frequency and performance

- Change the maximum frequency of the Intel processor; you learned how to do this when we discussed the programmer's impact on performance
- LIKWID: a profiling tool that provides power/energy information
  - `likwid-perfctr -g ENERGY [command_line]`
  - Let's try blockmm and popcorn and see what happens!

| Metric                    | Sum          | Min       | Max        | Avg          |
|---------------------------|--------------|-----------|------------|--------------|
| Runtime (RDTSC) [s] STAT  | 1.1772       | 0.1962    | 0.1962     | 0.1962       |
| Runtime unhalted [s] STAT | 38080.0461   | 0         | 38080.0460 | 6346.6743    |
| Clock [MHz] STAT          | 9.629741e+08 | 1697.5067 | 962966500  | 1.604957e+08 |
| CPI STAT                  | 17.7088      | 1         | 5.4991     | 2.9515       |
| Temperature [C] STAT      | 236          | 36        | 49         | 39.3333      |
| Energy [J] STAT           | 2.5281       | 0         | 2.5281     | 0.4213       |
| Power [W] STAT            | 12.8846      | 0         | 12.8846    | 2.1474       |
| Energy PP0 [J] STAT       | 2.3954       | 0         | 2.3954     | 0.3992       |
| Power PP0 [W] STAT        | 12.2080      | 0         | 12.2080    | 2.0347       |
| Energy PP1 [J] STAT       | 0            | 0         | 0          | 0            |
| Power PP1 [W] STAT        | 0            | 0         | 0          | 0            |
| Energy DRAM [J] STAT      | 0.2024       | 0         | 0.2024     | 0.0337       |
| Power DRAM [W] STAT       | 1.0315       | 0         | 1.0315     | 0.1719       |

| Metric                    | Sum          | Min      | Max        | Avg          |
|---------------------------|--------------|----------|------------|--------------|
| Runtime (RDTSC) [s] STAT  | 4.0692       | 0.6782   | 0.6782     | 0.6782       |
| Runtime unhalted [s] STAT | 38080.0031   | 0        | 38080.0030 | 6346.6672    |
| Clock [MHz] STAT          | 2.211432e+08 | 797.0617 | 221140000  | 3.685720e+07 |
| CPI STAT                  | 12.0339      | 1        | 4.4400     | 2.0057       |
| Temperature [C] STAT      | 213          | 35       | 36         | 35.5000      |
| Energy [J] STAT           | 1.4547       | 0        | 1.4547     | 0.2425       |
| Power [W] STAT            | 2.1450       | 0        | 2.1450     | 0.3575       |
| Energy PP0 [J] STAT       | 1.0040       | 0        | 1.0040     | 0.1673       |
| Power PP0 [W] STAT        | 1.4804       | 0        | 1.4804     | 0.2467       |
| Energy PP1 [J] STAT       | 0            | 0        | 0          | 0            |
| Power PP1 [W] STAT        | 0            | 0        | 0          | 0            |
| Energy DRAM [J] STAT      | 0.6870       | 0        | 0.6870     | 0.1145       |
| Power DRAM [W] STAT       | 1.0130       | 0        | 1.0130     | 0.1688       |

| Metric                    | Sum        | Min          | Max       | Avg       |
|---------------------------|------------|--------------|-----------|-----------|
| Runtime (RDTSC) [s] STAT  | 28.3404    | 4.7234       | 4.7234    | 4.7234    |
| Runtime unhalted [s] STAT | 5.8087     | 1.914906e-05 | 5.8083    | 0.9681    |
| Clock [MHz] STAT          | 20478.2138 | 2237.6941    | 4560.8206 | 3413.0356 |
| CPI STAT                  | 13.7354    | 0.2683       | 4.8856    | 2.2892    |
| Temperature [C] STAT      | 264        | 40           | 59        | 44        |
| Energy [J] STAT           | 106.6913   | 0            | 106.6913  | 17.7819   |
| Power [W] STAT            | 22.5877    | 0            | 22.5877   | 3.7646    |
| Energy PP0 [J] STAT       | 103.5564   | 0            | 103.5564  | 17.2594   |
| Power PP0 [W] STAT        | 21.9240    | 0            | 21.9240   | 3.6540    |
| Energy PP1 [J] STAT       | 0          | 0            | 0         | 0         |
| Power PP1 [W] STAT        | 0          | 0            | 0         | 0         |
| Energy DRAM [J] STAT      | 4.7322     | 0            | 4.7322    | 0.7887    |
| Power DRAM [W] STAT       | 1.0019     | 0            | 1.0019    | 0.1670    |

| Metric                    | Sum       | Min      | Max      | Avg      |
|---------------------------|-----------|----------|----------|----------|
| Runtime (RDTSC) [s] STAT  | 161.6694  | 26.9449  | 26.9449  | 26.9449  |
| Runtime unhalted [s] STAT | 5.8108    | 0.0002   | 5.8088   | 0.9685   |
| Clock [MHz] STAT          | 4788.5470 | 797.9943 | 798.2168 | 798.0912 |
| CPI STAT                  | 4.6770    | 0.2683   | 1.1603   | 0.7795   |
| Temperature [C] STAT      | 211       | 34       | 36       | 35.1667  |
| Energy [J] STAT           | 47.8532   | 0        | 47.8532  | 7.9755   |
| Power [W] STAT            | 1.7760    | 0        | 1.7760   | 0.2960   |
| Energy PP0 [J] STAT       | 29.9814   | 0        | 29.9814  | 4.9969   |
| Power PP0 [W] STAT        | 1.1127    | 0        | 1.1127   | 0.1855   |
| Energy PP1 [J] STAT       | 0         | 0        | 0        | 0        |
| Power PP1 [W] STAT        | 0         | 0        | 0        | 0        |
| Energy DRAM [J] STAT      | 26.9831   | 0        | 26.9831  | 4.4972   |
| Power DRAM [W] STAT       | 1.0014    | 0        | 1.0014   | 0.1669   |
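The `Energy [J]`, `Power [W]`, and `Runtime (RDTSC) [s]` rows in each table are tied together by $\text{Energy} = P \times ET$. A small sketch for checking this against the Max column; `energy_joules` is an illustrative helper, not a LIKWID API.

```c
/* Energy [J] = average power [W] x execution time [s]. */
double energy_joules(double power_watts, double runtime_seconds)
{
    return power_watts * runtime_seconds;
}
```

For the first table, `energy_joules(12.8846, 0.1962)` is about 2.528 J, and for the second, `energy_joules(2.1450, 0.6782)` is about 1.455 J, matching the reported energy. If these two tables are the same workload at two frequency caps, as they appear to be, power drops by about 6x but energy only by about 1.7x, because the runtime grows by about 3.5x.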

# Power & Energy

- Regarding power and energy, how many of the following statements are correct?
  - ① Lowering the power consumption helps reduce heat generation
  - ② Lowering the energy consumption helps reduce the electricity bill
  - ✗ ③ Lowering the power consumption helps extend the battery life
  - ④ A CPU with 10% utilization can still consume 33% of the peak power
- A. 0
- B. 1
- C. 2
- D. 3
- E. 4

# What happens if power doesn't scale with process technologies?

- Suppose we can cram more transistors into the same chip area (Moore's law continues), but the power consumption per transistor remains the same. If we put more transistors in the same area because the technology allows us to, how many of the following statements are true?

- ① The power consumption per chip will increase
- ② The power density of the chip will increase
- ③ Given the same power budget, we may not be able to power on all chip area if we maintain the same clock rate
- ④ Given the same power budget, we may have to lower the clock rate of circuits to power on all chip area

- A. 0
- B. 1
- C. 2
- D. 3
- E. 4



# **Dark Silicon and the End of Multicore Scaling**

**H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam and D. Burger**  
**University of Washington, University of Wisconsin—Madison, University of Texas at Austin,**  
**Microsoft Research**

# Static/Leakage Power

- The power consumption due to leakage: transistors do not turn off completely, even when they are not switching
- Becomes the **dominant** factor in the most advanced process technologies.

$$P_{leakage} \sim N \times V \times e^{-V_t}$$

- $N$ : number of transistors
- $V$ : voltage
- $V_t$ : threshold voltage where transistor conducts (begins to switch)
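As a quick worked consequence, taking the proportionality above at face value (with $V_t$ in whatever normalized units the formula implies): lowering the threshold voltage by $\Delta$ while holding $N$ and $V$ fixed multiplies leakage by $e^{\Delta}$.

$$\frac{P_{leakage}(V_t - \Delta)}{P_{leakage}(V_t)} = \frac{N \times V \times e^{-(V_t - \Delta)}}{N \times V \times e^{-V_t}} = e^{\Delta}$$

Under this simplified model, each reduction of the threshold voltage increases leakage exponentially, so scaling $V_t$ down together with $V$ becomes increasingly costly.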



Figure 1: Leakage power becomes a growing problem as demands for more performance and functionality drive chipmakers to nanometer-scale process nodes (Source: IBS).

# Dennardian Broken

- Given a scaling factor S

| Parameter                                   | Relation                   | Classical Scaling | Leakage Limited |
|---------------------------------------------|----------------------------|-------------------|-----------------|
| <b>Power Budget</b>                         |                            | 1                 | 1               |
| <b>Chip Size</b>                            |                            | 1                 | 1               |
| <b>Vdd (Supply Voltage)</b>                 |                            | 1/S               | 1               |
| <b>Vt (Threshold Voltage)</b>               | 1/S                        | 1/S               | 1               |
| <b>tox (oxide thickness)</b>                |                            | 1/S               | 1/S             |
| <b>W, L (transistor dimensions)</b>         |                            | 1/S               | 1/S             |
| <b>Cgate (gate capacitance)</b>             | $WL/t_{ox}$                | 1/S               | 1/S             |
| <b>I<sub>sat</sub> (saturation current)</b> | $WV_{dd}/t_{ox}$           | 1/S               | 1               |
| <b>F (device frequency)</b>                 | $I_{sat}/(C_{gate}V_{dd})$ | S                 | S               |
| <b>D (devices/area)</b>                     | $1/(WL)$                   | $S^2$             | $S^2$           |
| <b>p (device power)</b>                     | $I_{sat}V_{dd}$            | $1/S^2$           | 1               |
| <b>P (chip power)</b>                       | $D \times p$               | 1                 | $S^2$           |
| <b>U (utilization)</b>                      | $1/P$                      | 1                 | $1/S^2$         |
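The last rows of the table follow mechanically from the relations in the second column. A sketch that derives them for a given scaling factor S, assuming a fixed chip size and power budget as in the table; the struct and function names are illustrative.

```c
/* Scaling factors for one process generation with scaling factor S (> 1),
 * relative to the previous generation. */
struct scaling {
    double device_power;  /* p = Isat * Vdd   */
    double density;       /* D = 1 / (W * L)  */
    double chip_power;    /* P = D * p        */
    double utilization;   /* U = 1 / P        */
};

struct scaling scale_generation(double s, int leakage_limited)
{
    struct scaling r;
    /* Vdd stops scaling in the leakage-limited regime. */
    double vdd = leakage_limited ? 1.0 : 1.0 / s;
    double w = 1.0 / s;            /* transistor width shrinks either way */
    double tox = 1.0 / s;          /* oxide thickness shrinks either way */
    double isat = w * vdd / tox;   /* saturation current */

    r.device_power = isat * vdd;
    r.density = s * s;
    r.chip_power = r.density * r.device_power;
    r.utilization = 1.0 / r.chip_power;
    return r;
}
```

With `s = 2`, classical scaling gives chip power 1 and utilization 1, while the leakage-limited case gives chip power 4 and utilization 0.25: only a quarter of the chip can be powered within the same budget.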

# Moore's Law<sup>(1)</sup>

- The number of transistors we can build in a fixed area of silicon doubles every 12 ~ 24 months.

**We still have no trouble putting more transistors in a chip even for now!**



(1) Moore, G. E. (1965), 'Cramming more components onto integrated circuits', Electronics 38 (8).

# If you can add power budget...

| NVIDIA Accelerator Specification Comparison |                                  |                                  |                                 |
|---------------------------------------------|----------------------------------|----------------------------------|---------------------------------|
|                                             | H100                             | A100 (80GB)                      | V100                            |
| FP32 CUDA Cores                             | 16896                            | 6912                             | 5120                            |
| Tensor Cores                                | 528                              | 432                              | 640                             |
| Boost Clock                                 | ~1.78GHz<br>(Not Finalized)      | 1.41GHz                          | 1.53GHz                         |
| Memory Clock                                | 4.8Gbps HBM3                     | 3.2Gbps HBM2e                    | 1.75Gbps HBM2                   |
| Memory Bus Width                            | 5120-bit                         | 5120-bit                         | 4096-bit                        |
| Memory Bandwidth                            | 3TB/sec                          | 2TB/sec                          | 900GB/sec                       |
| VRAM                                        | 80GB                             | 80GB                             | 16GB/32GB                       |
| FP32 Vector                                 | 60 TFLOPS                        | 19.5 TFLOPS                      | 15.7 TFLOPS                     |
| FP64 Vector                                 | 30 TFLOPS                        | 9.7 TFLOPS<br>(1/2 FP32 rate)    | 7.8 TFLOPS<br>(1/2 FP32 rate)   |
| INT8 Tensor                                 | 2000 TOPS                        | 624 TOPS                         | N/A                             |
| FP16 Tensor                                 | 1000 TFLOPS                      | 312 TFLOPS                       | 125 TFLOPS                      |
| TF32 Tensor                                 | 500 TFLOPS                       | 156 TFLOPS                       | N/A                             |
| FP64 Tensor                                 | 60 TFLOPS                        | 19.5 TFLOPS                      | N/A                             |
| Interconnect                                | NVLink 4<br>18 Links (900GB/sec) | NVLink 3<br>12 Links (600GB/sec) | NVLink 2<br>6 Links (300GB/sec) |
| GPU                                         | GH100<br>(814mm <sup>2</sup> )   | GA100<br>(826mm <sup>2</sup> )   | GV100<br>(815mm <sup>2</sup> )  |
| Transistor Count                            | 80B                              | 54.2B                            | 21.1B                           |
| TDP                                         | 700W                             | 400W                             | 300W/350W                       |
| Manufacturing Process                       | TSMC 4N                          | TSMC 7N                          | TSMC 12nm FFN                   |
| Interface                                   | SXM5                             | SXM4                             | SXM2/SXM3                       |
| Architecture                                | Hopper                           | Ampere                           | Volta                           |



<https://www.workstationspecialist.com/product/nvidia-tesla-a100/>



<https://www.servethehome.com/wp-content/uploads/2022/03/NVIDIA-GTC-2022-H100-in-HGX-H100.jpg>

# Power Density of Processors



# Power consumption to light on all transistors

Chip: a grid of 49 cells, each consuming 1W

=49W

## Dennardian Scaling

Chip: a grid of 100 cells, each consuming 0.5W

=50W

## Dennardian Broken

Chip: a grid of 100 cells, each consuming 1W

=100W!

With a budget of about 50W, only about half of the chip can be on (~50W); the rest stays off (~0W): dark!

# Clock rate improvement is limited nowadays



# Announcements

- Assignment #4 due this Thursday
- If you submit iEVAL and submit the screenshot through eLearn, it counts as a “full-credit” notebook assignment
  - We drop two notebook assignments with this one included
  - In other words, if you submit iEVAL and the screenshot, your two lowest assignments get dropped.
- The final exam will be held at  
**BRNHL B118 (not the classroom for lectures)**  
**6/15/2023 11:30 a.m. – 2:30 p.m.**

# Computer Science & Engineering 203

To be continued

