

# Hardware/Software Codesign I

## Winter Term 2015/16

## Lesson 2

### SYNTHESIS, SYSTEM MODELS

### Exercise 1

The following system specification with 4 tasks  $T_1, \dots, T_4$  is given:



The tasks can be executed on different hardware units. The following table shows these units and the according costs and execution times.

| unit | cost/unit | exec. time (ms) |    |    |    |
|------|-----------|-----------------|----|----|----|
|      |           | T1              | T2 | T3 | T4 |
| MIPS | \$200     | 5               | -  | -  | 2  |
| DSP  | 120       | -               | 20 | 18 | 5  |
| FPGA | 240       | -               | 12 | 10 | -  |
| ASIC | 400       | -               | -  | 1  | -  |

- (a) Assume that every hardware unit can be used at most 1 time. Note the design space by listing all possible *Design Points*. A design point is a set of allocation, binding and scheduling.
- (b) Determine the costs and the execution times of all design points.
- (c) Draw the design point to a cost / execution-time diagram and mark the *Pareto Points*.
- (d) Are there new design points, if each hardware unit can be used more than just 1 time?
- (e) How many design points exist at least, if a task graph with  $n$  tasks is arbitrary mapped to a system with at least 3 hardware units?

|    | $T_1$ | $T_2$ | $T_3$ | $T_4$ | case | Exc. Time   |
|----|-------|-------|-------|-------|------|-------------|
| b) | M     | D     | D     | M     | 320  | $5+20+18+2$ |
|    | M     | D     | D     | D     | 320  | $5+20+18+5$ |
|    | M     | D     | F     | M     | 560  | $5+20+2$    |
|    | M     | D     | F     | D     | 560  | $5+20+15$   |
|    | M     | F     | D     | M     | -    | 25          |
|    | M     | F     | D     | D     | -    | 28          |
|    | M     | P     | F     | M     | 480  | 28          |
|    | M     | F     | F     | D     | 520  | 32          |
|    | M     | D     | A     | M     | 720  | 27          |
|    | M     | D     | A     | D     | 720  | 30          |
|    | M     | F     | A     | M     | 840  | 18          |
|    | M     | P     | A     | D     | 860  | 22          |

c)  $x_i = \text{cost}$   
 $x_L = \text{Runtime}$   
 $\exists y \in \mathbb{R}^n$  with  $y_i \in x_i ; 1 \leq i \leq n$  at least 1 point inequality

插入 Lecture  $\Rightarrow$  Pareto Point 因

d)  $M$   $F_1$   $\hat{F}_2$   $M$   $680$   $5+12+2$

e)  $k^n$   $k = \text{objects}$   
 $n = \text{repitions (tasks)}$

## Exercise 2

Graphs are a very important formalism for system modeling and automated synthesis.

- (a) Give the definition of a *graph*.
- (b) What are differences of directed and undirected graphs?
- (c) A completely connected graph  $G$  is given with  $n$  nodes. How many edges exist in  $G$  for the cases that it is a directed or an undirected graph?

## Exercise 3

Given is a set of tasks  $T = \{A, B, C, D, E\}$  and the dependency relation  $R = \{(A, B), (A, C), (C, D), (D, E), (C, E)\}$ . Hereby, a relation  $(A, B)$  means that task  $A$  produces data for task  $B$  and thus has to be executed before  $B$ .

Furthermore, a micro controller (MC), a digital signal processor (DSP) and a configurable gate array (FPGA) are available to implement the system. All components are connected to a common system-bus and can communicate pairwise by bidirectional point-to-point connections.

Task  $A$  can only be implemented on the FPGA or on the DSP. Task  $E$  can only be implemented on the FPGA. All other tasks can run on all components.

- (a) How is a *Specification Graph* defined?
- (b) Note the specification graph for the mentioned system.

$$G = (V, E)$$

$V$  = set of vertices / nodes

$E \subseteq V \times V : (U, V) | U, V \in V, U \neq V$  (Edges)

$\forall e_1, e_2 \in E: e_1 = (U, V) \quad e_2 = (V, U) \rightarrow e_1 = e_2 \rightarrow$  undirected Graph

$e_1 \neq e_2 \rightarrow$  directed Graph

Exercise

2

dir  $n \cdot n - 1$

undirected  $\frac{n \cdot (n-1)}{2}$



specification graph

Problem graph

possible mapping

Architecture graph

$V$  Task & Communication

$E$  Dependencies

$V$  HW component

$E$  comm. interface

$E$  Direct com port



Ex. A  $\rightarrow$  MC

And on each different path  
have different weight



# Hardware/Software Codesign I

## Winter Term 2015/16

# Lesson 3

## TARGET ARCHITECTURES (PROCESSORS)

### Exercise 1

The following schematic of a simple *central processing unit* (CPU) is given:



- Of what kind of architecture is this processor? *Von - no man*
- Note an appropriate state machine for the control unit.
- What are the most important classes of commands for this processor? How are they executed by the CPU?
- What are disadvantages of this proposed architecture?

[https://software.intel.com/zh-cn/articles/book-Processor-Architecture\\_CPU\\_work\\_process](https://software.intel.com/zh-cn/articles/book-Processor-Architecture_CPU_work_process)

*read data from RAM  
Save data directly*

### Exercise 2

Compare important characteristics of CISC- and RISC-architectures.

Fetch → Decode → Execute → |  
↓  
Update

CISC

complex Operator

memory management

different exec. time  
for each command

short program

(less memory)

special command

complex compilers

RISC

only Res. Operator

fixed exec. time  
for each command

long programmes

Pipelining

### **Exercise 3**

- (a) What are possible extensions for general purpose processors that accelerate the execution of multimedia applications.
- (b) What is the meaning of a permutation command?
- (c) Note a HW-implementation (Register-Transfer level) for the permutation command based on a 32-bit register with a subword size of 8 bit.

### **Exercise 4**

Complete the following table to compare characteristics of a *microcontroller* and a *digital signal processor*:

|                                | Microcontroller | DSP |
|--------------------------------|-----------------|-----|
| code-majority                  |                 |     |
| nr. of jumps/ branches         |                 |     |
| nr. of bit-/ logic operations  |                 |     |
| nr. of arithmetical operations |                 |     |
| data throughput                |                 |     |
| peripheral devices             |                 |     |
| parallelism                    |                 |     |
| real time properties           |                 |     |

### Exercise 3

(a) special induction sets (MMA)

(b) sub-word execution

(c)



### Exercise 4

MC

— Control flow

many J/b

many b/logic operations

Less arit op.

low

integrated

multitask

DSP

— Dataflow

Less Jumper/branches

Less

many

high

external

dedicated comm.  
(MAC)

Possible (but not  
typical for APP)

possible

efficiency

## Exercise 5

The following equation is given:

$$y = \sum_{k=0}^{N-1} (c[k] * x[k])$$

- (a) Note a C-like program, that calculates  $y$ .
- (b) Note an assembler DSP program, that calculates  $y$ . You can use the following assembler commands:

| Mnemonic   | Parameter                              | Description                                                                                |
|------------|----------------------------------------|--------------------------------------------------------------------------------------------|
| LDI        | $value, reg$                           | Loads an integer $value$ in a register $reg$                                               |
| ADDI3      | $reg\_op1, reg\_op2, reg\_val$         | Adds 2 register values $reg\_op1$ and $reg\_op2$ and stores the result in $reg\_val$       |
| MPYI3      | $reg\_op1, reg\_op2, reg\_val$         | Multiplies 2 register values $reg\_op1$ and $reg\_op2$ and stores the result in $reg\_val$ |
| RPTS       | $value$                                | Repeats the adjacent command $value$ -times                                                |
| OP1    OP2 | jeweils $reg\_op1, reg\_op2, reg\_val$ | Executes the operations OP1 and OP2 in parallel (OP has to be MPYI3 or ADDI3)              |

To access the operands in the memory, two address registers  $AR0$  and  $AR1$  are available. They can be used like normal registers. At startup time  $AR0$  points to the first value of  $c$  and  $AR1$  points to the first value of  $x$ .

The value in the memory address of  $ARx$  can be accessed with  $*ARx$ .

The operation ' $ARx ++$ ' increments the address automatically after the access to the memory.

Int y = 0;  
For ( int k=0; k < N; k++ )

$y += c[k] * x[k]; }$

LDI R0 // multiplication

LDI R1 // addition

RPTs N

ADDI R0, R1, R1

MPLY 3 \* R0++, \* R1++, R0

ADDI R0, R1, R1

# Hardware/Software Codesign I

## Winter Term 2015/16

# Lesson 4

TARGET ARCHITECTURES (FPGA, ASIC)

## Exercise 1

The following interface (ENTITY) and behavioral description (ARCHITECTURE) is given in the hardware description language VHDL:

```

1  ENTITY ex1  is
2    PORT(  a:      IN  BIT;
3          b:      IN  BIT;
4          c:      IN  BIT;
5          d:      IN  BIT;
6          y:      OUT BIT;
7          z:      OUT BIT
8        );
9  END ex1;
10
11 ARCHITECTURE behavioral OF ex1 IS
12 SIGNAL      temp:      BIT;
13 BEGIN
14
15     temp    <= a AND b;
16
17     y       <= temp AND (c OR d);
18     z       <= '1' WHEN temp = '1';
19
20 END behavioral;

```

- Explain the structure of a *Field Programmable Gate Arrays* (FPGA).
- Explain the different design steps to get from a given behavioral description to a FPGA programming file (bitstream).
- In the figure at the end of this sheet, the schematic of a very simple FGPA is shown. Due to simplification, the interconnection lines are not shown. Explain how this FPGA works.
- Implement the behavioral description (given above) as a complete FPGA configuration for the device in the mentioned figure. Show all interim results from the mentioned steps of task (b).



| $t_1 = a$ | $t_2 = b$ | $t_3 = c$ | $t_4 = \text{temp}$ | $t_5 = d$ | $t_6 = e$ | $t_7 = f$ |
|-----------|-----------|-----------|---------------------|-----------|-----------|-----------|
| 0         | 0         | 0         | 0                   | 0         | 0         | 0         |
| 0         | 1         | 0         | 0                   | 0         | 1         | 1         |
| 0         | 0         | 1         | 0                   | 0         | 0         | 0         |
| 1         | 1         | 1         | 1                   | 1         | 1         | 1         |



## Exercise 2

Compare advantages and disadvantages of *Application Specific Integrated Circuits* (ASICs) and FPGAs.

## Exercise 3

- Explain what is meant with *block-oriented design*?
- Classify block types depending on their abstraction level. — *On lecture ppt*
- Blocks can be offered as *Intellectual Properties* (IPs). Name some important characteristics for the classification of IPs.

|                                       | ASIC        | FPGA          |
|---------------------------------------|-------------|---------------|
| density                               | very high   | medium to low |
| performance                           | very high   | medium to low |
| Dev. Time<br>(Development)            | very long   | short         |
| production time                       | medium      | short         |
| COST<br>- low amort.<br>- high amort. | high<br>low | low<br>high   |

soft blocks  
firm blocks  
hard blocks

- 
- cost
  - flexibility / size
  - timing / delays
  - interfaces



# Lesson 5

## COMPILER

### Exercise 1

What are the main phases and tasks of a *Compiler*?

### Exercise 2

The following equation is given:

$$y := (-(a + b) * (c + d)) + (a + b + c)$$

- (a) Convert this equation to a *Syntax-Tree*.
- (b) Convert this equation to a *Directed Acyclic Graph* (DAG).
- (c) Convert this equation to *3-Address-Code*.

### Exercise 3

- (a) How is a *Basic Block* defined?
- (b) Note an algorithm (C-similar notation) that determines the basic blocks of a series of 3-Address-Commands. The commands are given in an array. Furthermore, following functions can be used:

```

1   bool JumpCommand(codeline cl);
2     // true, if cl is a jump
3
4   int  JumpTarget(codeline cl);
5     // gives the target (as code line number)
6     // of a jump in cl

```

The algorithm has to output the assignment of each command to a basic block.

## Lesson 5 Oper. 1



Ex 2:



- Max 3 Addresses (2 operands, 1 result)
- max 1 operators and 2 oprands

$$t_1 = a - b$$

$$t_2 = -t_1$$

$$t_3 = c + d$$

$$t_4 = t_2 * t_3$$

$$t_5 = t_1 + c$$

$$t_6 = t_4 + t_5$$

$$y = t_6$$

Bx3.

Basic Blocks 只有一个出口且执行不被中断 slide 4 pag 16

### Beginnings of BB

1. first intro. of code
2. introduction of jump
3. target of jump.

CodeLine [n];

bool jumpCommand (CodeLine c)

int jumpTarget (CodeLine C)

int BB[n]

// at least 2 initiators - the jmp target may

// before the jmp instruction)

BB[0] = 1  
|  
1  
2  
3  
[4] 3

int JT[n]

-for (int i=0; i<n; i++)

{ if (jumpCommand (L[i]))

JT [JumpTarget (L[i])] = i; } // Set slave to 1

int j=1;

BB[0] = 1;

-for (int i=1; i<n; i++)

{ if (jumpCommand (L[i-1]) || JT[i-1] == 1) j++;

BB[i] = j

}

return BB;

## Exercise 4

The following C program is given:

```
1 main()
2 {
3     int i;
4     int a[10];
5
6     i = 0;
7     while (i < 10)
8     {
9         if (a[i] == 0)
10            a[i] = a[i] + (i * 3);
11        else
12            a[i] = a[i] + ((i * 2) / (a[i] * 3));
13
14        i = i + 1;
15    }
16    ...
17 }
```

- (a) Translate this program in 3-Address-Code.
- (b) Determine the basic blocks.
- (c) Draw the *Degenerated Control-Flow Graph*.
- (d) Draw the DAGs of all basic blocks.

a) BB<sub>1</sub> → i := 0

BB<sub>2</sub> → if i >= 10 goto [15]

BB<sub>3</sub> [ t<sub>1</sub> := a[i] ]

if t<sub>1</sub> == 0 goto [10]

BB<sub>4</sub> [ t<sub>2</sub> := i \* 2 ]

else t<sub>3</sub> := t<sub>1</sub> \* 3

t<sub>4</sub> := t<sub>2</sub> / t<sub>3</sub>

t<sub>5</sub> := t<sub>1</sub> + t<sub>4</sub>

$$\begin{aligned} t_5 &:= t_2 / t_3 \\ t_5 &= t_1 + t_4 \end{aligned}$$

goto [12]

BB<sub>5</sub> [ line 10 ] t<sub>6</sub> := i \* 3

t<sub>7</sub> := t<sub>1</sub> + t<sub>6</sub>

BB<sub>6</sub> [ a[i] = t<sub>1</sub> ]

i := i + 1

goto [2]

BB<sub>7</sub> → 15

if x < loop y goes L

Start of BB<sub>1</sub>

1) Start of Program

2) Target of Jump

3) Line after Jump

End of BB<sub>1</sub>

4) Jump Line

b)



Next page

d)



# Hardware/Software Codesign I

## Winter Term 2015/16

# Lesson 6

## REGISTER BINDING

### Exercise 1

The following degenerated control-flow graph describes a program (branch conditions excluded):



- Determine the *Life Times* of the variables in *BB1*. Use the algorithm that was presented in the lecture. Assume, that the variables *a*, *d* and *e* are active at the end of *BB1*.
- Identify the active variables at each block entrance and each block end.

- (c) Determine the life times of all variables during the runtime of the whole loop. Draw the result to a diagram.
- (d) Note the costs that can be saved if a register is allocated for a variable during the runtime of the whole loop. Define the equation to calculate the saved costs during one run of the loop.
- (e) Determine a *register binding*. Use the algorithm of *Usage Counters*. Assume, that 5 registers are available on the target system.
- (f) Draw the *Register Conflict Graph* of the given program.
- (g) Determine a register binding by using the *Graph Colouring*. Assume, that the target system has 5 registers.
- (h) What is a *memory spill*? What variables should be selected for a spill?

## Task 6 Exercise /



$x$ : active

$\uparrow$  BB<sub>1</sub>



$$l_1 - a = t_0$$

|

$$j_1 \times x = a$$

$$6 \times y = a$$

$$7 \times d = 6 \times a$$



$$x = a$$

$$-a = -a$$

$$x \times d = 4 \times a$$

- ADD  $a, R_0(2) \rightarrow ADD R_1, R_0(1) \rightarrow$  save 1 cost

- if  $a$  has been defined in  $BB$  and is achieved at the end

ADD  $R_0, a(2)$

# Lesson 7

## Exercise 1 (a)



(b) ASAP as soon as possible  
ALAP as late as possible

(c) ASAP

(d) Mobility = ALAP - ASAP

the others can not move  
so mobility = 0

## Exercise 2

| step | schedule nodes of | PLise(o)<br>= MUL | PLise<br>= ADD |
|------|-------------------|-------------------|----------------|
| 0    | 3 - 9             | 1                 | 2              |
| 1    | 5 - 9             |                   | 3 4            |
| 2    | 6 - 9             | 5                 | 4              |
| 3    | 8                 | 8 . 6             | 7              |
| 4    | -                 | 9 . 6             |                |
| 5    |                   | 6                 |                |
| 6    |                   |                   |                |

# Lesson 8

## Exercise 1

- (a)
- (1)  $\text{MOV } \#1, R_0$
  - (2)  $\text{MOV } Y, R_0$
  - (3)  $\text{MOV } X, R_0$
  - $\text{ADD } \#1, R_0$
  - (4)  $\text{MOV } b, R_0$
  - $\text{MOV } C, R_1$
  - $\text{MUL } C, R_1$
  - $\text{MOV } a, R_1$
  - $\text{SUB } R_0, R_1$
  - (5)  $\text{MOV } b, R_0$
  - $\text{ADD } C, R_0$
  - $\text{MOV } a, R_1$
  - $\text{DIV } R_0, R_1$
  - $\text{MOV } e, R_0$
  - $\text{ADD } f, R_0$
  - $\text{MOV } d, R_2$
  - $\text{SUB } R_0, R_2$
  - $\text{SUB } R_2, R_1$



$$E = T_1 \cap T_2$$

- determine cost vectors
- instruction order
- generate code

(c)



$\text{MUL } R_1, R_2, R_1$

if 1 Reg  $\rightarrow$  op MUL 1 Reg:  $R_i = \frac{R_i \text{ op } M_j}{1 \ 1 \ 0} = 2$

2 Regs  $R_i = R_i \text{ op } M_j = 2$

$R_i = \frac{R_i \text{ op } R_j}{1 \ 1 \ 1} = 3$

3 Regs = 2 Regs.

$\text{memory} \simeq (\min(1, 2, 3 \text{ Reg}) +$

1 MOV = 3

move to memory



(5)



$$1 \text{ Reg} \quad R_1 = R_i \text{ op } M_j \quad \frac{5}{5} + \frac{1}{5} = 11$$

$$2 \text{ Reg} \quad R_i = R_i \text{ op } R_j \quad \frac{4}{4} + \frac{1}{5} > 10$$

$$R_1 = R_i \text{ op } M_j = 10$$

$$R_i = R_i \text{ op } M_j = 10$$

$$R_i = R_i \text{ op } R_j = 8 \quad \frac{4}{4} + \frac{4}{4}$$

$$\text{memory} = \min(11, 2, 5) + 1 = 10$$

MOV a, R0

MOV b, R1, 在左

ADD c, R1

DIV R1, R0

MOV d, R1

MOV e, R2

ADD f, R2

SUB R1, R2

SUB R1, R2

# Lesson 9

## Exercise 1

(b)

$$a = b + c$$

$$b = a - d$$

$$c = b + c$$

$$d = a - d \rightarrow d = b$$

$$x := y + 0 \times (2 \times x / y) \rightarrow x = y$$

$$y = y \times 1 \rightarrow \text{nop} (\text{no operation})$$

$$x = x + 0 \rightarrow \text{nop}$$

$$a = 5$$

$$b = a \times 4 \stackrel{\downarrow \text{simplification}}{=} b = 20$$

Globale Optimisation

$$x = t_1$$

$$a(t_2) = t_3$$

$$a(t_4) = x \rightarrow a[t_4] = t_1$$

white (  $i < (x * 4 + 2)$  )  
 $\{ \dots \}$   
 take out  $a =$

(a)

$a = ( \dots )$   
 $\text{white} ( i < a )$   
 $\{ \dots \}$

$$j = n$$

$$j = j - 1$$

$$t_4 = 4 \times j$$

$$t_4 = a(t_4)$$

$$\text{if } t_5 > 0 \text{ goto 3}$$

$$j = n$$

$$= n = j \times 4$$

$$j = j - 1$$

$$t_4 = t_4 - 4$$

$$t_5 = a[t_4]$$

$$\text{if } t_5 > 0 \text{ goto 3}$$

```

a := 5
b := a
d := MEM(x)
a := b
e := a + b
a := d ** 2
e := a + b
f := d ** 2
d := 12
goto (12)
c := a + b
goto (14)
d := 5 * 4
e := 1
c4 := e - 1
t3 := a ** c4
t2 := d * t3
c := a + t2
b := 0
if b > (a * 8) goto (27)
d := c * e
a := d
t1 := b * 4
g(b) := t1
b := b * 2
goto (28)
d := c * e
a := d
t := b * a
b := a

```

```

for( i=0; i< len; i=i+1)
    {
        a[i] = i;
        a[i+1] = i+1
    }

```

$$d = \text{MEM}(x)$$

$$a = d * d$$

$$f = a$$

$$d = 12$$

(P)

$$16 > 0 \rightarrow 1$$

$$a = b / f \rightarrow a < 0$$

$$C = a + b$$

$$a = 1$$

$$b = 0$$

$$t1 := 1$$

$$t2 := a * t1$$

$$t3 := a * t2 = 0$$

$$t4 := e - 1 = 0$$

$$t5 := a * t4 = 1$$

$$t2 := d * t3 = d$$

$$e := a + t2 = a + d$$

$$b := 0$$

$$\text{if } b > (a + 8) \text{ goto (2)}$$

$$20 > f + b = t_0 \text{ goto (3)}$$

$$g(t_0) := t_0$$

$$t_0 > 8$$

$$\text{goto (2)}$$

$$\text{goto (1)}$$

$$b = C + f$$

$$d = b$$

$$a = b + c$$



$$b = 5,$$

$$d = \text{MEM}(x)$$

$$a = d * d$$

$$e = a + 5$$

(C)

if  $b > (a + 8)$  goto (27)

*done  
do any  
thing*

machine code

mov R, a  $\rightarrow$  MOp

mov a, R

MUL R0, 8  $\rightarrow$  SHL R1, 3

SQR R0,  $\rightarrow$  MUL R0, R0

(1) Jump 3  $\rightarrow$  Jump N

(2);

(3) jump N