

2-1-17

# C S O

Servers, Embedded Systems, desktop system  
A digital watch may not be called a computing system as it is not programmable. It does only one job.

Can we say that; even the computers are limited; so how can you call it programmable?

No. Turing Machine model  
(of a computer)



example:  
 $f(\text{program as an input}) \rightarrow 1 \text{ if there is no } \infty \text{ loop}$   
 $\rightarrow 0 \text{ if there is an } \infty \text{ loop}$

So we call a device a "computing system" if one can realise all the computable funct.  
The digital watch was limited. Hence we MAY NOT consider it to be a digital computing system.

Levels of programmability:

digital watches  $\rightarrow$  Multiplexers  $\rightarrow$  Laptops

# Basic Units of a Computer:



For Embedded systems, Input Output I/O is not same for each of the system.

Internet of Things?

Embedded systems?

To Run a processor, we need to know its Instruction set.

1. Inst. set Arch. design (ISA)
2. processor design
3. Interface w/ memory
4. I/O devices

course topics

Plug Computer?

[small computers; display, I/O is a problem]

We can use a projector  
PICO projector?

When we execute a <sup>compiled</sup> program, it gets loaded onto memory and the PC is set accordingly.  
Processor can interact only with the memory.



Output of the processor is stored in the memory. The OS redirects the stored output to the destination.

We want to design machines which have good

- 1) performance/speed      4) cost.
- 2) power (low consumption)      (less cost)
- 3) size/area

### Processor Model



All the processor does is: fetch-decode-execute cycle.

For improving the performance of the processor the memory model should also be understood.

Loader (present as a part of the OS) loads the executable into the main memory.

Bootstrapping problem: initialization problem.

The problem here is that ~~we can't~~ when we switch on the computer, the loader must be activated so that meaningful instructions are executed & not some garbage values.



When we compile & run a program in IA-32 processor in Linux and run it in Windows IA-32 processor, it will not run as many instructions like printf, scanf include OS calls which are not same in all the OS'.

Processor models:

Harvard architecture:



Van Neumann or Princeton arch.

(most common)

Both memory in Single memory

The advantage with harvard is: we can have // access to both inst, data. But since we have partitioned the memory, there is a limit on how <sup>any</sup> much instructions can be accommodated.

SOC architectures: System on chip arch.  
(Smart phones, etc)

1. gcc-gS hello.c gives hello's assembly code.

2. gcc-g hello.c -o hello  
objdump -D -S hello

Given a binary string,

$$B2 \text{ Unsigned } (\vec{x}) = \sum_{i=0}^{w-1} 2^i x_i$$

$$B2 \text{ Signed } (\vec{x}) = (-1)^{x_{w-1}} \sum_{i=0}^{w-2} 2^i x_i$$

$$B2 \text{ One's Comp } (\vec{x}) = -(2^{w-1} - 1) + x_{w-1} + \sum_{i=0}^{w-2} 2^i x_i$$

$$B2 \text{ 2's Comp } (\vec{x}) = -2^{w-1} x_{w-1} + \sum_{i=0}^{w-2} 2^i x_i$$

|    | J | SM | 1's | 2's | t <sub>1</sub> | 00 01 10 11 | 10 123  |
|----|---|----|-----|-----|----------------|-------------|---------|
| 00 | 0 | 0  | 0   | 0   | 00             | 00 01 10 11 | 0 0 123 |
| 01 | 1 | 1  | 1   | 1   | 01             | 01 10 11 00 | 1 1 230 |
| 10 | 2 | -0 | -1  | -2  | 10             | 10 11 00 01 | 2 2 301 |
| 11 | 3 | -1 | 0   | -1  | 11             | 11 00 01 10 | 3 3 012 |

1. Depending on the technology which we use,  
~~we~~ we may have to use a particular notation
2. The cost of mapping from 0-9 notation to  
that particular notation should be low.

Unary notation:

|   |     |   |     |      |
|---|-----|---|-----|------|
| 0 | [ ] | 1 | 11  | 111  |
|   | [ ] | 1 | 11  | 111  |
|   |     | 1 | 11  | 111  |
|   |     |   | 111 | 1111 |

Cost of conversion is high.

Another thing to be taken care of: Overflow  
We can send a signal to indicate this.

There is partial isomorphism b/w 1's comp  
and binary rep. table.

$$\begin{array}{r} \text{Overflow:} \\ \text{Cin} \\ \downarrow \\ \begin{array}{r} x_{w-1} \dots x_1 \\ \underline{y_{w-1} \dots y_1} \\ \text{(out)} \quad s_{w-1} \dots s_1 \end{array} \end{array}$$

(if  $((\text{Cin} \oplus \text{out}) = 1)$  "overflow" "No overflow")

There is a flag register which is used for storing the meta information about a computation.

This flag register is called CPSR in ARM processor.

| <u>C</u> | <u>V</u> | <u>Op-1</u> | <u>Op-2</u> |
|----------|----------|-------------|-------------|
| 0        | 0        | 00          | 00          |
| 0        | 1        | 01          | 01          |
| 1        | 0        | 11          | 11          |
| 1        | 1        | 10          | 10          |

1. Sign-extension
2. Arithmetic & logical shift.



function fa(a,b,cin)  
  \underbrace{\hspace{1cm}}\_{wires (not args)}

$$\begin{cases} s = (a \wedge b) \wedge cin & [PTO] \\ c\_out = \text{_____} \end{cases}$$

names of wires:

### ALU design

Data Pack



has info  
abt the  
functions  
to be  
performed

Control Pack



Takes in a fn,  
co-ordinates the  
fn execution by  
taking info from data pack.

n-bit IS Architecture: Operands can be of n-bits.

RISC V Architecture?

Hardware (H/w) design



BSV source code

(SILICON)

Simulation

(check for  
correctness of circ.)

ASIC

[IC  
fabrication]

Risky

FPGA

Re-programmable  
hardware.

function:

```
fa(a, b, c-in);
s = (a ^ b) ^ c-in;
c-out = (a & b) | (c-in & (a ^ b));
return {c-out, s};
```

ordering is important;  $s$  is the least significant bit.

Blue spec compiler remembers  $a \wedge b$  and directly uses it for the next time rather than re-computing it.

But here we didn't specify the no. of bits of the input.

Bit #(2) fa(a, b, c-in)

Bit #(1).s = a<sup>1</sup> b<sup>1</sup> c-in

Bit #(1).c-out ...

return {...}

end.function

Advantages:

1. Readability

2. Type-check

Bit #(1).s = a<sup>1</sup> b<sup>1</sup> c-in  
1bit ^ 1bit ^ 1bit  
1bit

matched.

3. This is a strongly-typed-language.

e.g. of weakly typed:

```
char a = 'A';
int b = a;
```

easy to  
this is where we say  
it is weakly typed.

2-bit Ripple Carry Adder:

function: Bit#(3) add (Bit#(2) x, Bit#(2) y,  
Bit#(1) co);

Bit#(2) s = 0;

Bit#(3) c;

c[0] = co;

Let  $c_{so} = fa(x[0], y[0], c[0]);$

$c[1] = c_{so}[1]; s[0] = c_{so}[0];$

Let  $c_{s1} = fa(x[1], y[1], c[1]);$

$c[2] = c_{s1}[1]; s[1] = c_{s1}[0];$

return {c[2], s};

endfunction.



When we use let, the type exp of RHS  
is evaluated & assigned to cs0. We need not  
mention it.

w - Bit Ripple Carry adder  $\uparrow \#$

function Bit#(w+1) addN (#(w) x,

#(w) y,

#(1) co);

#(w) s; #(w+1) c; c[0] = co;

for (Integer i=0; i<w; i=i+1)

begin

let cs = fa(x[i], y[i], c[i]);

c[i+1] = cs[1]; s[i] = cs[0];

end.

return {c[w], s};

end function

To call this fw:

function #(33) add32 (#(32)x...)  
= addN(x, y, c).

① Bit#(w+1) → Error

Bit#(TAdd#(w, 1))

② i < w → Error

Let value = valueOf(w);

i < value.

Bluespec Compiler says w is a parametric type variable and hence we can't say w+1 or i < w.

When we say Integer i, it has no bound.

(not a 32bit or

64 bit  
no.)

Logical Shift  
Right Circuit



0 abc

0 0 ab



Code this.

## ALU

ADD, SUB, MUL, DIV

AND, OR, XOR,

LSR, ASR, LSL, ROR

## Shift Operations

define

$$S_0(a b c d e) = a b c d e$$

$$S_1(a b c d e) = 0 a b c d$$

$$S_2(a, b, c, d, e) = 00 a b c$$

$$S_4(a b c d e) \Rightarrow 0000 a$$

$S =$  No. of shifts required.

$$S = \underbrace{b_2 b_1 b_0}_{\text{bits}}$$

Given  $S, a b c d e$  design shift circ.



Circuit

complexity:  $O(K)$

(depth  
of the circ.)

→ No. of bits  
in 's!

Can we do it in  $O(1)$ ,  $O(\log K)$ ,  $O(\sqrt{K})$ .

### 4x1 Multiplexer

case {S1, S0} , matches

0: A;

1: B;

2: C;

3: D;

end case



The above code in machine language will take atleast 30-40 lines. If we are executing many times, we lose both power, time in executing the same inst. Instead if we design hardware for it, it would be really helpful. Now-a-days

we have re-configurable processors in which we can design it ourselves.

There are some typedefs : [ slides ]

[ ~~Did not note down~~ a topic: Colour Coding, etc: Refer slides ]

## Multiplication by repeated Addition

b Multiplicand  $\underline{1101}$

a Multiplier  $\underline{1010}$  ( slides )

$$[(a[i] \neq 0) \cdot b]$$

Processors  $\rightarrow$  Single Cycle [ All the operations in 1 cycle ]  
clock length:  $\uparrow$

Multi-cycle [ Multiple cycles ]

Clock time period: lesser

Pipelined [ model ]

( Doing operations

in parallel by

using multiple registers )

Superscalar processors

use pipelined model

**Single Cycle**



Clock cycle length is increased.

clk span > max( delayadd, delaysub, ... )

Performance isn't so good.

To counter this, we have multiple clock cycle method:

For eg: Load inst. may take 5 steps.  
Add: 3 steps.

$$\text{clk span} > \max([l_1 - l_5], [a_1 - a_3], \dots)$$

here we pay the price of a long inst process only when it is invoked as opposed to single clock system where we pay the price for all the functions in the ISA.

Single clock system



Superscalar: [out of order execution]

↳ Instruction level parallelism, [ILP]

Sequential Circuits



In brief, if  $(t_{su}, t_{tw})$ , input shd not oscillate.

Else it enters a metastable state in which

it oscillates b/w 0/1 only.  $[t - t_{su}, t]$   $[t, t + t_h]$

$\rightarrow$  setup

violation

$\rightarrow$  hold-time

violation

If clock span is so less that its value is b/w  $(t_{su} + t_h)$ ; meta stability state is inevitable.

If we want to put an EN to the FF:



Dangerous

Read port, Write ports in a register file; (set of registers)



2 read ports as we may want to access both the Registers simultaneously.

When we are executing inst. in parallel 2 instructions may want to access 5-diff Registers in a file at a time leading to structural hazards. So, increase the read, write ports.

If 2 ADD's are present:  
(and only one adder is present)  
structural hazard.  
so increase the no. of ALU's or use  
pipelined model.

To check whether Inst. can be executed  
in parallel or not: We put a window  
(generally 4-8) set of instructions.

### Instruction Scheduling

\* Let  $w = 2$ .  
(window)

→ [ADD R1 R2  
ADD R3 R5  
OR R25 R23]

It thinks  
they can't be  
run in 1/  
run

But if we swap Add, OR  
we can do this in parallel.

So, the compiler does these swaps  
for us.

This is called Instruction Scheduling.



The efficiency of swapping may be  
diff in diff. compilers.

## Data hazards

ADD R1 R2 R3

ADD R4 R7 R1

RAW [Read after write] dependency.

flow dependency, True dep.

MUL R1 R2 R3

ADD R3 R7 R10

If ADD finishes faster than MUL,  
new value of R3 is multiplied  
with R2 but we want old value  
(WAR)

Loop using registers:

for C Integer i=0 ; i<32 ; i:=i+1).

begin

let s = f(s);

end

return s;

Equivalent circuit:  $\xrightarrow[s]{f} \xrightarrow{s} \xrightarrow[32]{f} \dots$

Another implementation:



Sequential elements in BlueSpec:

```
Reg#(Bit#(32)) a <- mkRegU();
```

initialisation

$b, prod, tp;$

```
Reg#(Bit#(6)) i <- mkReg(32);
```

rule mulStep if (i < 32);

$Bit#(32) m = (a[i] == 0) ? 0 : b[31:0]$

$[a[0] == 0] \rightarrow a \leftarrow a \gg 1;$

$m' (33) sum = add32(m, tp, 0);$

$prod[i] <= sum[0] \rightarrow prod \leftarrow$

$prod \ll 1 + tp. i <= sum[32:1]; (sum[0],$

$i <= i + 1; (prod \gg 1) [30:1])$

endrule.

Instructions in the loop can be executed in parallel.

What happens if  $i++$  happens before  $prod[i]$  is updated? Although  $i+1$  is available at the Adder bus, registers get updated only in the next rising edge.

Although we saved space, we couldn't do good with speed. (we added extra time - updating registers).

2-2-17

## Combinational IFFT

(Inverse fast forward transform).

The inputs are complex nos: represented with 32 bits. (16 for real part, 16 for complex).

Four inputs each are given to butterfly 4 units which returns 4 complex nos. which are then fed into the permute box which permutes the inputs and this process goes on.

All the bfly 4 units are identical in design but the constant factor ( $\tilde{T}$  twiddle factor) depends on the position at which the bfly unit is present. We observe in the circuit that the same hardware again and again? Can we optimize this? Yes,

with FOLDING.



can be done in a more efficient way:



This width must be enough for function f to finish.



1st computation is done

In the first circ:

$$t_{clock} > 2t_f + t_g$$

$$\stackrel{2nd}{=} t_{clock} > \max(t_{cq}, t_f + t_{su})$$

If we try to introduce an extra instruction into the ISA, we may have to pay the price as we may have to increase the clock period. (Processor freq. decreases.)

$$[ t_{clock} > \max(t_{add}, \dots, t_{IFFT}) ]$$

so by folding we have opt. on space



stage gets updated only at the rising edge  
as it is a seq. element-

rule folded-pipeline (True);

if (stage == 0)

begin SxIn = inQ.first(); inQ.deq();

else SxIn = SReg;

SxOut = f(stage, SxIn);

if (stage == n - 1) outQ.enq(SxOut);

endrule else SReg <= SxOut;

stage <- (stage == n - 1) ? 0 : stage + 1;

Having a Queue makes our circ flexible.  
We need not wait for ~~that~~ the reqd no. of clock cycles. We can just push it into the Queue.

Even if we change the order of statements in the code, it doesn't affect the output. As update happens only at the next clock cycle.

'=' symbol: naming a bus.

$\Leftarrow \rightarrow$  state element.

~~Border Cases:~~ if (~~in Q is full || out Q is full~~)  
~~blue spec sees to it that the rule doesn't fire at all.~~

This is Atomicity [All the rules in the hardware will work or none of them work]

↳ ALL or NONE

Going back to IFFT:



Although this takes 3 clock cycles we get one output / every clockcycle.

⇒ Throughput is 1 output / clock cycle.  
So we gain on throughput.

If we see pipeline and Multicycle:  
~~pipeline~~ both of them have similar clock speed but throughput is lost in M-cycle.

Pipeline code:

```
rule sync-pipeline (True);
    inQ : deg();
    sReg1 ∈ f0 (inQ.first());
    sReg2 ∈ f1(sReg1);
    outQ.enq (f2 (sReg2));
end rule.
```