



**DEPARTMENT  
OF  
ELECTRONICS & COMMUNICATION ENGINEERING**

**EMBEDDED SYSTEM DESIGN  
(Theory Notes)**

**Autonomous Course**

**Prepared by**

**Prof. MANASA R**

**Module – 2 Contents**

**Processors Architecture:** Advanced Processor Technology, Super Scalar and Vector Processors

**Dayananda Sagar College of Engineering**

**Shavige Malleshwara Hills, Kumaraswamy Layout,  
Banashankari, Bangalore-560078, Karnataka**

**Tel : +91 80 26662226 26661104 Extn : 2731 Fax : +90 80 2666 0789**

**Web - <http://www.dayanandasagar.edu> Email : [hod-ece@dayanandasagar.edu](mailto:hod-ece@dayanandasagar.edu)**

**( An Autonomous Institute Affiliated to VTU, Approved by AICTE & ISO 9001:2008 Certified )  
( Accredited by NBA, National Assessment & Accreditation Council (NAAC) with 'A' grade )**

PROCESSORS ARCHITECTURE

- A wide variety of processors & multiprocessors exist today.
- In this module we study advances in processor technology and the architecture of superscalar & vector processors in detail.

### Advanced Processor Technology

- Architectural families of modern processors are introduced below with the underlying microelectronics/packaging technologies.
- The coverage spans from VLSI microprocessors used in workstations or multiprocessors to heavy-duty processors used in mainframes & Supercomputers.
- Major processor families to be studied include CISC, RISC, superscalar, VLIW (Very Long Instruction Word), superpipelined & vector processors.
- Vector & Scalar processors are for numerical computation.

### Design Space of Processors

- Various processor families can be mapped onto a coordinate space of clock rate versus cycles per instruction (CPI), as given in the fig below.
- As implementation technology evolves rapidly, the clock rates of various processors are gradually moving from low to higher speeds toward the right of the design space.
- Another trend is that processor manufacturers are trying to lower the CPI rate using H/w & S/w approaches.

- Based on these trends, the mapping of processor in fig below reflects their implementation during the past decade.
- As time passes, some of the mapped ranges may move toward the lower right corner of the design space.
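The two axes of the design space combine into a single performance figure. As a sketch in standard notation (the symbols $I_c$ for instruction count, $f$ for clock rate in Hz and $T$ for execution time are introduced here and do not appear in the fig):

$$
T = \frac{I_c \times \mathrm{CPI}}{f}, \qquad
\text{MIPS rate} = \frac{I_c}{T \times 10^{6}} = \frac{f}{\mathrm{CPI} \times 10^{6}}
$$

A higher clock rate and a lower CPI both raise the MIPS rate, which is why designs aim for the lower right corner of the design space.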



Fig: Design space of modern processor families (clock rate versus CPI).

### The Design Space

### Scalar CISC (Complex-instruction-set computing)

- Conventional processors like the Intel i486, M68040, VAX/8600, IBM 390 etc. fall into the family known as complex-instruction-set computing (CISC) architecture.
- The typical clock rate of today's CISC processors ranges from 33 to 50 MHz.
- With microprogrammed control, the CPI of different CISC instructions varies from 1 to 20.
- Therefore, CISC processors are at the upper left of the design space.

### Scalar RISC (Reduced instruction set computing)

- Today's reduced instruction set computing (RISC) processors, such as the Intel i860, SPARC, MIPS R3000, IBM RS/6000 etc have faster clock rates ranging from 20 to 120 MHz.

- With the use of hardwired control, the CPI of most RISC instructions has been reduced to one to two cycles.
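As a rough illustration of where the two scalar families land in the design space, the sketch below turns the clock-rate and CPI ranges quoted in these notes into a MIPS range, using MIPS = clock rate (MHz) / CPI. The function and dictionary names are only illustrative.

```python
def mips_rate(clock_mhz, cpi):
    """MIPS rate = clock rate in MHz divided by cycles per instruction."""
    return clock_mhz / cpi

# (clock range in MHz, CPI range) as quoted in the notes
families = {
    "Scalar CISC": ((33, 50), (1, 20)),
    "Scalar RISC": ((20, 120), (1, 2)),
}

def mips_range(family):
    """Return (worst case, best case) MIPS for one family."""
    (f_lo, f_hi), (cpi_lo, cpi_hi) = families[family]
    worst = mips_rate(f_lo, cpi_hi)   # slowest clock, highest CPI
    best = mips_rate(f_hi, cpi_lo)    # fastest clock, lowest CPI
    return worst, best

for name in families:
    worst, best = mips_range(name)
    print(f"{name}: {worst:.2f} to {best:.2f} MIPS")
```

The ranges overlap at the top end of CISC, but the RISC worst case is much better because its CPI never exceeds two cycles.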

### Superscalar RISC

- Superscalar processors are a special subclass of RISC processors that allow multiple instructions to be issued simultaneously during each cycle.

- Thus the effective CPI of a superscalar processor should be lower than that of a generic scalar RISC processor.

- The clock rate of superscalar processors matches that of scalar RISC processors.

### VLIW (Very Long Instruction Word)

- VLIW architecture uses even more functional units than that of a superscalar processor.

- The CPI of a VLIW processor can be further lowered.

- Due to the use of very long instructions (256 to 1024 bits per instruction), VLIW processors have been mostly implemented with microprogrammed control.
- Thus, the clock rate is slow with the use of read-only memory (ROM).
- A large number of microcode access cycles may be needed for some instructions.

### Superpipelined processors

- They use multiphase clocks with a much increased clock rate ranging from 100 to 500 MHz.
- The CPI is rather high unless superpipelining is practiced jointly with multi-instruction issue.
- The processors in vector supercomputers are mostly superpipelined & use multiple functional units for concurrent scalar & vector operations.
- The effective CPI of a processor used in a supercomputer should be very low, positioned at the lower right corner of the design space.
- However, the cost increases appreciably if a processor design is restricted to the lower right corner.

### Instruction Pipelines

- The execution cycle of a typical instruction includes four phases
  - 1) Fetch
  - 2) Decode
  - 3) Execute
  - 4) Write-back
- These instruction phases are often executed by an instruction pipeline as demonstrated in fig below
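Since the figure is easiest to read as a space-time diagram, here is a minimal sketch that prints one for the four phases listed above. It assumes the ideal case of one new instruction entering the pipeline every cycle; the stage letters and the function name are illustrative.

```python
STAGES = ["F", "D", "E", "W"]  # Fetch, Decode, Execute, Write-back

def space_time(n_instr):
    """Return one text row per instruction; each column is one pipeline cycle."""
    total_cycles = n_instr + len(STAGES) - 1
    rows = []
    for i in range(n_instr):
        cells = ["."] * total_cycles
        for s, name in enumerate(STAGES):
            cells[i + s] = name          # instruction i enters stage s at cycle i + s
        rows.append(f"I{i + 1} " + " ".join(cells))
    return rows

print("\n".join(space_time(4)))
```

Each row is shifted one cycle to the right of the previous one, so once the pipeline is full, one instruction completes per cycle.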



### Basic Definitions:

#### 1). Instruction pipeline Cycle:

The clock period of the instruction pipeline

#### 2). Instruction Issue Latency:

The time (in cycles) required between the issuing of 2 adjacent instructions.

#### 3). Instruction issue rate:

The number of instructions issued per cycle, also called the degree of a superscalar processor.

#### 4). Simple operation Latency:

- Simple operations make up the vast majority of instructions executed by the machine, such as integer adds, loads, stores, branches, moves etc.

- Complex operations are those requiring an order-of-magnitude longer latency, such as divides, cache misses etc.

- These latencies are measured in number of cycles

#### 5). Resource Conflicts:

This refers to the situation where 2 or more instructions demand use of the same functional unit at the same time.

- A Base Scalar processor is defined as a machine with one instruction issued per cycle, a one-cycle latency for a simple operation, & one-cycle latency between instruction issues.
- The instruction pipeline can be fully utilized if successive instructions can enter it continuously at the rate of one per cycle as shown in fig (a).

- The instruction issue latency can be more than one cycle for various reasons.

Example: If the instruction issue latency is two cycles per instruction, the pipeline can be underutilized, as demonstrated in fig (b) below.



Fig (b) Underutilized pipeline with two cycles per inst issue

- Another underutilized situation is shown in fig (c) below, in which the pipeline cycle time is doubled by combining pipeline stages.
- In this case, the fetch & decode phases are combined into one pipeline stage, & execute & write-back are combined into another stage.
- This will also result in poor pipeline utilization



Fig (c) Underutilized pipeline with twice base cycle

- The effective CPI rating is 1 for the ideal pipeline shown in fig (a) & 2 for the case in fig (b).
- In fig (c) the clock rate of the pipeline has been lowered by one-half.
- Both fig (b) & fig (c) will reduce the performance by one-half, compared with the ideal case in fig (a) for the base machine.
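The one-half figure can be checked with a small timing model. This is a sketch under simple assumptions (no stalls other than the issue latency, time measured in base cycles); the function and parameter names are made up for the illustration.

```python
def total_time(n_instr, stages=4, issue_latency=1, cycle_scale=1):
    """Time, in base cycles, to finish n_instr instructions.

    stages        -- number of pipeline stages
    issue_latency -- pipeline cycles between two adjacent issues
    cycle_scale   -- pipeline cycle length as a multiple of the base cycle
    """
    pipeline_cycles = stages + (n_instr - 1) * issue_latency
    return pipeline_cycles * cycle_scale

N = 1000
ideal = total_time(N)                                   # one issue per cycle
two_cycle_issue = total_time(N, issue_latency=2)        # two cycles per issue
merged_stages = total_time(N, stages=2, cycle_scale=2)  # two stages, doubled cycle

for label, t in [("ideal", ideal), ("2-cycle issue", two_cycle_issue),
                 ("doubled cycle", merged_stages)]:
    print(f"{label:14s} {t:5d} base cycles, relative speed {ideal / t:.2f}")
```

For a long instruction stream, both underutilized cases take about twice as long as the ideal pipeline, so their relative speed approaches 0.5.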

### Processors & Co-processors

- The central processor of a computer is called the Central processing unit (CPU) as given in fig below



ALU - Arithmetic & Logic Unit

DMA - Direct Memory Access

CPU - Central processing unit

Fig (a) CPU with built-in floating-point unit



Fig (b) CPU with an attached co-processor

Fig: Architectural models of a basic scalar computer system

- This CPU is essentially a scalar processor, which may consist of multiple functional units such as an integer arithmetic & logic unit (ALU), a floating-point accelerator etc.
- The floating-point unit can be built on a co-processor (Fig (b)), which is attached to the CPU.
- The co-processor executes instructions dispatched from the CPU.
- A co-processor may be a floating-point accelerator executing scalar data, a vector processor executing vector operands, a DSP, or a Lisp processor executing AI (Artificial intelligence) programs.
- Co-processors cannot handle I/O operations.

### Co-processor - Processor pairs & characteristics

- Table below lists some processor - coprocessor pairs developed in recent years to speed up numerical computations.
- Co-processors cannot be used alone.

- The processor & co-processor operate with a host-back-end relationship & compatibility between them is a necessity.
- Co-processors are also called attached processors or slave processors.
- An attached processor may be more powerful than its host.

Example: Cray Y-MP is a back-end processor driven by a small mini computer.

| Coprocessor      | Compatible Processor          | Coprocessor Characteristics                                      |
|------------------|-------------------------------|------------------------------------------------------------------|
| Intel 8087       | Intel 8086/8088               | 5MHz, 70 cycles for Add & 700 cycles for log.                    |
| Intel 80287      | Intel 80286                   | 12.5MHz, 30 cycles for Add & 264 cycles for log.                 |
| Intel 387DX      | Intel 386DX                   | 33MHz, 12 cycles for Add & 210 cycles for log.                   |
| Intel i486       | Intel i486<br>(the same chip) | 33MHz, 8 cycles for Add & 171 cycles for log                     |
| Motorola MC68882 | Motorola MC68020/68030        | 40MHz, 56 cycles for Add & 574 cycles for log                    |
| Weitek 3167      | Intel 386DX                   | 33MHz, 6 cycles for Add & 365 cycles for log by software emulation |
| Weitek 4167      | Intel i486                    | 33MHz, 2 cycles for Add & not available for log                  |


### Instruction Set Architecture (lowest level visible to programmer)

- The instruction set of a computer specifies the primitive commands or machine instructions that a programmer can use in programming the machine.
- The complexity of an instruction set is attributed to the instruction formats, addressing modes, general-purpose registers, opcode specifications & flow control mechanisms used.
- Based on past experience in processor design, two schools of thought on instruction-set architectures have evolved, namely CISC & RISC.

#### Complex Instruction Set Computer [CISC]

- In early days of computer history, most computer families started with an instruction set which was rather simple.
- The main reason for being simple then was the high cost of H/w.
- The H/w cost has dropped & S/w cost has gone up steadily over the past 3 decades.
- The semantic gap b/w HLL (High level language) features & computer architecture has widened.
- More & More functions have been built into the H/w making the instruction set very large & very complex.
- The growth of instruction sets was encouraged & user defined instruction sets were implemented using microcodes in some processors for special purpose applications.

- A typical CISC instruction set contains approximately 120 to 350 instructions using variable instruction/data formats, uses a small set of 8 to 24 general-purpose registers (GPRs) & executes a large number of memory reference operations based on more than a dozen addressing modes.
- Many HLL statements are directly implemented in H/w or firmware in a CISC architecture.
- This may simplify the compiler development, improve execution efficiency, & allow an extension from scalar instructions to vector & symbolic instructions.

#### Reduced Instruction Set Computer [RISC]

- Computer designs started with RISC instruction sets & gradually moved to CISC instruction sets during the 1980's.
- After 2 decades of using CISC processors, computer users began to re-evaluate the performance relationship between instruction-set architecture & available H/w & S/w technology.
- Later, computer scientists realized that only 25% of the instructions of a complex instruction set are frequently used, about 95% of the time.
- This implies that about 75% of H/w-supported instructions often are not used at all.
- The question then arises: why should we waste valuable chip area on rarely used instructions?
- With low-frequency elaborate instructions demanding long microcode to execute them, it may be more advantageous to remove them completely from H/w & rely on S/w to implement them.

- Even if S/w implementation is slow, the net result will be still a plus due to their low frequency of appearance.
- Pushing rarely used instructions in S/w will vacate chip areas for building more powerful RISC or Superscalar processors, even with on-chip caches or floating-point units.
- A typical RISC instruction set contains less than 100 instructions with a fixed instruction format (32 bits).
- Only 3 to 5 simple Addressing modes are used.
- Most instructions are register-based.
- Memory access is done by load/store instructions only.
- A large register file (at least 32 registers) is used to allow fast context switching among multiple users, & most instructions execute in one cycle with hardwired control.
- Because of the reduced instruction set complexity, the entire processor is implementable on a single VLSI chip.
- The resulting benefits include a higher clock rate & a lower CPI, which lead to higher MIPS (million instructions per second) ratings as reported on commercially available RISC/superscalar processors.
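The "net plus" argument for moving rare instructions into S/w can be checked with a back-of-the-envelope calculation. Only the 95% share comes from these notes; every CPI value and the 30-instruction emulation length below are hypothetical numbers chosen purely for illustration.

```python
def avg_cycles(frequent_share, frequent_cpi, rare_cost_cycles):
    """Average cycles per operation for a mix of frequent and rare operations."""
    rare_share = 1.0 - frequent_share
    return frequent_share * frequent_cpi + rare_share * rare_cost_cycles

FREQUENT_SHARE = 0.95          # from the notes: frequent instructions, 95% of the time

# Hypothetical CISC: microcoded, frequent ops take 4 cycles, rare ops 20 cycles
cisc = avg_cycles(FREQUENT_SHARE, frequent_cpi=4.0, rare_cost_cycles=20.0)

# Hypothetical RISC: frequent ops take 1.2 cycles; each rare op is emulated
# in S/w by a sequence of 30 simple instructions at 1.2 cycles each
risc = avg_cycles(FREQUENT_SHARE, frequent_cpi=1.2, rare_cost_cycles=30 * 1.2)

print(f"CISC average: {cisc:.2f} cycles per operation")
print(f"RISC average: {risc:.2f} cycles per operation")
```

With these assumed numbers the S/w emulation is almost twice as slow per rare operation (36 versus 20 cycles), yet the overall average still drops because rare operations make up only 5% of the mix.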

#### Architectural Distinctions between CISC & RISC

(a) CISC architecture with microprogrammed control & unified cache

- It uses a unified cache for holding both instructions & data. Therefore, they must share the same data/instruction path.
- Microprogrammed control is found in traditional CISC processors.
- Split caches & hardwired control are also used in modern CISC processors.
- Hardwired control will reduce the CPI effectively to one cycle per instruction if pipelining is carried out perfectly.



(b) RISC architecture with hardwired control & split instruction cache & data cache

- It uses separate instruction & data caches with different access paths.

- Hardwired control found in RISC.

- Split caches & hardwired control are not exclusive to RISC.

- The average CPI is less than 1.5.
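One way to see why the shared path of a unified cache matters is a toy cycle count. The sketch assumes a single-ported unified cache, so every load or store costs one extra cycle because its data access blocks an instruction fetch, while split caches serve both in the same cycle. This model and all names in it are assumptions for illustration, not a description of any particular processor.

```python
MEMORY_OPS = {"load", "store"}

def effective_cpi(trace, split_caches):
    """Average cycles per instruction for a list of opcode names."""
    cycles = 0
    for op in trace:
        cycles += 1                                  # ideal pipelined issue
        if op in MEMORY_OPS and not split_caches:
            cycles += 1                              # fetch stalls behind the data access
    return cycles / len(trace)

trace = ["load", "alu", "alu", "store"]              # hypothetical instruction mix
print("unified cache CPI:", effective_cpi(trace, split_caches=False))
print("split caches CPI: ", effective_cpi(trace, split_caches=True))
```

In this model the penalty grows with the share of memory-reference instructions, which is one reason split caches pair well with a load/store architecture.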

## Characteristics of CISC & RISC Architectures

- We compare the main features of RISC & CISC processors in 5 areas: instruction sets, addressing modes, register file & cache design, clock rate & expected CPI, & control mechanisms.

| Architectural characteristic                  | Complex Instruction Set Computer (CISC)                                                              | Reduced Instruction Set Computer (RISC).                                                  |
|-----------------------------------------------|------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------------|
| Instruction - Set Size & Instruction format   | Large set of instructions with variable formats (16-64 bits per instruction).                        | Small sets of instructions with fixed (32-bit) format & most register based instructions. |
| Addressing modes                              | 12-24                                                                                                | Limited to 3-5                                                                            |
| GPR (General purpose Register) & Cache design | 8-24 GPRs, mostly with a unified cache for instructions & data, recent designs also use split caches | Large numbers (32-192) of GPRs with mostly split data cache & instruction cache.          |
| Clock rate & CPI                              | 33-50 MHz in 1992 with a CPI between 2 & 15                                                          | 50-150 MHz in 1993 with one cycle for almost all instructions & an average CPI < 1.5.     |
| CPU Control                                   | Most microcoded using control memory (ROM), but modern CISC also uses hardwired control.             | Most hardwired without control memory.                                                    |

## I CISC Scalar Processors

- A scalar processor executes with scalar data
- The simplest scalar processor executes integer instructions using fixed-point operands.
- More capable scalar processors execute both integer & floating-point operations.
- A modern scalar processor may possess both an integer unit & a floating-point unit in the same CPU.
- Based on a complex instruction set, a CISC scalar processor can be built either with a single chip or with multiple chips mounted on a processor board
- In the ideal case, a CISC scalar processor should have a performance equal to that of the base scalar processor.

### Representative CISC Processors

| Feature | Intel i486 | Motorola MC68040 | NS 32532 |
|---------|------------|------------------|----------|
| Instruction Set Size & Word length | 157 instructions, 32 bits | 113 instructions, 32 bits | 63 instructions, 32 bits |
| Addressing Modes | 12 | 18 | 9 |
| Integer Unit & GPRs | 32-bit ALU with 8 registers | 32-bit ALU with 16 registers | 32-bit ALU with 8 registers |
| On-chip Cache(s) & MMUs | 8-KB unified cache for both code & data | 4-KB code cache<br>4-KB data cache with separate MMUs | 512-B code cache<br>1-KB data cache |
| Floating-point unit, registers, & function units | On-chip with 8 FP registers, adder, multiplier, shifter | On-chip with 3 pipeline stages, 8 80-bit FP registers | Off-chip FPU<br>NS 32381, or<br>WTL 3164 |
| Pipeline Stages | 5 | 6 | 4 |
| Protection levels | 4 | 2 | 2 |
| Memory organization & TLB/ATC entries | Segmented paging with 4 KB/page & 32 entries in TLB | Paging with 4 or 8 KB/page, 64 entries in each ATC | Paging with 4 KB/page, 64 entries |
| Technology, clock rate, packaging, & year introduced | CHMOS IV, 25 MHz, 33 MHz, 1.2M transistors, 168 pins, 1989 | 0.8-μm HCMOS, 1.2M transistors, 20 MHz, 40 MHz, 179 pins, 1990 | 1.25-μm CMOS<br>370K transistors, 30 MHz,<br>175 pins, 1987 |
| Claimed performance | 24 MIPS at 25 MHz | 20 MIPS at 25 MHz,<br>30 MIPS at 30 MHz | 15 MIPS<br>at 30 MHz |
| Successors to watch | i586,<br>i686 | MC 68050,<br>MC 68060 | Unknown |

- 3 Representative CISC Scalar processors are listed above.
- The VAX8600 processor is built on a PC board.
- The i486 & M68040 are single-chip microprocessors.
- In any processor design, the designer attempts to achieve higher throughput in the processor pipelines.


- Due to the complexity involved in a CISC scalar processor, the most difficult task for a designer is to shorten the clock cycle to match the simple operation latency.
- This problem is easier to overcome with a RISC architecture.



CPU - Central Processor Unit

TLB - Translation Lookaside Buffer

GPR - General Purpose Register

- The VAX 8600 implements a typical CISC architecture with microprogrammed control.
- The instruction set contains about 300 instructions with 20 different addressing modes.

- VAX 8600 as shown in fig executes the same instruction set, runs the same VMS operating system, & interfaces with the same I/O buses as VAX 11/780.
- The CPU in the VAX 8600 consists of 2 functional units for concurrent execution of integer & floating-point instructions.
- The unified cache is used for holding both instructions & data.
- There are 16 GPR's in the instruction unit.
- Instruction pipelining has been built with six stages in VAX 8600.
- The instruction unit prefetches & decodes instructions, handles branching operations & supplies operands to the 2 functional units in a pipelined fashion.
- A translation lookaside buffer (TLB) is used in the memory control unit for fast generation of a physical address from a virtual address.
- Both integer & floating-point units are pipelined.
- The performance of the processor pipelines relies heavily on the cache hit ratio & on minimal branching damage to the pipeline flow.
- The CPI of a VAX 8600 instruction varies within a wide range from 2 cycles to as high as 20 cycles.

Example: Both multiply & divide may tie up the execution unit for a large number of cycles.

- This is caused by the use of long sequences of microinstructions to control H/w operations.
- The general philosophy of designing a CISC processor is to implement useful instructions in H/w/Firmware which may result in a shorter program length with a lower S/w overhead.
- This advantage has been obtained at the expense of a lower clock rate & a higher CPI which may not pay off at all.
- The VAX 8600 was improved from the earlier VAX/11 Series.
- The System was later further upgraded to the VAX 9000 Series offering both vector H/w & multiprocessor options.
- All the VAX Series have used a paging technique to allocate the physical memory to user programs.

### CISC Microprocessor Families

- The Intel 4004 appeared as the 1<sup>st</sup> microprocessor, based on a 4-bit ALU. It was followed by the 8-bit 8008, 8080 & 8085, the 16-bit 8086, 8088, 80186 & 80286, and the 32-bit 80386, 80486 & 80586, the latest micros in the Intel 80x86 family.
- Motorola produced its first 8-bit micro, the MC6800, followed by the 16-bit 68000 and the 32-bit 68020. The MC68030 & MC68040 are the latest in the Motorola MC680x0 family.

Example 2: The Motorola MC68040 microprocessor architecture (scalar CISC family)



Fig: Architecture of MC68040 processor

(high-density cores)

EA - Effective Address

FA - Final Address

- The MC68040 is a 0.8µm HCMOS microprocessor containing more than 1.2 million transistors, comparable to the i80486.
- Fig Shows the MC68040 architecture.
- The processor implements over 100 instructions using 16 GPRs, a 4-KB data cache, & a 4-KB instruction cache, with separate memory management units (MMUs) supported by an address translation cache (ATC), which is equivalent to the TLB used in other systems.

- The data formats range from 8 to 80 bits, based on IEEE floating-point Standard.
- 18 Addressing modes are supported, including register direct & indirect, indexing, memory indirect, program counter indirect, absolute & immediate modes.
- The instruction set includes data movement, integer, BCD, & floating point arithmetic, logical, shifting, bit-field manipulation, cache maintenance, & multiprocessor communications, in addition to program & system control & memory management instructions.
- The integer unit is organized in a Six-Stage instruction pipeline.
- The floating point unit consists of 3 pipeline stages.
- All instructions are decoded by the integer unit.
- Floating point instructions are forwarded to the floating-point unit for execution.
- Separate instruction & data buses are used to & from the instruction & data memory units, respectively.
- Dual MMU's allow interleaved fetch of instructions & data from the main memory.
- Both the address bus & the data bus are 32 bits wide.
- 3 Simultaneous memory requests can be generated by the dual MMU's including data operand read & write & instruction pipeline refill.

- Snooping logic is built into the memory unit for monitoring bus events for cache invalidation.
- The complete memory management is provided with a virtual demand paged OS.
- Each of the 2 ATC's has 64 entries providing fast translation from virtual address to physical address.
- With the CISC complexity involved, the M68040 does not provide delayed branch H/w support, which is often found in RISC processors like Motorola's M88100 microprocessor.
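To make the role of the ATC (or TLB) concrete, here is a toy model of address translation with 4 KB pages and 64 entries, the figures quoted for the MC68040. The least-recently-used replacement policy, the class name and the page table contents are assumptions made for the sketch; the notes do not say how the real ATC replaces entries.

```python
from collections import OrderedDict

PAGE_SIZE = 4096      # 4 KB per page
ATC_ENTRIES = 64      # entries in one ATC

class ATC:
    """Toy address translation cache: virtual page number -> physical frame number."""

    def __init__(self, page_table):
        self.page_table = page_table      # hypothetical full table held in memory
        self.entries = OrderedDict()      # order tracks least-recent use
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.hits += 1
            self.entries.move_to_end(vpn)             # mark as most recently used
        else:
            self.misses += 1                          # slow path: consult the page table
            if len(self.entries) >= ATC_ENTRIES:
                self.entries.popitem(last=False)      # evict least recently used (assumed policy)
            self.entries[vpn] = self.page_table[vpn]
        return self.entries[vpn] * PAGE_SIZE + offset

page_table = {vpn: vpn + 100 for vpn in range(256)}   # made-up mapping
atc = ATC(page_table)
first = atc.translate(0x1234)     # miss: virtual page 1 is loaded into the ATC
second = atc.translate(0x1238)    # hit: same page, different offset
print(hex(first), hex(second), atc.hits, atc.misses)
```

Only the first access to a page pays for the page-table lookup; later accesses to the same page are translated directly from the cache.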

## II RISC Scalar Processors

- Generic RISC processors are called scalar RISC because they are designed to issue one instruction per cycle, similar to the base scalar processor shown in ideal case of pipelining.
- In theory both RISC & CISC scalar processors should perform about the same if they run with the same clock rate & with equal program length.
- However these 2 assumptions are not always valid as the architecture affects the quality & density of code generated by compilers.
- The RISC design gains its power by pushing some of the less frequently used operations into S/w.
- The reliance on a good compiler is much more demanding in a RISC processor than in a CISC processor.


- Instruction-level parallelism is exploited by pipelining in both processor architectures.
- Without a high clock rate, a low CPI, and good compilation support, neither CISC nor RISC can perform well as designed.
- The simplicity introduced with a RISC processor may lead to the ideal performance of the base scalar machine modeled in ideal pipelining case.

Representative RISC Scalar Processors

| Feature | Sun SPARC<br>CY7C601 | Intel<br>i860 | Motorola<br>M 88100 | AMD<br>29000 |
|---------|----------------------|---------------|---------------------|--------------|
| Instruction set, formats, addressing modes | 69 insts, 32-bit format, 7 data types, 4-stage instr. pipeline | 82 insts, 32-bit format, 4 addressing modes | 51 insts. & data types, 3 instr. formats, 4 addressing modes | 112 insts, 32-bit format, all registers indirect addressing |
| Integer unit, GPRs | 32-bit RISC, 136 registers divided into 8 windows | 32-bit RISC core, 32 registers | 32-bit IU with 32 GPRs & scoreboard | 32-bit IU with 192 registers without windows |
| Cache(s), MMU & memory organization | Off-chip cache/MMU on CY7C604 with 64-entry TLB | 4-kB code, 8-kB data, on-chip MMU, paging with 4 kB/page | Off-chip M88200 caches/MMUs, segmented paging, 16-kB cache | On-chip MMU with 32-entry TLB, with 4-word prefetch buffer & 512-B branch target cache |
| FPU registers & functions | Off-chip FPU on CY7C602, 32 registers, 64-bit pipeline | On-chip 64-bit FP multiplier & FP adder with 32 FP registers, 3-D graphics unit | On-chip FPU adder | Off-chip FPU on AMD 29027 |
| Operation modes | Concurrent I/O & FPU operations | Allow dual insts & dual FP operations | Concurrent I/O, FPU & memory access with delayed branch | 4-stage pipeline processor |
| Technology, clock rate, packaging, & year | 0.8 μm CMOS IV, 33 MHz, 207 pins, 1989 | 1 μm CHMOS IV, over 1M transistors, 40 MHz, 168 pins, 1989 | 1 μm HCMOS, 1.2M transistors, 20 MHz, 180 pins, 1988 | 1.2 μm CMOS, 30 MHz, 40 MHz, 169 pins, 1988 |
| Claimed performance & successor to watch | 24 MIPS for 33 MHz version, 50 MIPS for 80 MHz ECL version, up to 32 register windows can be built | 40 MIPS & 60 MFlops for i860/XP announced in 1992 with 2.5M transistors | 17 MIPS & 6 MFlops at 20 MHz, up to 7 special function units can be configured | 27 MIPS at 40 MHz, new version AMD 29050 at 55 MHz in 1990 |

### Representative RISC Processors

- 4 RISC-based processors, the Sun SPARC, Intel i860, Motorola M88100, & AMD 29000, are summarized in the table above.
- All of these processors use 32-bit instructions.
- The instruction sets consist of 51 to 124 basic instructions.
- On-chip floating-point units are built into the i860 & M88100, while the SPARC & AMD use off-chip floating-point units.

- We consider these 4 processors generic scalar RISC, issuing essentially only one instruction per pipeline cycle.
- We examine Sun SPARC & i860 architectures.
- SPARC stands for scalable processor architecture.
- The scalability of the SPARC architecture refers to the use of a different number of register windows in different SPARC implementations.
- This is different from the M88100, where scalability refers to the number of special function units (SFUs) implementable on different versions of M88000 processor.
- The Sun SPARC is derived from original Berkeley RISC design.

### Example 1: The Sun Microsystems SPARC architecture

- The SPARC has been implemented by a number of licensed manufacturers as given in table below.
- Different technologies & window numbers are used by different SPARC manufacturers.

| SPARC Chip             | Technology                  | Clock Rate (MHz) | Claimed VAX MIPS |
|------------------------|-----------------------------|------------------|------------------|
| Cypress CY7C601        | 0.8 μm CMOS IV,<br>207 pins | 33               | 24               |
| Fujitsu MB<br>86901 IU | 1.2 μm<br>CMOS, 179 pins    | 25               | 15               |
| LSI Logic<br>L64811    | 1.0 μm<br>HCMOS, 179 pins   | 33               | 20               |
| TI 8846                | 0.8 μm CMOS                 | 33               | 24               |
| BIT IU B-3100          | ECL Family                  | 80               | 50               |

- All of these manufacturers implement the floating-point unit (FPU) on a separate co-processor chip.
- The SPARC processor architecture contains essentially a RISC integer unit (IU) implemented with 2 to 32 register windows.
- The SPARC family chips produced by Cypress Semiconductor serve as an example: the fig below shows the architecture of the Cypress CY7C601 SPARC processor & of the CY7C602 FPU.
- The Sun SPARC instruction set contains 69 basic instructions, a significant increase from the 39 instructions in the original Berkeley RISC I instruction set.


## SPARC architecture with processor & FPU on 2 Separate Chips



Fig ⑤ The Cypress CY7C601 SPARC processor



Fig ⑥ Cypress CY7C602 FPU

- The SPARC runs each procedure with a set of thirty-two 32-bit IU registers.

- Eight of these registers are global registers shared by all procedures, & the remaining 24 are window registers associated with each procedure only.

- The concept of using overlapped register windows is the most important feature introduced by Berkeley RISC architecture.



Fig: 3 Overlapping register windows & the global registers (Window invalid mark)



Fig: 8 register windows forming a circular stack.

- The concept is illustrated in the fig above for 8 overlapping windows (formed with 64 local registers & 64 overlapped registers) & eight globals, with a total of 136 registers, as implemented in the Cypress 601.

- Each register Window is divided into three 8-register Sections, labeled Ins, Locals, & Outs.
- The Local registers are only locally addressable by each procedure.
- The Ins & Outs are Shared among procedures.
- The calling procedure passes parameters to the called procedure via its Outs (r8 to r15) registers, which are the Ins registers of the called procedure.
- The window of the currently running procedure is called the active window & is pointed to by a current window pointer.
- A window invalid mask is used to indicate which window is invalid.
- The trap base-register serves as a pointer to a trap handler.
- A special register is used to create a 64-bit product in multiply-step instructions.
- Procedures can also be called without changing the window.
- The Overlapping windows can significantly save the time required for interprocedure communications, resulting in much faster context switching among co-operative procedures.
- The FPU features 32 single-precision (32-bit) or 16 double-precision (64-bit) floating-point registers.

- Fourteen of the 69 SPARC instructions are for floating-point operations.
- The SPARC architecture implements 3 basic instruction formats all using a single word length of 32 bits.
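
The overlap between a caller's Outs and the callee's Ins can be made concrete with a small address-mapping sketch. This is only an illustration under stated assumptions: the function names `physical_index` and `call`, the physical layout of the register file, and the choice that a call decrements the window pointer are picked for the example and are not taken from the notes. Only the counts (8 windows, 8 globals, 136 registers) come from the text above.

```python
# Sketch of overlapping register windows (layout constants are assumptions
# chosen to match the 8-window, 136-register Cypress 601 example).
NUM_WINDOWS = 8          # windows arranged as a circular stack
NUM_GLOBALS = 8          # r0..r7, shared by all procedures
REGS_PER_STEP = 16       # each window contributes 8 Outs + 8 Locals of its own
WINDOWED_REGS = NUM_WINDOWS * REGS_PER_STEP   # 128 windowed registers


def physical_index(cwp: int, r: int) -> int:
    """Map logical register r (0..31) of window cwp to a physical register.

    Logical layout: r0-r7 globals, r8-r15 Outs, r16-r23 Locals, r24-r31 Ins.
    """
    if not 0 <= r < 32:
        raise ValueError("logical register must be in 0..31")
    if r < NUM_GLOBALS:
        return r  # globals ignore the window pointer
    offset = (cwp % NUM_WINDOWS) * REGS_PER_STEP + (r - NUM_GLOBALS)
    return NUM_GLOBALS + offset % WINDOWED_REGS


def call(cwp: int) -> int:
    """A procedure call moves to the adjacent window (assumed: decrement)."""
    return (cwp - 1) % NUM_WINDOWS


caller = 3
callee = call(caller)
# The caller's Outs r8..r15 are the same physical registers as the callee's Ins r24..r31.
shared = [physical_index(caller, 8 + i) == physical_index(callee, 24 + i)
          for i in range(8)]
total = len({physical_index(w, r) for w in range(NUM_WINDOWS) for r in range(32)})
print(all(shared), total)   # prints: True 136
```

No data is copied on a call in this model: parameter passing costs only a change of the window pointer, which is why overlapped windows speed up interprocedure communication.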

### Example 2: Intel i860 processor architecture



Fig: Functional Units & data paths of the Intel i860 RISC microprocessor.

- In 1989, Intel Corporation introduced the i860 microprocessor.
- It is a 64-bit RISC processor fabricated on a single chip containing more than 1 million transistors.
- The peak performance of the i860 was designed to achieve 80 MFlops single precision or 60 MFlops double precision, or 40 MIPS in 32-bit integer operations at a 40-MHz clock rate.
- A schematic block diagram of major components in the i860 is shown in fig above.
- There are 9 functional units (shown in 9 boxes) interconnected by multiple data paths with widths ranging from 32 to 128 bits.
- All external or internal address buses are 32-bit wide, & the external data path or internal data bus is 64 bits wide.
- However the internal RISC integer ALU is only 32 bits wide.
- The instruction cache has 4 Kbytes organized as a 2 way set-associative memory with 32 bytes per cache block.
- It transfers 64 bits per clock cycle, equivalent to 320 Mbytes/s at 40 MHz.
- The data cache is a 2 way set-associative memory of 8 Kbytes.
- It transfers 128 bits per clock cycle (640 Mbytes/s) at 40 MHz.

- A write-back policy is used. Caching can be inhibited by S/w, if needed.
- The bus control unit coordinates the 64-bit data transfer between the chip & the outside world.
- The MMU implements protected 4-Kbyte paged virtual memory of $2^{32}$ bytes via a TLB.
- The paging & MMU structure of the i860 is identical to that implemented in the i486.
- An i860 & an i486 can be used jointly in a heterogeneous multiprocessor system, permitting the development of compatible OS kernels.
- The RISC integer unit (IU) executes load, store, integer, bit, & control instructions & fetches instructions for the floating-point control unit as well.
- There are 2 floating-point units, namely, the multiplier unit & the adder unit which can be used separately or simultaneously under the co-ordination of the floating point control unit.
- Special dual-operation floating-point instructions such as add-and-multiply & subtract-and-multiply use both the multiplier & adder units in parallel (fig below).
- Both the integer unit & floating-point control unit can execute concurrently.
- The i860 is also a Superscalar RISC processor capable of executing 2 instructions, one integer & one floating point at the same time.



(a) Dual-Instruction mode transitions.



(b) Dual operations in floating point units .

- The floating point unit conforms to the IEEE 754 floating-point standard, operating with single-precision (32-bit) & double-precision (64-bit) operands.
- The graphics unit executes integer operations corresponding to 8-, 16-, or 32-bit pixel data types.
- This unit supports 3-dimensional drawing in a graphics frame buffer with color intensity, shading & hidden surface elimination.
- The merge register is used only by vector integer instructions.
- This register accumulates the results of multiple addition operations.
- The i860 executes 82 instructions, including 42 RISC integer, 24 floating-point, 10 graphics, & 6 assembler pseudo operations.
- All the instructions execute in one cycle, which equals 25 ns for a 40-MHz clock rate.
- The i860 & its successor, the i860XP, are used in floating-point accelerators, graphics subsystems, workstations, multiprocessors & multicomputers.
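
The cache bandwidths and the cycle time quoted for the i860 follow directly from the bus widths and the 40-MHz clock. The short check below reproduces that arithmetic; the helper names are made up for this example, and 1 Mbyte is taken as $10^6$ bytes.

```python
# Back-of-the-envelope check of the i860 figures quoted in the notes.
CLOCK_HZ = 40e6  # 40-MHz clock rate, from the text


def bandwidth_mbytes_per_s(bits_per_cycle: int, clock_hz: float = CLOCK_HZ) -> float:
    """Bytes moved per cycle times cycles per second, in Mbytes/s (10**6 bytes)."""
    return (bits_per_cycle / 8) * clock_hz / 1e6


def cycle_time_ns(clock_hz: float = CLOCK_HZ) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


icache_bw = bandwidth_mbytes_per_s(64)    # instruction cache: 64 bits per cycle
dcache_bw = bandwidth_mbytes_per_s(128)   # data cache: 128 bits per cycle
print(icache_bw, dcache_bw, cycle_time_ns())   # prints: 320.0 640.0 25.0
```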

## RISC IMPACTS

- RISC will outperform CISC if the program length does not increase dramatically.
- Based on one reported experiment, converting from a CISC program to an equivalent RISC program increases the code length by only 40%.

The increase depends on program behavior, & 40% increase may not be typical of all programs.

- Nevertheless, the increase in code length is much smaller than the increase in clock rate & reduction in CPI.
- Thus the intuitive reasoning should prevail in both cases.
- A RISC processor lacks some sophisticated instructions found in CISC processors.
- The increase in RISC program length implies more instruction traffic & greater memory demand.
- Another RISC problem is caused by the use of a large register file.
- Although a larger register set can hold more intermediate results & reduce the data traffic between the CPU & memory, the register decoding system will be more complicated.
- Longer register access time places a greater demand on the compiler to manage the register window functions.
- Another shortcoming of RISC lies in its hardwired control, which is less flexible & more error-prone.
- RISC shortcomings are directly related to some of its claimed advantages.
- More benchmark & application experiments are needed to determine the optimal sizes of the register set, I-cache, & D-cache.

- Further processor improvements may include a 64-bit integer ALU, multiprocessor support such as Snoopy logic for cache coherence control, faster interprocessor synchronization or H/w support for message passing, & special-function units for I/O interface & graphics support.
- The boundary between RISC & CISC architectures has become blurred because both are now implemented with the same H/w technology.

Example: The VAX 9000, Motorola 88100, & i586 are built with mixed features taken from both the RISC & CISC camps.

- It is the applications that eventually determine the best choice of a processor architecture.
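
The trade-off between longer RISC code on one side and a lower CPI and faster clock on the other can be checked with the usual CPU-time relation $T = I_c \times CPI / f$. The sketch below is only an illustration: the instruction counts, CPI values and clock rates are hypothetical numbers, and only the 40% code-length increase is taken from the notes.

```python
def cpu_time(instr_count: float, cpi: float, clock_hz: float) -> float:
    """CPU time in seconds: T = Ic * CPI / f."""
    return instr_count * cpi / clock_hz


# Hypothetical machines (all figures are assumptions for illustration only).
cisc = cpu_time(instr_count=1.0e6, cpi=4.0, clock_hz=33e6)
# The RISC version of the program is assumed to be 40% longer,
# but it runs with a lower CPI and a higher clock rate.
risc = cpu_time(instr_count=1.4e6, cpi=1.5, clock_hz=50e6)

speedup = cisc / risc
print(f"CISC {cisc * 1e3:.1f} ms, RISC {risc * 1e3:.1f} ms, speedup {speedup:.2f}")
```

With these assumed numbers the RISC machine still finishes first, because the combined gain in CPI and clock rate outweighs the 40% growth in instruction count. Different assumptions can reverse the result, which is why the notes stress that the outcome depends on program behavior.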

### Superscalar & Vector Processors

- A CISC or a RISC scalar processor can be improved with a Superscalar or Vector architecture.
- Scalar processors are those executing one instruction per cycle.
- Only one instruction is issued per cycle, & only one completion of instruction is expected from the pipeline per cycle.
- In a superscalar processor, multiple instruction pipelines are used.
- This implies that multiple instructions are issued per cycle & multiple results are generated per cycle.

- A vector processor executes vector instructions on arrays of data.
- Thus each instruction involves a string of repeated operations, which are ideal for pipelining with one result per cycle.

### Superscalar Processors

- Superscalar processors are designed to exploit more instruction-level parallelism in user programs.
- Only independent instructions can be executed in parallel without causing a wait state.
- The amount of instruction-level parallelism varies widely depending on the type of code being executed.
- It has been observed that the average value is around 2 for code without loop unrolling.
- Therefore, for these codes there is not much benefit gained from building a machine that can issue more than 3 instructions per cycle.
- The instruction-issue degree in a superscalar processor has thus been limited to 2 to 5 in practice.
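
One simple way to see why the available parallelism is small is to divide the number of instructions by the length of the longest dependence chain. The sketch below does this for a hypothetical six-instruction fragment; the function name `average_ilp`, the dependence table and the assumptions of unit latency and unlimited issue width are illustrative choices, not a measurement method given in the notes.

```python
def average_ilp(deps: dict) -> float:
    """Instructions divided by the longest dependence chain (unit latencies).

    deps[i] lists the earlier instructions whose results instruction i needs.
    Instructions are assumed to be numbered in program order, and deps must
    not be empty.
    """
    depth = {}
    for i in sorted(deps):
        # An instruction can start one step after its latest producer.
        depth[i] = 1 + max((depth[j] for j in deps[i]), default=0)
    return len(deps) / max(depth.values())


# Hypothetical fragment: 0 and 1 are independent loads, 2 uses both,
# 3 is independent, 4 uses 2 and 3, and 5 uses 4.
example = {0: [], 1: [], 2: [0, 1], 3: [], 4: [2, 3], 5: [4]}
ilp = average_ilp(example)
print(ilp)   # prints: 1.5
```

Six instructions with a dependence chain of length four give an average parallelism of 1.5, so a machine issuing 3 or more instructions per cycle would mostly sit idle on code like this.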

### Pipelining in Superscalar Processors

- The fundamental structure of a superscalar pipeline is illustrated in the fig below.
- The diagram shows the use of 3 instruction pipelines in parallel for a triple-issue processor.

- Superscalar processors were originally developed as an alternative to vector processors.



Fig: A Superscalar processor of degree  $m=3$

- A Superscalar processor of degree  $m$  can issue  $m$  instructions per cycle.
- In this sense, the base scalar processor, implemented either in RISC or CISC has  $m=1$ .
- In order to fully utilize a superscalar processor of degree  $m$ ,  $m$  instructions must be executable in parallel.
- This situation may not be true in all clock cycles.
- Some of the pipelines may be stalling in a wait state.
- In a Superscalar processor, the simple operation latency should require only one cycle, as in base scalar processor.
- Due to the desire for a higher degree of instruction level parallelism in programs, the superscalar processor depends more on an optimizing compiler to exploit parallelism.

- In theory, a superscalar processor can attain the same performance as a machine with vector H/W.
- A superscalar machine that can issue a fixed-point, floating-point, load, & branch all in one cycle achieves the same effective parallelism as a vector machine which executes a vector load, chained into a vector add, with one element loaded & added per cycle.
- This will become more evident.
- A typical superscalar architecture for a RISC processor is shown in fig below.



- Multiple instruction pipelines are used.
- The instruction cache supplies multiple instructions per fetch.
- However, the actual number of instructions issued to various functional units may vary in each cycle.
- The number is constrained by data dependences & resource conflicts among instructions that are simultaneously decoded.
- Multiple functional units are built into the integer unit & into the floating-point unit.
- Multiple data buses exist among the functional units.
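
The benefit of issuing $m$ instructions per cycle can be estimated with a simple cycle count. The sketch below assumes ideal conditions that the notes warn are not always met: all instructions are independent, every stage takes one cycle, and no pipeline stalls. The stage count and instruction count are hypothetical values.

```python
import math


def pipeline_cycles(n_instr: int, stages: int, degree: int = 1) -> int:
    """Cycles to finish n independent instructions on an ideal pipeline.

    The first group of `degree` instructions needs `stages` cycles to drain;
    every further group completes one cycle later.
    """
    groups = math.ceil(n_instr / degree)
    return stages + groups - 1


N, K = 12, 4   # hypothetical: 12 instructions, 4-stage pipelines
scalar = pipeline_cycles(N, K, degree=1)   # base scalar processor, m = 1
triple = pipeline_cycles(N, K, degree=3)   # triple-issue processor, m = 3
print(scalar, triple)   # prints: 15 7
```

The speedup here is 15/7, a little over 2 rather than 3, because the pipeline fill time is not shortened by a higher issue degree. Data dependences and resource conflicts reduce it further in real programs.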

### Representative Superscalar Processors

- A number of commercially available processors have been implemented with the Superscalar architecture.
- Notable ones include the IBM RS/6000, DEC Alpha, & Intel i960CA processors, as summarized in the table below.
- Due to the reduced CPI & higher clock rates used, most superscalar processors outperform scalar processors.
- The maximum number of instructions issued per cycle ranges from 2 to 5 in these superscalar processors.
- Typically, the register files in the IU & FPU each have 32 registers.
- The superscalar degree is low in most cases due to the limited instruction parallelism that can be exploited in ordinary programs.


Table: Representative Superscalar Processors

| Feature | Intel i960CA | IBM RS/6000 | DEC 21064 |
|---------|--------------|-------------|-----------|
| Technology, clock rate, year | 20 MHz, 1986 | 1-μm CMOS, 30 MHz, 1990 | 0.75-μm CMOS, 150 MHz, 431 pins, 1992 |
| Functional units & multiple instruction issue | Issues up to 3 instructions (register, memory & control) per cycle; seven functional units available for concurrent use | POWER architecture; issues 4 instructions (1 FXU, 1 FPU, & 2 ICU operations) per cycle | Alpha architecture; issues 2 instructions per cycle; 64-bit IU & FPU; 128-bit data bus & 34-bit address bus implemented in initial version |
| Registers, cache, MMU, address space | 1-KB I-cache, 1.5-KB RAM, 4-channel I/O with DMA, parallel decode, multiplexed registers | 32 32-bit GPRs, 8-KB I-cache, 64-KB D-cache with separate TLBs | 32 64-bit GPRs, 8-KB I-cache, 8-KB D-cache, 64-bit virtual space designed, 43-bit address space implemented in initial version |
| Floating-point unit & functions | On-chip FPU, fast multimode interrupt, multitask | On-chip FPU; 64-bit multiply, add, divide, subtract; IEEE 754 standard | On-chip FPU, 32 64-bit FP registers, 10-stage pipeline, IEEE & VAX FP standards |
| Claimed performance & remarks | 30 VAX MIPS peak at 25 MHz; real-time embedded system control & multiprocessor applications | 34 MIPS & 11 MFlops at 25 MHz on POWERstation 530 | 300 MIPS peak & 150 MFlops peak at 150 MHz; multiprocessor & cache coherence support |

- Besides the Register files, reservation stations & reorder buffers can be used to establish instruction windows.
- The purpose is to support instruction lookahead & internal data forwarding, which are needed to schedule multiple instructions through the multiple pipelines simultaneously.

### Example 1: IBM RS/6000 architecture



- In 1990, IBM announced the RISC System/6000.
- It is a superscalar processor, as illustrated in the fig above.
- There are 3 functional units, called the branch processor, the fixed-point unit (FXU) & the floating-point unit (FPU), which can operate in parallel.
- The branch processor can arrange the execution of up to 5 instructions per cycle.
- These include one branch instruction in the branch processor, one fixed point instruction in the FXU, one condition register instruction in the branch processor, & one floating point multiply add instruction in the FPU, which can be counted as 2 floating point operations.
- The RS/6000 is hardwired rather than microcoded.
- The System uses a number of wide buses ranging from one word (32 bits) for the FXU to 2 words (64 bits) for the FPU, & 4 words for I-Cache & D-Cache respectively.
- These wide buses provide the high instruction & data bandwidths required for Superscalar implementation.
- The RS/6000 design is optimized to perform well in numerically intensive scientific & engineering applications as well as in multiuser commercial environments.
- A number of RS/6000 based workstations & servers are produced by IBM.

Example: The POWERstation 530 has a clock rate of 25 MHz, with performance benchmarks reported as 34.5 MIPS & 10.9 MFlops.

### The VLIW Architecture

- The VLIW architecture is generalized from two well-established concepts:
  - Horizontal microcoding
  - Superscalar processing
- A typical VLIW (Very Long Instruction Word) machine has instruction words hundreds of bits in length.
- As illustrated in fig (a) below, multiple functional units are used concurrently in a VLIW processor.



Fig(a). A typical VLIW processor & Instruction Format

- All functional units share the use of a common large register file.
- The operations to be simultaneously executed by the functional units are synchronized in a VLIW instruction of, say, 256 or 1024 bits per instruction word, as implemented in the Multiflow computer models.
- The VLIW concept is borrowed from horizontal microcoding.
- Different fields of the long instruction word carry the opcodes to be dispatched to different functional units.
- Programs written in conventional short instruction words (say 32 bits) must be compacted together to form the VLIW instructions.
- This code compaction must be done by a compiler which can predict branch outcomes using elaborate heuristics or run-time statistics.

### Pipelining in VLIW Processors



Fig: VLIW execution with degree  $m = 3$ .

- The execution of instructions by an ideal VLIW processor is shown in above fig.
- Each instruction specifies multiple operations.
- The effective CPI becomes $\frac{1}{3} \approx 0.33$ in this particular example.

- VLIW machines behave much like superscalar machines, with 3 differences:
- First, the decoding of VLIW instructions is easier than that of Superscalar instructions.
- Second, the code density of the Superscalar machine is better when the available instruction-level parallelism is less than that exploitable by the VLIW machine.
- This is because the fixed VLIW format includes bits for nonexecutable operations, while the Superscalar processor issues only executable instructions.
- Third, a Superscalar machine can be object-code-compatible with a large family of nonparallel machines.
- On the contrary, a VLIW machine exploiting different amounts of parallelism would require different instruction sets.
- Instruction parallelism & data movements in a VLIW architecture are completely specified at compile time.
- Run-time resource scheduling & synchronization are thus completely eliminated.
- One can view a VLIW processor as an extreme of a Superscalar processor in which all independent or unrelated operations are already synchronously compacted together in advance.
- The CPI of a VLIW processor can be even lower than that of a Superscalar processor.

Example: The Multiflow Trace computer allows up to seven operations to be executed concurrently with 256 bits per VLIW instruction.
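
The code compaction step, and the reason a fixed VLIW format wastes bits on non-executable slots, can be illustrated with a toy packer. This is a minimal sketch under simplifying assumptions: a greedy placement rule, three identical slots per word, only true data dependences, and no branches. The function `compact` and the operation names are hypothetical and do not describe the Multiflow compiler.

```python
NOP = "nop"


def compact(ops, deps, slots=3):
    """Greedy compaction of short operations into fixed-width VLIW words.

    ops  : operation names in program order
    deps : deps[name] = names that must be issued in an earlier word
    An operation is placed in the earliest word that comes after all its
    producers and still has a free slot; unused slots are padded with NOPs.
    """
    word_of = {}   # operation -> index of the word that holds it
    words = []     # each word is a list of operation names
    for op in ops:
        earliest = 1 + max((word_of[d] for d in deps.get(op, [])), default=-1)
        w = earliest
        while w < len(words) and len(words[w]) >= slots:
            w += 1                      # that word is full, try the next one
        while len(words) <= w:
            words.append([])
        words[w].append(op)
        word_of[op] = w
    return [word + [NOP] * (slots - len(word)) for word in words]


ops = ["ld_a", "ld_b", "add", "ld_c", "mul", "st"]
deps = {"add": ["ld_a", "ld_b"], "mul": ["add", "ld_c"], "st": ["mul"]}
program = compact(ops, deps)
for word in program:
    print(word)
nop_slots = sum(word.count(NOP) for word in program)
```

For this fragment the six operations fit into four long words, and half of the twelve slots hold NOPs. The dependence chain, not the word width, limits the speed here, which matches the point above about code density when the available parallelism is lower than the machine can exploit.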

### VLIW Opportunities

- In a VLIW architecture, random parallelism among scalar operations is exploited instead of regular or synchronous parallelism as in a vectorized supercomputer or in an SIMD computer.
- The success of a VLIW processor depends heavily on the efficiency in code compaction.
- The architecture is totally incompatible with that of any conventional general-purpose processor.
- The instruction parallelism embedded in the compacted code may require a different latency to be executed by different functional units even though the instructions are issued at the same time.
- Therefore, different implementations of the same VLIW architecture may not be binary-compatible with each other.
- By explicitly encoding parallelism in the long instruction, a VLIW processor can eliminate the H/w or S/w needed to detect parallelism.
- The main advantage of the VLIW architecture is its simplicity in H/w structure & instruction set.

- The VLIW processor can potentially perform well in Scientific applications where the program behaviour (branch predictions) is more predictable.
- In general-purpose applications, the architecture may not be able to perform well.
- Due to its lack of compatibility with conventional H/w & S/w, the VLIW architecture has not entered the mainstream of computers.

### Vector & Symbolic Processors

- A vector processor is a co-processor specially designed to perform vector computations.
- A vector instruction involves a large array of operands.
- In other words, the same operation will be performed over a string of data.
- Vector processors are often used in a multipipelined supercomputer.
- A vector processor can assume either a register-to-register architecture or memory-to-memory architecture.
- The former uses shorter instructions & vector register files.
- The latter uses memory-based instructions which are longer in length, including memory addresses.

### Vector Instructions

- Register based vector instructions appear in most register-to-register vector processors like Cray Supercomputers.
- Denote a vector register of length  $n$  as  $V_i$ , a scalar register as  $S_i$ , & a memory array of length  $n$  as  $M(1:n)$ .
- Typical register-based vector operations are listed below, where a vector operator is denoted by a small circle "$\circ$":

$V_1 \circ V_2 \rightarrow V_3$  (binary Vector)

$S_1 \circ V_1 \rightarrow V_2$  (Scaling)

$V_1 \circ V_2 \rightarrow S_1$  (binary reduction)

$M(1:n) \rightarrow V_1$  (vector load)

$V_1 \rightarrow M(1:n)$  (vector store)

$\circ V_1 \rightarrow V_2$  (unary vector)

$\circ V_1 \rightarrow S_1$  (unary reduction)

- It should be noted that the vector length should be equal in all operands used in a vector instruction.
- The reduction is an operation on one or 2 vector operands, & the result is a scalar, such as the dot product between 2 vectors or the maximum of all components in a vector.

- In all cases, these vector operations are performed by dedicated pipeline units, including functional pipelines & memory-access pipelines.
- Long vectors exceeding the register length $n$ must be segmented to fit the vector registers $n$ elements at a time.
- Memory-based vector operations are found in memory-to-memory vector processors such as the Cyber 205.
- Listed below are a few examples:

$$M_1(1:n) \circ M_2(1:n) \rightarrow M(1:n)$$

$$S_1 \circ M_1(1:n) \rightarrow M_2(1:n)$$

$$\circ M_1(1:n) \rightarrow M_2(1:n)$$

$$M_1(1:n) \circ M_2(1:n) \rightarrow M(k)$$

where  $M_1(1:n)$  &  $M_2(1:n)$  are 2 vectors of length  $n$  &  $M(k)$  denotes a scalar quantity stored in memory location  $k$ .

- Note that the vector length is not restricted by register length.
- Long vectors are handled in a streaming fashion using Superwords cascaded from many short memory words.

### Vector Pipelines

- Vector processors take advantage of unrolled-loop-level parallelism.
- The vector pipelines can be attached to any scalar processor, whether it is superscalar, superpipelined or both.
- Dedicated vector pipelines will eliminate some S/w overhead in looping control. The effectiveness of a vector processor relies on the capability of an optimizing compiler that vectorizes sequential code for vector pipelining.



Fig(a) Scalar pipeline execution



Fig(b) Vector pipeline execution

- The pipelined execution in a vector processor is compared with that in a scalar processor in above fig.
- Fig(a) redraws the scalar pipeline, in which each scalar instruction executes only one operation over one data element.
- Only serial issue & parallel execution of vector instructions are shown in fig(b).
- Each vector instruction executes a string of operations, one for each element in the vector.


### Symbolic Processors

- Symbolic processing has been applied in many areas, including theorem proving, pattern recognition, expert systems, knowledge engineering & text retrieval. In these areas, knowledge representations, primitive operations, algorithmic behavior, memory, I/O & communications, & special architectural features differ from those in numerical computing.
- Symbolic processors have also been called Prolog processors, Lisp processors or symbolic manipulators.
- Example : A Lisp program can be viewed as a set of functions in which data are passed from function to function.
  - The concurrent execution of these functions forms the basis for parallelism.
  - The applicative & recursive nature of Lisp requires an environment that efficiently supports stack computations & function calling.
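
The view of a Lisp program as nested function applications that rely on stack computations can be sketched with a tiny evaluator. This is an illustration in Python, not Lisp and not the way any Lisp machine is implemented: expressions are written as nested tuples, only three arithmetic primitives are supported, and the names `evaluate` and `PRIMITIVES` are hypothetical.

```python
import operator

PRIMITIVES = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr, depth=0, trace=None):
    """Recursively evaluate a Lisp-style prefix expression given as nested tuples.

    `trace` collects the call depth of every function application, which
    mirrors how deep the evaluation stack grows.
    """
    if not isinstance(expr, tuple):
        return expr                       # an atom evaluates to itself
    name, *args = expr
    # The arguments are independent of each other, so they could in
    # principle be evaluated concurrently; here they run one after another.
    values = [evaluate(a, depth + 1, trace) for a in args]
    if trace is not None:
        trace.append((name, depth))
    result = values[0]
    for v in values[1:]:
        result = PRIMITIVES[name](result, v)
    return result


calls = []
value = evaluate(("+", ("*", 2, 3), ("-", 10, 4)), trace=calls)
print(value, calls)   # prints: 12 [('*', 1), ('-', 1), ('+', 0)]
```

The two inner applications do not depend on each other, which is the kind of function-level concurrency the text refers to, while the recursion depth shows why efficient stack support matters.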

Table : Characteristics of Symbolic Processor.

| Attributes                | Characteristics |
|---------------------------|-----------------|
| Knowledge representations | Lists, relational databases, scripts, semantic nets, frames, blackboards, objects, production systems. |
| Common operations         | Search, sort, pattern matching, filtering, contexts, partitions, transitive closures, unification, text retrieval, set operations, reasoning. |
| Memory requirements       | Large memory with intensive access pattern. Addressing is often content-based. Locality of reference may not hold. |
| Properties of algorithms  | Nondeterministic, possibly parallel & distributed computation. Data dependences may be global & irregular in pattern & granularity. |
| Communication patterns    | Message traffic varies in size & destination; granularity & format of message units change with applications. |
| I/O requirements          | User-guided programs; intelligent person-machine interfaces; inputs can be graphical & audio as well as from keyboard; access to very large on-line databases. |
| Architecture features     | Parallel update of large knowledge bases, dynamic load balancing, dynamic memory allocation, H/w-supported garbage collection, stack processor architecture, symbolic processors. |

- The table above summarizes the major characteristics of symbolic processing.

- Instead of dealing with numerical data, symbolic processing deals with logic programs, objects, scripts, blackboards, production systems, semantic networks, frames & artificial neural networks. Primitive operations include search, compare, logic inference, pattern matching, unification, filtering, context, retrieval, set operations, transitive closure & reasoning.

### Example: The Symbolics 3600 Lisp Processor

Fig: Architecture of the Symbolics 3600 Lisp processor

- The processor architecture of the Symbolics 3600 is shown in fig above.
- This is a stack-oriented machine.
- The division of the overall machine architecture into layers allows the use of a pure stack model to simplify instruction-set design, while implementation is carried out with a stack-oriented machine.
- Nevertheless, most operands are fetched from the stack, so the stack buffer & scratch-pad memories are implemented as fast caches to main memory.

- The Symbolics 3600 executes most Lisp instructions in one machine cycle.
- Integer instructions fetch operands from the stack buffer & the duplicate top of the stack in the scratch-pad memory.
- Floating-point addition, garbage collection, data type checking by the tag processor, & fixed-point addition can be carried out in parallel.