

A research team is designing a vision-based drone navigation system that can make quick obstacle-avoidance decisions with minimal power consumption.

Explain how **neuromorphic computing** can be applied in this scenario and how it mimics the **efficiency of the human brain** in processing visual information.

**Answer:**

Neuromorphic computing can be effectively applied in a **vision-based drone navigation system** to enable **real-time obstacle detection and avoidance** while consuming very low power.

In this approach, the drone's onboard processor is designed using **neuromorphic chips** that mimic the **structure and function of the human brain**, particularly **neurons and synapses**.

Instead of processing data sequentially like conventional processors, neuromorphic systems use **spiking neural networks (SNNs)** where information is transmitted as electrical spikes — similar to the way biological neurons communicate.

This design allows:

- **Event-driven processing** – the system reacts only when new visual input (like an obstacle) is detected, reducing unnecessary computations.
- **Parallel information flow** – enabling faster visual pattern recognition and decision-making.
- **Low power consumption** – since neuromorphic chips operate asynchronously and process sparse data efficiently.

In the drone scenario, the onboard neuromorphic vision sensor (e.g., **Intel Loihi** or **IBM TrueNorth**) can continuously analyze visual data, detect motion or obstacles, and take immediate action (change path or stop) — just as the **human brain** quickly processes visual cues to avoid collisions.

Neuromorphic computing mimics the brain's visual processing efficiency through **spike-based parallel computation**, allowing **drones to make fast, energy-efficient obstacle avoidance decisions** in real time.

**Q1.**

A research team is designing a vision-based drone navigation system that can make quick obstacle-avoidance decisions with minimal power consumption.

Explain how **neuromorphic computing** can be applied in this scenario and how it mimics the **efficiency of the human brain** in processing visual information.

**Q2.**

Modern data centers consume massive energy for AI inference tasks. A company wants to design energy-efficient AI chips inspired by the brain's synaptic structure.

Describe how **neuromorphic architectures** (like spiking neural networks) can address this problem and improve computational efficiency compared to traditional von Neumann systems.

**Q3.**

In autonomous vehicles, rapid decision-making with low latency is critical for safety.

Discuss how neuromorphic processors can enhance **real-time decision-making** by mimicking biological neurons and synapses. Mention one **real-world neuromorphic chip** as an example.

**Q4.**

Healthcare devices like smart prosthetics and neural implants require adaptive learning with very low power.

Explain how **neuromorphic computing principles** can be used in such **bio-inspired devices** to replicate the adaptability of the human brain.

**Q5.**

You are tasked with designing a robot that can learn from its environment and respond intelligently without cloud connectivity.

Describe how **neuromorphic chips** could enable such **edge intelligence** and what biological mechanisms they attempt to emulate.

**Question:**

In an embedded system used for industrial automation, multiple sensors and actuators are connected through a **synchronous bus**. Explain how timing coordination between the master and slave devices ensures reliable data transfer, and discuss one potential drawback if clock synchronization is slightly mismatched.

**Answer:**

- In a **synchronous bus**, all devices share a **common clock signal** that coordinates when data is placed on and read from the bus.
- The **master** controls timing by generating clock pulses; all **slaves** respond at fixed clock intervals.
- This ensures **predictable and fast** communication, ideal for real-time industrial systems.
- **Drawback:** If devices have **different propagation delays** or **clock skew**, data may be read incorrectly, leading to **timing errors** or **bus contention**.

**Question:**

A company is developing a communication system where devices of different speeds (a slow temperature sensor and a fast processor) need to exchange data. Explain why an **asynchronous bus** is preferred and how handshaking enables smooth data transfer.

**Answer:**

- In an **asynchronous bus**, there is **no common clock**; instead, devices use **handshaking signals** like ACK and REQ to coordinate data transfer.
- The slow sensor signals readiness via REQ, and the processor responds with ACK once data is received.
- This allows each device to **operate at its own speed**, eliminating clock dependency.
- Ideal for **heterogeneous systems** with devices having varying access times.

**Question:**

In a manufacturing unit, a **parallel port** is used to control a robotic arm that needs multiple control signals to be sent simultaneously. Explain how data transmission through a parallel port benefits this system, and mention one limitation.

**Answer:** A **parallel port** transmits multiple bits simultaneously through separate data lines (typically 8 bits).

- This allows **fast and synchronized control** of multiple actuators in the robotic arm.
- **Benefit:** High data transfer rate and simultaneous signal delivery.
- **Limitation:** As the distance or number of connected devices increases, **signal interference and cable complexity** become major issues.

**Question:**

An autonomous weather station uses a **serial communication link (RS-232)** to send sensor readings to a remote control center. Explain how data is transmitted bit by bit using the serial port and why it is suitable for long-distance communication.

**Answer:**

- **Serial communication** sends data **one bit at a time** over a single line, synchronized by start and stop bits.
- Since only one data line is used, **signal degradation and crosstalk** are minimal over long distances.
- **RS-232** supports error checking and low-cost wiring, making it ideal for **remote weather data transmission** where speed is secondary to reliability.

**Question:**

A gaming computer uses a **PCI Express** slot to connect a high-speed graphics card. Explain how PCI architecture supports fast and direct data transfer between the graphics card and CPU.

**Answer:**

- **PCI Express (PCIe)** uses **high-speed serial links (lanes)** for point-to-point communication between CPU and peripherals.
- It allows **Direct Memory Access (DMA)**, enabling the graphics card to read/write memory **without CPU intervention**.
- This reduces latency and increases data throughput, which is critical for **real-time rendering in gaming**.
- **Scalability:** PCIe lanes can be multiplied (x1, x4, x8, x16) for higher bandwidth devices.

**Question:**

A data server uses multiple hard drives connected via a **SCSI interface** to handle simultaneous read/write requests. Explain how SCSI's ability to support multiple devices on a single bus improves performance.

**Answer:**

- **SCSI** allows multiple drives (initiators and targets) to share a **common bus** with unique IDs.
- It supports **command queuing** and **concurrent operations**, enabling one device to process data while another transfers.
- This parallelism enhances **I/O throughput**, ideal for **file servers and data centers** requiring multi-disk access.
- Supports **hot-swapping** and reliable data integrity mechanisms.

**Question:**

A smartphone connects to a computer via **USB-C** to transfer data and charge simultaneously. Explain how the USB protocol manages both power delivery and data communication efficiently over a single cable.

**Answer:**

- **USB-C** integrates **data lines and power lines** within a single reversible connector.
- It supports **Power Delivery (PD)** up to 100W and **SuperSpeed data transfer** up to 10 Gbps.
- **Protocol layers** separate data and power channels to avoid interference.
- **Plug-and-play** and **hot-swapping** improve user experience.
- Example: Smartphones can **charge and sync data** with PCs simultaneously.

**Question:**

Modern cars use **CAN bus networks** to enable communication between ECUs like engine control, airbag, and braking systems. Explain how CAN bus ensures reliable message delivery and prioritizes critical messages.

**Answer:**

- **CAN** uses **message-based communication**, not node-based addressing.
- Each message has an **identifier** indicating its priority — lower ID = higher priority.
- **Bitwise arbitration** ensures that the highest-priority message always gets bus access first.
- Built-in **error detection, CRC, and automatic retransmission** guarantee reliability.
- Example: In a car, the **brake command** has higher priority than infotainment data.

| Concept          | Real-Time Application | Key Mechanism               | Benefit                     |
|------------------|-----------------------|-----------------------------|-----------------------------|
| Synchronous Bus  | Industrial sensors    | Common clock sync           | Predictable timing          |
| Asynchronous Bus | Mixed-speed devices   | Handshaking                 | Speed flexibility           |
| Parallel Port    | Robotic control       | Multi-bit lines             | Fast simultaneous control   |
| Serial Port      | Weather station       | Bit-by-bit data             | Long-distance reliable comm |
| PCI              | Gaming PC             | DMA & lanes                 | High-speed parallelism      |
| SCSI             | Data servers          | Multi-device bus            | Concurrent I/O              |
| USB              | Smartphones           | Data + Power over same link | Convenience, versatility    |
| CAN Bus          | Automobiles           | Priority-based messaging    | Reliable & safe opera       |

### Scenario:

A processor executes instructions in five stages: Fetch, Decode, Execute, Memory, and

Write-back. Without pipelining, each instruction takes 5 clock cycles. With pipelining, a new instruction enters the pipeline every cycle.

**Question:**

If 10 instructions are executed, how many clock cycles are needed with and without pipelining? What is the speedup?

**Answer:**

- **Without pipelining:**  $10 \text{ instructions} \times 5 \text{ cycles} = 50 \text{ cycles}$
- **With pipelining:** First instruction takes 5 cycles, then 1 cycle per additional instruction →  $5 + (10 - 1) = 14 \text{ cycles}$
- **Speedup:**  $50 / 14 \approx 3.57 \times \text{faster}$

Superscalar and VLIW Architectures

**Scenario:**

A CPU can issue up to 4 instructions per cycle. In one cycle, the instruction queue contains:

1. Integer addition
2. Floating-point multiplication
3. Memory load
4. Branch instruction

**Question:**

How would a **superscalar** processor handle this versus a **VLIW** processor?

**Answer:**

- **Superscalar:** Dynamically analyzes dependencies and issues instructions to available functional units. If no conflicts, all 4 may be issued.
- **VLIW:** Relies on the compiler to schedule instructions. If the compiler packed these into one long instruction word, all 4 would execute in parallel—but only if no  hazards were predicted at compile time

SIMD Architecture

**Scenario:**

You are processing a  $1000 \times 1000$  grayscale image to increase brightness by adding 10 to each pixel value.

**Question:**

Why is SIMD ideal for this task?

**Answer:**

- **SIMD (Single Instruction, Multiple Data)** executes the same operation (add 10) on multiple data points (pixels) simultaneously.
- This task is **data-parallel** and **uniform**, making SIMD highly efficient—e.g., using vector registers to process 8 or 16 pixels at once.

MIMD Architecture

**Scenario:**

A web server handles multiple client requests. One thread processes a database query, another serves static content, and another compresses a file.

**Question:**

Why is MIMD suitable for this workload?

**Answer:**

- **MIMD (Multiple Instruction, Multiple Data)** allows each processor/thread to execute **different instructions on different data**.
- This is ideal for **task-parallel** workloads like web servers, where each task is independent and diverse.

What is Pipelining?

Pipelining is a technique used in CPUs to improve instruction throughput by overlapping the execution of multiple instructions. Think of it like an assembly line: while one instruction is being decoded, another can be fetched, and yet another can be executed.

## Key Stages

Typical stages include:

- **Fetch:** Retrieve instruction from memory
- **Decode:** Interpret the instruction
- **Execute:** Perform the operation
- **Memory:** Access memory if needed
- **Write-back:** Store the result

## Advantages

- **Increased throughput:** More instructions completed per unit time
- **Efficient resource utilization:** All parts of the CPU are active simultaneously
- **Reduced latency per instruction** (after pipeline fill)
- Without pipelining:  $10 \text{ instructions} \times 5 \text{ cycles} = 50 \text{ cycles}$
- With pipelining: First instruction takes 5 cycles, then 1 per instruction → **14 cycles**
- **Speedup**  $\approx 3.57\times$

## Superscalar vs VLIW Architectures

### Superscalar

A **superscalar** processor can issue multiple instructions per clock cycle using dynamic scheduling. It has multiple execution units and decides at runtime which instructions can be executed in parallel.

### VLIW (Very Long Instruction Word)

**VLIW** relies on the compiler to pack multiple operations into a single long instruction. The processor executes them in parallel but **does not perform dynamic scheduling**.

- Superscalar: CPU dynamically issues 4 instructions if no hazards
- VLIW: Compiler must ensure no conflicts before packing instructions

## SIMD Architecture

What is SIMD?

**Single Instruction, Multiple Data** architecture executes the same instruction on multiple data points simultaneously. It's ideal for **data-parallel** tasks like image processing, matrix operations, and scientific simulations.

Benefits

- **High performance** for uniform operations
- **Efficient use of vector units**
- **Reduced instruction overhead**
- Brightening a  $1000 \times 1000$  image by adding 10 to each pixel
- SIMD applies the addition to multiple pixels at once using vector registers

## MIMD Architecture

What is MIMD?

**Multiple Instruction, Multiple Data** architecture allows each processor to execute different instructions on different data. It's suited for **task-parallel** workloads like servers, simulations, and distributed systems.

Benefits

- **Flexibility:** Each core can run different programs
- **Scalability:** Ideal for multi-threaded and distributed computing
- **Concurrency:** Handles diverse tasks simultaneously
- A web server handles database queries, file compression, and content delivery—all as separate threads with different instructions

Would you like me to create a printable study guide or quiz based on this content? Or generate more diagrams for specific examples like SIMD in GPUs or VLIW in DSPs?

► **Increased throughput:** More instructions completed per unit time

- **Efficient resource utilization:** All parts of the CPU are active simultaneously
- **Reduced latency per instruction** (after pipeline fill)
- Without pipelining:  $10 \text{ instructions} \times 5 \text{ cycles} = 50 \text{ cycles}$

- With pipelining: First instruction takes 5 cycles, then 1 per instruction → **14 cycles**
- Speedup ≈ 3.57×**

Very Long Instruction Word (VLIW) is a type of processor architecture designed to execute multiple operations in a single instruction cycle. Unlike conventional processors that rely heavily on hardware to find parallelism, VLIW shifts this responsibility to the compiler, which packs independent instructions into one long instruction word.

- Execution units work simultaneously on different operations.
- Reduces the need for complex runtime scheduling.
- Improves instruction-level parallelism.



In other architectures, the performance of the processor is improved by using either of the following methods: pipelining (break the instruction into subparts), superscalar processor (independently execute the instructions in different parts of the processor), out-of-order-execution (execute orders differently to the program) but each of these methods add to the complexity of the hardware very much. VLIW Architecture deals with it by depending on the compiler. The programs decide the parallel flow of the instructions and to resolve conflicts. This increases compiler complexity but decreases hardware complexity by a lot.



IF: Instruction Fetch ID: Instruction Decode EX: Execute MEM: Memory WB: Write Back

## Features :

- The processors in this architecture have multiple functional units, fetch from the Instruction cache that have the Very Long Instruction Word.
- Multiple independent operations are grouped together in a single VLIW Instruction. They are initialized in the same clock cycle.
- Each operation is assigned an independent functional unit.
- All the functional units share a common register file.
- Instruction words are typically of the length 64-1024 bits depending on the number of execution unit and the code length required to control each unit.
- Instruction scheduling and parallel dispatch of the word is done statically by the compiler.
- The compiler checks for dependencies before scheduling parallel execution of the instructions.

**Table: Comparison of VLIW Architecture with Other Architectures**

| Architecture | Advantages                                                                                                                                                                                                                                                                                                     | Disadvantages                                                                                                                                                                                                                                                                                                                                                                                                       |
|--------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| VLIW         | <ul style="list-style-type: none"> <li>Reduces hardware complexity.</li> <li>Reduces power consumption.</li> <li>Simplifies decoding and instruction issues.</li> <li>Increases potential clock rate.</li> <li>Functional units are positioned corresponding to the instruction packet by compiler.</li> </ul> | <ul style="list-style-type: none"> <li>Complex compilers are required.</li> <li>Increased program code size.</li> <li>Larger memory bandwidth and register-file bandwidth.</li> <li>Unscheduled events, for example, a cache miss could lead to a stall that will stall the entire processor.</li> <li>In case of un-filled opcodes in a VLIW, there is waste of memory space and instruction bandwidth.</li> </ul> |
| Pipelining   | <ul style="list-style-type: none"> <li>Increases instruction throughput.</li> <li>Enhances performance by overlapping instruction execution.</li> <li>Reduces hardware complexity.</li> </ul>                                                                                                                  | <ul style="list-style-type: none"> <li>Dependency checking between instructions is required.</li> <li>Pipeline hazards and stalls can occur.</li> </ul>                                                                                                                                                                                                                                                             |
| Superscalar  | <ul style="list-style-type: none"> <li>Improves performance by executing multiple instructions per clock cycle.</li> </ul>                                                                                                                                                                                     | <ul style="list-style-type: none"> <li>Dependency checking between instructions is required.</li> </ul>                                                                                                                                                                                                                                                                                                             |

| Architecture           | Advantages                                                                                                                                                                                   | Disadvantages                                                                                                                                                                                                                                                                      |
|------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                        | <ul style="list-style-type: none"> <li>Reduces hardware complexity.</li> <li>Enhances instruction throughput.</li> </ul>                                                                     | <ul style="list-style-type: none"> <li>Out-of-order execution leads to more complexity.</li> </ul>                                                                                                                                                                                 |
| Out-of-order-execution | <ul style="list-style-type: none"> <li>Improves performance by overlapping instruction execution.</li> <li>Enhances instruction throughput.</li> <li>Reduces hardware complexity.</li> </ul> | <ul style="list-style-type: none"> <li>Complexity increases due to out-of-order execution.</li> <li>Dependency checking between instructions is required.</li> <li>Register renaming is required to resolve name dependencies.</li> <li>Dynamic scheduling is required.</li> </ul> |

#### Applications of VLIW Architecture Include:

- Digital signal processing (DSP):** VLIW processors are well-suited for DSP applications because of their ability to perform multiple operations in parallel. DSP applications require high computational power and often involve multiple parallel data streams, which VLIW processors can handle efficiently.
- Multimedia processing:** VLIW processors are also used for multimedia applications such as video and audio processing, where high throughput and parallelism are required.
- Scientific computing:** VLIW processors can be used for scientific computing applications, where high-performance computing is required to solve complex numerical problems.
- Embedded systems:** VLIW processors are used in many embedded systems, such as automotive control systems, medical devices, and industrial automation

equipment. These systems require high-performance processors that can execute multiple instructions in parallel while consuming minimal power.

## ) Synchronous Bus — Industrial Automation Scenario (in depth)

### How timing coordination ensures reliable transfer

- A **synchronous bus** uses a **single global clock** signal distributed to all devices on the bus. All data transfers (putting data on bus, sampling data) happen relative to clock edges (rising or falling).
- Typical transaction sequence (master-driven):
  1. Master asserts address & control lines at a specified clock cycle.
  2. After a fixed number of clock cycles (determined by bus protocol), the slave must place data on the data lines.
  3. The master samples the data on the agreed clock edge.
- Because every device uses the same discrete time reference, **setup** and **hold** times are defined and designers can ensure signals are stable when sampled.
- Example signals: CLK, ADS (address strobe), RD/WR, DATA[ ], and ACK. Each asserted/deasserted in fixed relation to CLK.

### Why it's reliable for industrial systems

- **Determinism:** Latency (number of cycles per operation) is predictable → crucial for control loops.
- **Low protocol overhead:** No extra handshake signals are required beyond the clocked control lines, so throughput is high.

### Potential drawback if clock sync is mismatched (clock skew)

- **Clock skew:** The clock arrives at different devices at slightly different times due to trace length, buffering, temperature, etc.
  - If skew > allowable margin, receiver may sample when data not yet stable → **bit errors**.
- **Consequences:** Intermittent faults, incorrect actuator commands, missed deadlines.

- **Mitigations:**
  - Use **deskewing** techniques, matched-length traces, differential clock lines, and on-board PLLs/clock trees.
  - For rugged distributed systems, prefer phase-aligned local clocks or hybrid synchronous-asynchronous domains.

### Practical considerations

- Synchronous buses work best **on-board** (short traces) and where devices meet same clock speed constraints.
- For long distances or mixed-speed devices, the cost of distributing a low-skew clock becomes prohibitive.

## 2) Asynchronous Bus — Mixed-Speed Devices & Handshaking (in depth)

### Why asynchronous is preferred

- Devices operate at different speeds; asynchronous design doesn't force all to use one clock. Each device signals readiness to send/receive, enabling robust communication between a slow sensor and a fast CPU without the faster device having to slow down to a fixed clock.

### Typical handshake protocol (two-phase example)

- **Four-wire handshake** (example): REQ, ACK, DATA, CTL
  1. **Sender** places valid data on DATA and asserts REQ.
  2. **Receiver** checks data at its own pace; when ready, asserts ACK.
  3. Sender sees ACK, deasserts REQ.
  4. Receiver sees REQ deasserted, deasserts ACK. Transaction complete.
- **Two-phase bundle** or **pulse-based** variants reduce signal transitions but maintain the same semantics.

### Benefits

- **Flexibility:** Slow peripheral can take time to prepare data without blocking the whole system clock.
- **Robustness:** Works over longer cables and with variable propagation delays.

- **Ease of integration:** Heterogeneous devices (e.g., battery-powered sensors, high-speed MCUs) can coexist.

### Drawbacks & overhead

- **Latency overhead:** Extra cycles/time for request/ack sequence.
- **Complex control:** Requires state machines for sender/receiver; more complex than a simple synchronous sample.
- **Throughput:** Lower than tightly-clocked synchronous buses for short-distance, same-speed devices.

### Design tips

- Use **pipeline buffers** on the fast side to absorb slow peripheral delays.
- Use **timeouts** and retries to handle stuck devices.
- For higher throughput, asynchronous protocols may support burst transfers (handshake to start a burst, then stream).

## 3) Parallel Port Interface — Robotic Arm Control (in depth)

### How parallel transmission benefits the system

- **Multiple control signals simultaneously:** Each control line can control a separate actuator or convey an independent bit of a multi-bit control word (e.g., step motor control inputs).
- **Deterministic simultaneous delivery:** All bits of a control word arrive at the same instant (within trace skew), enabling synchronous activation of multiple actuators—important for coordinated motion.

### Typical signals & protocol

- Data lines D0..D7 (8 bits), plus control lines such as STROBE, ACK, BUSY, AUTOFEED.
- Master places the 8-bit command on D[7:0], pulses STROBE; peripheral latches the data and pulses ACK.

### Limitation when scaling up

- **Signal integrity & skew:** As cable length or bit-width grows, differential delays cause bits to arrive at different times → timing errors.

- **Crosstalk & EMI:** Dense parallel wires are susceptible to interference.
- **Connector/cable bulk:** Practical difficulty routing dozens of parallel lines.
- **Mitigation:** Keep parallel ports short, use shielding, or migrate to serial high-speed links (e.g., LVDS or differential pairs).

### Modern usage

- Mostly legacy; replaced by serial interfaces (USB, PCIe) for long runs and high-speed devices. Still useful in tightly coupled short-range control signals.

## 4) Serial Port Interface (RS-232) — Weather Station

### How data is transmitted

- **Start bit** signals beginning, followed by N data bits (LSB first), optional parity bit, and one or more stop bits.
- Example frame (8N1): Start(0), D0..D7, Parity(None), Stop(1).
- Baud rate defines bit duration (e.g., 9600 bps → 104.17 µs per bit).

### Why suitable for long-distance

- **Single-ended vs differential:** RS-232 is single-ended and okay for modest distances (up to ~15m). RS-485 uses differential signaling for longer runs (hundreds of meters).
- **Simple wiring:** One twisted pair and ground or a single TX/RX pair simplifies cabling.
- **Tolerance to ground differences:** Differential serial (RS-485) is resistant to ground offset and noise; appropriate for remote sensors.

### Reliability mechanisms

- **Baud parity** (optional), error detection at higher layers (checksum), retransmissions on error.
- **Start/stop framing** makes receiver self-synchronizing on each byte (no common clock needed).

### Design considerations

- Use **line drivers** and **isolation** for harsh environments.

- Use buffering and retry logic on the application layer to handle intermittent errors.

## 5) PCI / PCIe — Gaming Graphics

### How PCIe supports fast direct data transfer

- **PCIe is a point-to-point, layered protocol** with lanes (each lane is a differential pair per direction).
- Transaction flow:
  - CPU or device issues **Transaction Layer Packets (TLPs)**.
  - TLPs go to the Data Link Layer (CRC, ACK) and then Physical Layer (encoding, lane mapping, SERDES).
- **PCIe lanes** provide scalable bandwidth (e.g., PCIe 3.0  $\approx$  1 GB/s per lane per direction; x16 gives  $\sim$ 16 GB/s).
- **DMA and bus mastering**:
  - GPU can perform DMA to system memory (via IOMMU if present) without CPU copying, drastically reducing CPU load and latency.

### Why this improves gaming performance

- High sustained texture/vertex data movement between GPU and system memory.
- Low latency transfers for streaming textures and framebuffers.
- Scalability: GPU with more PCIe lanes gets more bandwidth for data-heavy workloads.

### Architectural details

- **Memory mapping:** MMIO registers mapped into CPU physical address space for device control.
- **Driver interaction:** OS and drivers set up DMA regions and manage page pinning.
- **Latency considerations:** Small TLPs have overhead; large payloads improve efficiency (MTU tuning).

### Design trade-offs

- Power/perf vs slot width (x8 vs x16).

- CPU/GPU NUMA effects: On multi-socket systems, GPU affinity matters.

## 6) SCSI — Multi-Drive Server

### How SCSI supports multiple devices

- SCSI uses a **command protocol** where the host (initiator) sends Command Descriptor Block (CDB) to a target (drive).
- **Bus architecture:** Parallel SCSI (legacy) had shared bus with device IDs; Serial Attached SCSI (SAS) uses point-to-point serial links with expanders.
- **Tagged command queuing (TCQ)** allows multiple outstanding commands per device, enabling the controller to rearrange commands for optimal throughput.

### Performance & reliability benefits

- **Concurrent I/O:** Several devices can operate independently; a RAID controller can issue commands to multiple drives in parallel.
- **SCSI command set** supports robust features (smart commands, SENSE for detailed errors).
- **Enterprise features:** Hot-swap, enterprise-level error recovery, persistent reservations for clustering.

### Design considerations

- Use HBA (Host Bus Adapter) to offload protocol handling and provide DMA.
- SCSI devices in server environments often use higher-end controllers, queuing, and caching strategies.

### Modern replacements

- SAS and NVMe over PCIe are common; NVMe takes the concurrency idea further with multiple queues per core.

## 7) USB-C — Data + Power over One Cable

### How USB manages both power and data

- **Physical wiring:** USB-C connector includes power rails (VBUS), ground, and high-speed differential pairs (SuperSpeed lanes) plus configuration channel (CC) pins for negotiation.
- **Power Delivery (PD) protocol:**
  - Negotiation over CC pin to set voltage/current (e.g., 5V/9V/12V/20V up to 5A).
  - Dynamic switching of roles (host/device, dual-role power).
- **Data channels:**
  - USB2.0 (D+/D-) and USB3.x SuperSpeed lanes for high bandwidth.
  - Alternate modes (e.g., DisplayPort) allow reassigning lanes for video.

### **Protocol separation**

- Power negotiation and delivery are handled at the PD and electrical layer; data flows via USB protocol stack (enumeration, endpoints, transfers).
- This separation avoids interference and allows safe simultaneous use.

### **Plug-and-play & hot-swap**

- On connect, host polls for device descriptors, assigns addresses, and configures endpoints. USB class drivers simplify device support.

### **Design & safety aspects**

- PD enforces negotiation to prevent overvoltage.
- USB-C implements CC to detect cable orientation and capabilities.

### **Example use case**

- A smartphone charges and communicates with PC: PD negotiates 9V@2A while USB handles file transfers; OS enumerates and mounts storage or uses MTP.

## **8) CAN Bus — Automotive ECUs**

### **How CAN ensures reliable delivery in noisy environments**

- **Differential signaling (CAN\_H/CAN\_L)** reduces susceptibility to common-mode noise.

- **Bitwise arbitration:**
  - Every node starts transmitting when it has data. During arbitration, nodes transmit their message ID MSB first. If a node transmits recessive (1) but reads dominant (0), it loses arbitration and stops transmitting — no collisions or corrupted frames.
  - Lower ID value ⇒ higher priority.
- **Error detection & recovery:**
  - Frame CRC, ACK slot, bit stuffing, and frame checks.
  - Error counters per node; nodes transmit error frames on detecting faults.
  - Faulty nodes can go **bus-off** after repeated errors to protect bus health.
- **Determinism:**
  - Priority-based arbitration ensures high-priority safety messages (e.g., brake) get through with bounded latency.

### Frame structure (classic CAN)

- Start bit, Arbitration ID (11-bit or 29-bit for extended), Control, Data (0–8 bytes for CAN 2.0; larger for CAN FD), CRC, ACK, EOF.

### Why suitable for automotive

- Robustness to EMI, ability to prioritize safety-critical messages, and low wiring complexity — ECUs can be added in distributed topology with minimal wires.

### Practical considerations

- Proper termination ( $120\ \Omega$ ) is mandatory to prevent reflections.
- Use twisted-pair for CAN\_H/CAN\_L, route away from high-power wires.
- For high-data needs, CAN-FD increases payload size and bit rates.

### Quick comparative summary (practical takeaways)

- **Synchronous:** High throughput, low latency, best for short, uniform-speed systems. Critical: clock distribution.
- **Asynchronous:** Flexible, for heterogeneous or long-distance devices. Handshaking increases latency.

- **Parallel port:** Fast for short, simultaneous signals but scales poorly; legacy use.
- **Serial (RS-232/RS-485):** Simple, robust for long-distance telemetry; slower but reliable.
- **PCIe:** High-bandwidth, scalable, point-to-point links; enables DMA and low-latency device comms.
- **SCSI/SAS/NVMe:** Enterprise storage protocols designed for concurrency and reliability.
- **USB-C:** Converged data+power with negotiation; ubiquitous for consumer devices.
- **CAN:** Real-time, priority-based, robust bus for vehicle networks and industrial control.

SIMD in GPUs

#### **Diagram Highlights:**

- Shows **multiple threads** executing the **same instruction** (e.g., add, multiply) on **different data blocks** like pixels or matrix elements.
- Includes **vector units**, **shared memory**, and **global memory** to illustrate how GPUs manage parallelism.
- Demonstrates how **warp-level execution** works in NVIDIA-style architectures.

#### **Use Case:**

Perfect for visualizing tasks like image filtering, deep learning tensor operations, or physics simulations in games.

VLIW in DSPs

#### **Diagram Highlights:**

- Displays a **VLIW instruction** composed of multiple operations: ADD, MULT, LOAD.
- Each operation is routed to a **dedicated functional unit**: ALU, FPU, MEMORY UNIT.
- Emphasizes **compiler-driven scheduling**—no dynamic reordering at runtime.

### **Use Case:**

Ideal for understanding how DSPs handle audio processing, signal filtering, or real-time control systems with predictable instruction flows.

## **Pipelining — Real-Time Scenario: Processor in an Autonomous Vehicle**

### **Scenario:**

An autonomous car's control processor continuously reads sensor data (LiDAR, camera, radar), performs calculations, and issues commands to actuators. Each instruction—fetch, decode, execute, memory access, write-back—must be executed fast enough to keep up with the sensor feed (e.g., 1000 frames/sec).

### **Question:**

How does *instruction pipelining* help meet real-time performance needs, and what problems could arise if a pipeline hazard occurs?

### **Answer:**

#### **Mechanism:**

Pipelining divides instruction execution into stages:

1. **IF:** Instruction Fetch
2. **ID:** Decode
3. **EX:** Execute
4. **MEM:** Memory Access
5. **WB:** Write Back

Each stage works concurrently on different instructions — just like an assembly line.

So while one instruction is in *MEM*, another can be in *EX*, and a third is being *decoded*.

- ◆ **Impact in real-time system:**
  - **Latency per instruction** ≈ unchanged (still 5 stages).
  - **Throughput** increases drastically — ideally, one instruction completes every clock cycle after pipeline fill.

- The car's CPU can now process multiple sensor readings in parallel stages, ensuring timely control actions (steering, braking).
- ◆ **Real-world hazard issues:**
  - **Data Hazard:** Example — if Instruction 2 uses the result of Instruction 1 before it's written back.
    - **Fix:** Forwarding paths or pipeline stalls.
  - **Control Hazard:** From branch predictions in obstacle detection (if-else conditions).
    - **Fix:** Branch prediction & speculative execution.
  - **Structural Hazard:** Two instructions need same hardware resource (e.g., ALU).
    - **Fix:** Duplicate units or stall logic.
- ◆ **Advantages:**
  - Increased instruction throughput
  - Efficient CPU utilization
  - Scalable for predictable workloads (ideal in embedded systems)
- ◆ **Drawbacks:**
  - Pipeline stalls increase worst-case latency.
  - Harder to guarantee deterministic timing — must be analyzed for **real-time deadlines**.

## ② Superscalar Architecture — Real-Time Scenario: Gaming Console CPU

### Scenario:

A PlayStation-like console must render 3D graphics at 60 FPS. Each frame requires thousands of arithmetic and logic operations. A superscalar processor (e.g., Intel Core or AMD Ryzen) issues multiple instructions per clock.

### Question:

How does a superscalar processor improve gaming performance, and what scheduling challenges does it face?

### Answer:

- ◆ **Mechanism:**

- A **superscalar CPU** has multiple **execution units** (e.g., ALU1, ALU2, FPU, load/store unit).
- The **instruction issue logic** analyzes dependencies and issues 2–4 instructions *in parallel* each cycle.

Example:

| Clock Cycle | ALU 1 | ALU 2 | FPU |
|-------------|-------|-------|-----|
| 1           | ADD   | SUB   | MUL |
| 2           | AND   | OR    | DIV |

- ◆ **Real-world benefit in gaming:**

- Physics engine, texture computations, and AI logic all contain independent instructions that can execute simultaneously.
- The CPU completes more work per clock, giving **higher frame rates** and smoother gameplay.

- ◆ **Scheduling challenge:**

- Superscalar relies on **dynamic hardware scheduling** to detect independent instructions.
- **Dependencies** or **branch mispredictions** can cause stalls or under-utilized units.
- For instance, if most instructions depend on a previous one's result, only one pipeline can work while others idle.

- ◆ **Advantages:**

High Instruction-Level Parallelism (ILP)

Compatible with standard compilers (hardware decides scheduling)

Improved performance for general-purpose workloads

- ◆ **Challenges:**

Complex control logic (hazard detection, register renaming)

Power consumption and die area increase

Diminishing returns if code has few independent instructions

### ③ VLIW (Very Long Instruction Word) — Real-Time Scenario: DSP in a Medical Ultrasound Machine

#### Scenario:

A Digital Signal Processor (DSP) in a medical ultrasound machine performs real-time echo analysis — thousands of multiply-accumulate (MAC) operations per second to form images.

#### Question:

How does a VLIW processor accelerate this real-time signal processing, and what trade-offs exist compared to superscalar CPUs?

#### Answer:

##### ♦ Mechanism:

- In **VLIW**, the *compiler* (not hardware) identifies independent operations and packs them into a **single long instruction word**.
- Each VLIW word controls multiple functional units (e.g., ALU, FPU, memory access) **explicitly** in one clock.

#### Example:

##### VLIW Word:

[ADD R1,R2,R3 | MUL R4,R5,R6 | LOAD R7, [R8] | NOP]

All execute simultaneously in one cycle.

##### ♦ Real-time performance in ultrasound:

- The DSP repeatedly performs predictable math operations on independent data samples.
- Since dependencies are known at compile-time, the compiler efficiently schedules operations → **zero runtime scheduling overhead**.
- Deterministic execution timing → critical for medical devices where delays = diagnostic errors.

##### ♦ Advantages:

**Compiler-driven parallelism** — predictable timing

**Simple hardware** — no dynamic scheduling

Ideal for repetitive, well-analyzed workloads

◆ **Drawbacks:**

Code size increases (many NOPs for unused units)

Requires highly optimizing compiler

Poor performance on irregular or unpredictable code

◆ **Comparison to Superscalar:**

| Feature               | Superscalar             | VLIW               |
|-----------------------|-------------------------|--------------------|
| Parallelism detection | Hardware                | Compiler           |
| Hardware complexity   | High                    | Low                |
| Predictability        | Low (dynamic)           | High (static)      |
| Ideal for             | General-purpose<br>CPUs | Embedded /<br>DSPs |

## ④ SIMD (Single Instruction, Multiple Data) — Real-Time Scenario: Video Processing in Smartphone

**Scenario:**

A smartphone applies a real-time video filter (e.g., brightness correction or edge detection) on 4K video frames.

**Question:**

How does SIMD architecture accelerate video filtering, and why is it so power-efficient?

**Answer:**

◆ **Mechanism:**

- SIMD executes **one instruction on multiple data elements simultaneously**.
- Example: A single ADD instruction can add 4 or 8 pixel values at once (vectorized operation).

- Implemented via **vector registers** (e.g., 128-bit SSE, 256-bit AVX, 512-bit NEON).
- ◆ **Real-time impact:**
  - Instead of processing each pixel sequentially, the SIMD engine handles multiple pixels per cycle.
  - This reduces execution time by  $\sim 4\times\text{--}8\times$ , enabling real-time filters at 60 FPS with less power.
- ◆ **Example:**

ADDPS xmm0, xmm1

→ Adds four 32-bit floats (pixels) in parallel.

- ◆ **Advantages:**

High data-level parallelism

Energy efficient — fewer instructions for same work

Perfect for multimedia, neural network layers, and sensor data streams

- ◆ **Challenges:**

Requires **data to be vectorized and aligned**

Not suitable for irregular data or control flow

Compiler must generate SIMD-friendly code (loop unrolling, alignment)

- ◆ **Real-world example:**

ARM NEON or Intel AVX in mobile CPUs — accelerate face recognition, camera pipelines, and AR processing.

## ⑤ MIMD (Multiple Instruction, Multiple Data) — Real-Time Scenario: Cloud-Based AI Inference Engine

### Scenario:

A cloud AI inference server runs multiple neural networks simultaneously — one detects objects, another transcribes speech, another translates text.

### Question:

How does an MIMD architecture handle these workloads efficiently compared to SIMD, and what synchronization issues arise?

**Answer:**

◆ **Mechanism:**

- MIMD systems consist of **multiple independent processors**, each executing different instructions on different data sets.
- Each core runs its own thread/process, possibly on distinct memory segments.

◆ **Real-world behavior:**

- One core runs CNN inference (object detection), another runs RNN for speech, another handles text translation.
- Tasks run concurrently → maximizing utilization and throughput.

◆ **Synchronization & communication:**

- Shared memory or message passing is used to coordinate between processors.

◆ **Challenges:**

- Race conditions if multiple processors access shared variables.
- Cache coherence overhead in multi-core systems.
- Load balancing — one heavy model might stall overall performance.

◆ **Advantages:**

Task-level parallelism

Suitable for heterogeneous workloads (not just vectorizable data)

Scalable to large clusters (supercomputers, cloud AI servers)

◆ **Drawbacks:**

Programming complexity — synchronization, data sharing

Overhead from communication between processors

Power and thermal management challenges in dense multi-core chips

## 6 Comparative Summary — Real-Time Suitability

| Architecture | Parallelism Type            | Real-Time Use Case     | Key Benefit                       | Challenge                   |
|--------------|-----------------------------|------------------------|-----------------------------------|-----------------------------|
| Pipelining   | Instruction Overlap         | Vehicle control CPU    | High throughput                   | Hazard management           |
| Superscalar  | Instruction-level (Dynamic) | Gaming CPU             | Multi-instruction per cycle       | Complex hardware            |
| VLIW         | Instruction-level (Static)  | Medical DSP            | Deterministic, compiler-optimized | Large code size             |
| SIMD         | Data-level                  | Image/Video processing | Massive vector parallelism        | Works only for uniform data |
| MIMD         | Task-level                  | Cloud AI, HPC          | Independent task concurrency      | Synchronization, coherence  |

### Case Study 1 — Pipelining

**Application:** Real-time sensor fusion and control for an autonomous quadcopter.

#### Problem statement

A flight-control microcontroller must process IMU (gyro/accel) + barometer + optical-flow camera inputs and run a control law (sensor fusion + PID) at **1 kHz** with worst-case latency < 1 ms.

#### Design objectives

- Deterministic instruction throughput to meet 1 kHz loop.
- Minimize worst-case latency and ensure predictable deadlines.
- Low power and small footprint (embedded board).

#### Constraints

- MCU clock  $\leq$  200 MHz (budget/thermal).
- Single core, on-chip memory limited (e.g., 256 KB).
- Must handle interrupts (telemetry) without missing control deadlines.

#### Why pipelining

Pipelining increases instruction throughput while keeping per-stage timing predictable. For real-time control, deterministic stage timing enables easier worst-case execution time (WCET) analysis.

## Hardware design

- Use a RISC core with a 5-stage pipeline (IF, ID, EX, MEM, WB).
- Add **forwarding/short-circuit paths** to resolve data hazards for arithmetic ops used in control laws.
- Include **interrupt preemption** only at stage boundaries and an optimized interrupt latency path.
- Duplicate critical resources (ALU for control math and separate MAC for sensor filtering) to avoid structural hazards.

## Software design

- Use fixed-point math or a small FPU (if available) to reduce latency.
- Compiler flags: enable loop unrolling and keep hot loops inlined.
- Arrange critical control loop to minimize data dependencies (helps pipeline utilization).
- Insert explicit NOP or use compiler barriers only where needed (avoid unnecessary stalls).

## Performance targets / metrics

- Throughput: one control iteration per 2000 cycles (at 200 MHz → 1 kHz).
- Worst-case execution time (WCET) margin:  $\leq 800 \mu\text{s}$  including interrupts.
- Pipeline stall rate  $< 5\%$  in steady state.

## Pitfalls & mitigations

- **Branch-heavy code** (conditionals in fusion) causes control hazards — use small, predictable branches or branch prediction hints.
- **Cache misses** can blow WCET — lock critical code/data in TCM (tightly-coupled memory).
- **Interrupt jitter** — prioritize and defer non-critical interrupts.

## Test & validation

- Synthetic worst-case tests (all sensors active + telemetry bursts).
- Use cycle-accurate simulator and measure WCET.
- Hardware-in-loop (HIL) with real sensor latency injection.

## **Case Study 2 — Superscalar**

**Application:** Real-time game physics engine for a console CPU.

### **Problem statement**

Design a CPU core to run a 3D game physics sub-system (rigid-body dynamics, collision detection) with a target of **1200 physics updates/sec** across many independent objects while keeping power reasonable.

### **Design objectives**

- High average IPC (instructions per cycle) for mixed workloads.
- Handle dynamic parallelism in general-purpose code (branching, loads with unpredictable latencies).
- Balance performance vs power.

### **Constraints**

- Single-chip consumer console: die-area and thermal limits.
- Must support legacy instruction set and compiler toolchain.

### **Why superscalar**

Superscalar dynamically extracts ILP at run-time, enabling performance on irregular code paths common in physics (conditional collision tests, diverse object shapes). It relieves compiler burden and adapts to runtime behavior.

### **Hardware design**

- Wide fetch & decode (4-issue), out-of-order issue, large ROB, register renaming.
- Multiple functional units: 2 integer ALUs, 1 vector FPU, 1 branch unit, 2 load/store units.
- Aggressive branch predictor (tournament/TAGE) and small L1 caches tuned for physics workload.

- Hardware prefetcher and L2 for streaming object data.

## Software design

- Optimize hot loops to increase independent instructions (loop splitting, software pipelining).
- Use data-oriented layout (AoS → SoA) to reduce memory stalls and enable vectorization where possible.
- Profile-guided optimization (PGO) to improve branch prediction behavior.

## Performance targets / metrics

- Average IPC  $\geq 2.5$  for physics workloads.
- Branch mispredict rate  $< 2\%$ .
- Pipeline flush penalty amortized to  $< 5\%$  overhead.

## Pitfalls & mitigations

- **Memory stalls:** physics accesses large arrays → use huge pages or scatter-gather buffers, cache-aware allocators.
- **Dependency chains:** long chains reduce ILP → code refactor to break dependencies, use speculation.
- **Power:** dynamic power gating of idle units, clock gating.

## Test & validation

- Run game scenarios with stress physics (large rigid-body counts).
- Use performance counters: issue width utilization, ROB occupancy, load-store queues.
- Thermal tests under long runs to check throttling.

## Case Study 3 — VLIW

**Application:** Real-time image compression pipeline for ultrasound imaging.

### Problem statement

A portable ultrasound system must compress incoming frames (e.g., DCT + quantization + entropy coding) in real-time for streaming over limited bandwidth while minimizing power.

## Design objectives

- Deterministic latency per frame for a guaranteed frame rate (e.g., 60 fps).
- Efficient parallel arithmetic for dense, repetitive operations.
- Low hardware complexity and power.

## Constraints

- Fixed-function operations (DCT, quantize) dominate compute.
- Target is battery-powered portable device.

## Why VLIW

The operations are regular and highly parallelizable — the compiler can statically schedule operations into long instruction words, enabling simple hardware and deterministic timing (suitable for real-time imaging).

## Hardware design

- VLIW core with multiple slots: 2 ALUs, 2 MACs, 1 load/store per cycle.
- Very long instruction word (e.g., 128–256 bits) containing multiple operations.
- Large register file and local scratchpads for data reuse.
- Minimal dynamic scheduling: simpler issue/commit logic.

## Software / Compiler design

- Heavy compiler responsibility: use an optimizing compiler supporting software pipelining, register allocation, and loop unrolling.
- Profile-guided scheduling for hot paths.
- Use predication / trace scheduling to reduce branch stalls.
- Provide intrinsics for developers to express parallel ops.

## Performance targets / metrics

- Deterministic cycles per frame (e.g.,  $\leq X$  cycles) guaranteed by compile-time scheduling.

- Utilization: fraction of VLIW slots filled  $\geq 85\%$  for hot loops.

### Pitfalls & mitigations

- **Code bloat** from NOP-padding: mitigate by software pipelining and register spilling minimization.
- **Cache pressure**: larger binaries  $\rightarrow$  manage with instruction cache partitioning.
- **Runtime variability** (e.g., variable input patterns): use small hardware feedback (profile counters) to guide recompilation or dual-mode execution.

### Test & validation

- End-to-end latency measurement with real ultrasound frames.
- Verify deterministic bounds across varying input scenes.
- Compiler regression tests ensuring scheduling correctness for all inputs.

## Case Study 4 — SIMD

**Application:** Real-time augmented-reality (AR) video filter on mobile SoC.

### Problem statement

Apply computationally heavy filters (convolution, color correction) to live video frames (30–60 FPS) with minimal battery drain on a smartphone.

### Design objectives

- Maximize per-cycle pixel throughput using SIMD vector units.
- Keep energy per frame low.
- Minimize latency for interactive AR overlays.

### Constraints

- Mobile thermal/power limits.
- Memory bandwidth limited; caches small.

### Why SIMD

Pixel and audio processing are data-parallel (same op applied across many data elements). SIMD executes many pixels per instruction, giving large speedups and energy efficiency.

## Hardware design

- Vector unit (e.g., NEON — 128-bit) with FP and integer lanes.
- Support for load/store pair instructions with streaming prefetch.
- Efficient alignment and scatter/gather support for common image memory layouts.

## Software design

- Use intrinsics or auto-vectorizing compiler to express operations in SIMD form.
- Organize data as **SoA** (separate arrays for R, G, B) or interleaved but aligned to vector width.
- Use tiling and blocking to improve cache reuse.
- Fallback scalar path for edge cases.

Example vector instruction sequence:

LOAD V0, [src] ; 4 pixels

LOAD V1, [kernel] ; kernel coefficients

FMAC V2, V0, V1 ; multiply-accumulate on vectors

STORE [dst], V2

## Performance targets / metrics

- Throughput: pixels processed per cycle  $\geq$  vector\_width.
- Power: energy/frame under budget.
- Latency: end-to-end filter  $< 30$  ms for perceived real-time.

## Pitfalls & mitigations

- **Alignment issues** cause penalties — align buffers to vector boundary.
- **Non-uniform data paths** (e.g., irregular kernels) hinder vectorization — use software transpose or reformatting.
- **Memory bandwidth**: use tiling to reduce cache misses and minimize loads.

## Test & validation

- Use representative video workloads, measure FPS and energy (power profiler).

- Verify perceptual latency with human-subject tests (user feels real-time).
- Test corner cases on different SIMD widths.

## **Case Study 5 — MIMD**

**Application:** Real-time multi-sensor fusion and distributed planning on an autonomous ground vehicle (AGV) cluster.

### **Problem statement**

A coordinated fleet of AGVs must share environment maps, plan routes, and execute control in real-time. Each AGV has a multicore system; the fleet uses V2V messages for coordination.

### **Design objectives**

- Run perception, localization, planning concurrently on different cores.
- Scale across multiple vehicles (distributed MIMD).
- Low-latency inter-core and inter-vehicle communication.

### **Constraints**

- Network bandwidth and latency limits (wireless).
- Real-time deadlines for collision avoidance.
- Heterogeneous cores (big.LITTLE), some cores are energy efficient.

### **Why MIMD**

Different tasks (SLAM, path planning, control) are independent and have different computational characteristics. MIMD lets each core/process run different code on different data concurrently, enabling real-time multi-tasking.

### **Hardware design**

- Multi-core SoC with hardware support for low-latency message passing (e.g., shared memory + DMA), NUMA-aware memory.
- Hardware accelerators for vision tasks (NPU) as co-processors.
- Onboard fast interconnect (AXI) and dedicated Ethernet/Wi-Fi for V2V.

### **Software design**

- Partition tasks: core0 runs perception (CNN inference on NPU), core1 runs localization, core2 runs planner, core3 runs safety monitor.
- Use lightweight real-time OS or RT patches (PREEMPT\_RT) and priority-based scheduling.
- Use lock-free queues and real-time communication middleware (DDS/RTPS) for inter-AGV messages.
- Provide redundancy: failover threads for safety-critical tasks.

### Performance targets / metrics

- End-to-end perception-to-actuation latency < 100 ms.
- Inter-AGV update latency < 50 ms for cooperative maneuvers.
- CPU utilization balanced (no core overloaded).

### Pitfalls & mitigations

- **Synchronization overhead:** Use lock-free structures and bounded queues to meet deadlines.
- **Non-deterministic network:** design conservative safety buffers and predictively reserve paths.
- **Load imbalance:** dynamic task migration and load shedding for non-critical tasks.

### Test & validation

- Simulate city scenarios with multiple AGVs; measure latencies end-to-end.
- Fault injection (network loss, core failure) to ensure safety fallback behavior.
- Use real-world trials in controlled environments.

### Deliverables you can extract from these case studies

- **One-page design spec** for each case (useful for lab projects).
- **PowerPoint slides:** convert each case to 3–5 slides (problem, architecture, data flow, metrics).
- **Lab assignment:** implement simplified software (e.g., SIMD filter using intrinsics; pipeline timing on dev board).

- **Evaluation metrics** and test plans for validation.