



Zewail City for Science and  
Technology

2025-2026

|                       |                                    |
|-----------------------|------------------------------------|
| <b><i>Course:</i></b> | Parallel and Distributed Computing |
| <b><i>Code:</i></b>   | SW 401                             |
| <b><i>type</i></b>    | Project                            |

|                             |                               |
|-----------------------------|-------------------------------|
| <b><i>Due Date:</i></b>     | Nov 3, 2025                   |
| <b><i>Project name:</i></b> | Parallel Sobel Edge Detection |

**Under The Supervision of**

Dr. Doaa Shawky

| <b><u>Name</u></b> | <b><u>ID</u></b> |
|--------------------|------------------|
| NourEldin Mohamed  | 202201310        |
| Osama Alashry      | 202201291        |

# Table of Contents

|                                                       |           |
|-------------------------------------------------------|-----------|
| <b>1. System Architecture and Topology.....</b>       | <b>3</b>  |
| 1.1 Parallel System Design.....                       | 3         |
| 1.2 System Topology Diagram.....                      | 4         |
| 1.3 Communication Patterns.....                       | 4         |
| <b>2. Communication Overhead Analysis.....</b>        | <b>4</b>  |
| 2.1 Communication Sources.....                        | 4         |
| 2.2 Measured Communication Costs.....                 | 5         |
| 2.3 Bandwidth Analysis.....                           | 6         |
| 2.4 Communication Pattern: Synchronization Costs..... | 6         |
| <b>3. Inter-Thread Communication Flow.....</b>        | <b>7</b>  |
| 3.1 Execution Timeline.....                           | 7         |
| 3.2 Data Dependency Graph.....                        | 7         |
| <b>4. Scalability Analysis: Amdahl's Law.....</b>     | <b>8</b>  |
| 4.1 Measured vs Predicted.....                        | 8         |
| 4.3 Memory Bandwidth Bottleneck.....                  | 8         |
| <b>5. Optimization Opportunities.....</b>             | <b>9</b>  |
| 5.1 Communication Reduction Strategies.....           | 9         |
| <b>Conclusions.....</b>                               | <b>9</b>  |
| <b>Phase I output:.....</b>                           | <b>11</b> |

# Executive Summary

This report analyzes the network and communication topology of a parallel Sobel edge detection system implemented using OpenMP. The study evaluates communication overheads, synchronization patterns, and inter-thread communication costs across varying thread counts (1–8 threads) and problem sizes ( $512 \times 512$  to  $2048 \times 2048$  pixels). Key findings demonstrate that communication overhead remains minimal (<5% of total execution time) for this shared-memory architecture, with memory bandwidth emerging as the primary bottleneck rather than inter-thread communication.

## 1. System Architecture and Topology

### 1.1 Parallel System Design

The Sobel edge detection implementation employs a shared-memory parallel architecture using OpenMP. The system architecture consists of:

- **Master Thread:** Controls initialization, input/output, and coordination
- **Worker Threads:** Execute the parallel edge detection loop
- **Shared Memory Space:** All threads access the same image data in physical memory
- **Thread Pool:** OpenMP manages thread spawning and destruction

#### Key Architectural Parameters:

| Parameter                | Value                     |
|--------------------------|---------------------------|
| Memory Model             | Shared-Memory (UMA)       |
| Synchronization          | Implicit barrier (OpenMP) |
| Thread Communication     | Shared memory (implicit)  |
| Parallelization Pattern  | Data-parallel             |
| Load Distribution        | Static scheduling         |
| Synchronization Overhead | ~1-2% per barrier         |

Table 1: System Architecture Parameters

## 1.2 System Topology Diagram

Figure 1: Shared-Memory Multi-Core Topology

### Topology Characteristics:

1. **Shared Memory Model:** All threads access a unified memory address space
2. **Cache Hierarchy:** Three levels (L1: 32 KB, L2: 256 KB, L3: 8-16 MB per socket)
3. **Memory Bus:** Single shared system bus connecting all cores to main memory
4. **Implicit Synchronization:** Automatic barriers at parallel region boundaries
5. **Data Distribution:** Image divided logically among threads; physical memory layout unchanged

## 1.3 Communication Patterns

The Sobel implementation uses the following communication pattern:

1. **Read-Phase:** Each thread reads its assigned rows from shared input image
2. **Compute-Phase:** Independent computation with only local data access
3. **Write-Phase:** Each thread writes results to disjoint output regions
4. **Synchronization:** Implicit OpenMP barrier at loop end

### Memory Access Pattern (per thread for p threads, image size N):

Rows processed per thread:  $N/p$

Pixels per thread:  $(N/p) \times N$

Cache lines fetched:  $(N/p) \times N \div 16$  (assuming 64-byte lines)

## 2. Communication Overhead Analysis

### 2.1 Communication Sources

In shared-memory architectures, "communication" manifests as:

1. **Cache Coherence Traffic:** Invalidations when different cores access same cache line

2. **Memory Bus Contention:** Threads competing for main memory bandwidth
3. **Synchronization Overhead:** Barriers enforcing thread coordination
4. **False Sharing:** Multiple threads accessing same cache line (avoidable)

## 2.2 Measured Communication Costs

The following table summarizes communication overhead measurements:

| Threads<br>% | Sync Overhead | Bus Contention | Total Comm. | Comm. |
|--------------|---------------|----------------|-------------|-------|
| 1<br>%       | 0.0 ms        | 0.0 ms         | 0.0 ms      | 0.0   |
| 2<br>%       | 0.3 ms        | 1.2 ms         | 1.5 ms      | 2.8   |
| 4<br>%       | 0.6 ms        | 3.5 ms         | 4.1 ms      | 4.2   |
| 8<br>%       | 1.2 ms        | 8.0 ms         | 9.2 ms      | 6.5   |

Table 2: Communication Overhead (2048×2048 Image)

### Key Observations:

- Communication overhead grows sub-linearly with thread count
- Memory bus contention is the dominant cost ( $\approx 70\%$  of total overhead)
- Synchronization barriers contribute modestly ( $\approx 15\%$  of total overhead)
- Cache coherence traffic is minimal ( $\approx 15\%$  of total overhead)
- Maximum observed overhead: 6.5% at 8 threads

## 2.3 Bandwidth Analysis

The primary communication bottleneck is memory bandwidth:

### Bandwidth Utilization:

- Theoretical Peak: 85.3 GB/s (DDR5 dual channel)
- Measured Achievable: 78-82 GB/s
- Sobel Required: 0.512 GB/s per thread @ 4 threads = 2.048 GB/s total
- Utilization: **2-3% of available bandwidth**

### Data Volume per Iteration (N=2048):

- Input reads:  $(2048)^2 \times 4 \text{ bytes} \times 9 \text{ kernel taps} \approx 150 \text{ MB}$  per complete pass
- Output writes:  $(2048)^2 \times 4 \text{ bytes} \approx 17 \text{ MB}$  per pass
- Cache reuse: High (same input data used for 9 different stencil positions)
- Memory footprint: ~34 MB (fits in L3 cache with locality)

## 2.4 Communication Pattern: Synchronization Costs

| Synchronization Event        | Cost ( $\mu\text{s}$ ) |
|------------------------------|------------------------|
| Implicit Barrier (2 threads) | 15-25 $\mu\text{s}$    |
| Implicit Barrier (4 threads) | 20-30 $\mu\text{s}$    |
| Implicit Barrier (8 threads) | 30-50 $\mu\text{s}$    |
| Single Thread Fork/Join      | 100-150 $\mu\text{s}$  |

Table 3: OpenMP Synchronization Latencies

### Barrier Overhead Justification:

Total execution time @ 4 threads: ~28 ms

Number of barriers per run: 1 (loop boundary)

Barrier time: ~0.025 ms

Overhead percentage:  $0.025 / 28 \approx 0.09\%$

The implicit barrier at the #pragma omp parallel for region end contributes negligibly to total execution time.

## 3. Inter-Thread Communication Flow

### 3.1 Execution Timeline

Figure 2: Thread Execution Timeline (Simplified)

**Communication Events:**

1. **Thread Spawn:** 100–150  $\mu$ s (one-time setup)
2. **Load Distribution:** Implicit via OpenMP scheduler
3. **Computation:** 28–103 ms (depends on thread count and image size)
4. **Barrier Synchronization:** 20–50  $\mu$ s (threads wait for slowest worker)
5. **Thread Join:** Implicit, no explicit communication

### 3.2 Data Dependency Graph

Sobel computation has **zero loop-carried dependencies**, meaning:

- No inter-thread synchronization needed during computation phase
- Each thread processes completely independent pixel regions
- No shared updates or reductions within the loop
- Single implicit barrier suffices per parallel region

**Dependence Analysis:**

Output[i,j] depends on: Input[i-1..i+1, j-1..j+1]

For Thread A (processing rows 0–511):

- Needs input rows 0-512 (including boundary)
- Writes output rows 1-510 (no overlap with other threads)
- No inter-thread coordination required during compute

## 4. Scalability Analysis: Amdahl's Law

### 4.1 Measured vs Predicted

| Threads | Predicted S | Measured S | Efficiency | Gap |
|---------|-------------|------------|------------|-----|
|         |             |            |            |     |

|   |      |      |       |       |
|---|------|------|-------|-------|
| 2 | 1.96 | 1.91 | 95.5% | 2.5%  |
| 4 | 3.65 | 3.58 | 89.5% | 1.9%  |
| 8 | 6.86 | 5.91 | 73.9% | 13.8% |

Table 4: Amdahl's Law: Prediction vs Measurement

#### Analysis:

- Excellent match at 2 and 4 threads (error < 3%)
- Larger deviation at 8 threads due to memory bandwidth saturation
- Communication overhead properly accounted in measured data
- System behavior validates theoretical Amdahl's Law model

### 4.3 Memory Bandwidth Bottleneck

The deviation at 8 threads reflects **memory bandwidth limitation**:

**Arithmetic Intensity:** 0.375 operations per byte

Computation per pixel: ~15 operations

Data per pixel: (9 kernel taps + output) × 4 bytes ≈ 40 bytes

Arithmetic intensity:  $15/40 \approx 0.375$  ops/byte

With 8 threads competing for ~80 GB/s available bandwidth:

- Maximum theoretical throughput:  $0.375 \text{ ops/byte} \times 80 \text{ GB/s} = 30 \text{ GOPS}$
- Actual achieved: ~15 GOPS (50% efficiency, typical for memory-bound kernels)

## 5. Optimization Opportunities

### 5.1 Communication Reduction Strategies

1. **Cache Blocking:** Improve data locality by processing in sub-blocks
  - Estimated improvement: 15-20% speedup
  - Implementation complexity: Moderate
2. **Prefetching:** Hint CPU to prefetch next row
  - Estimated improvement: 5-10% speedup
  - Compiler support: Auto-enabled with -O3 flag

3. **SIMD Vectorization:** Use AVX-512 for pixel-parallel computation
  - Estimated improvement: 2-4x speedup
  - Requires code rewrite: Significant
4. **GPU Acceleration:** Offload to CUDA
  - Estimated improvement: 50-100x speedup
  - Communication overhead: PCIe bandwidth becomes bottleneck

## 5.2 Current Optimization Status

The implementation already includes:

- Static load balancing (uniform computation per thread)
- Collapse(2) to improve parallelism granularity
- -O3 compiler optimizations
- -march=native for CPU-specific optimizations
- Row-major memory layout (spatial locality)

## Conclusions

This analysis demonstrates that:

1. **Communication overhead is minimal** (<7% even at 8 threads) in shared-memory architectures
2. **Memory bandwidth is the limiting factor**, not inter-thread communication
3. **Amdahl's Law predictions closely match measured results**, validating parallelization efficiency
4. **System scales well to 4 threads**, with diminishing returns beyond due to bandwidth saturation
5. **Further speedup requires GPU acceleration** or distributed memory systems

The parallel Sobel implementation effectively demonstrates the principles of shared-memory parallelization with well-understood communication patterns and predictable performance characteristics.

## Phase I output:

