

**Closed-form Processor Equivalence And  
Scheduling Divisible Workloads From Multiple  
Sources In Regular, Toroidal Mesh And  
Hypercube Network**

A Dissertation Presented

by

**Junwei Zhang**

to

The Graduate School

in Partial Fulfillment of the

Requirements

for the Degree of

**Doctor of Philosophy**

in

**Applied Mathematics and Statistics**

Stony Brook University

**August 2018**

**Stony Brook University**

The Graduate School

**Junwei Zhang**

We, the dissertation committee for the above candidate for the  
Doctor of Philosophy degree, hereby recommend  
acceptance of this dissertation.

**Thomas G. Robertazzi - Dissertation Advisor**  
**Professor, Department of Electrical and Computer Engineering**

**Joseph S.B. Mitchell - Chairperson of Defense**  
**Professor, Department of Applied Mathematics and Statistics**

**Esther M. Arkin - Member**  
**Professor, Department of Applied Mathematics and Statistics**

**Yue Zhao - Member**  
**Professor, Department of Electrical and Computer Engineering**

This dissertation is accepted by the Graduate School.

Charles Taber  
Dean of the Graduate School

Abstract of the Dissertation

**Closed-form Processor Equivalence And  
Scheduling Divisible Workloads From Multiple  
Sources In Regular, Toroidal Mesh And  
Hypercube Network**

by

**Junwei Zhang**

**Doctor of Philosophy**

in

**Applied Mathematics and Statistics**

Stony Brook University

**2018**

Abstract Here

*Key Words:*

*To my Parents and all loving ones*

# Table of Contents

|                                                        |             |
|--------------------------------------------------------|-------------|
| <b>List of Figures</b> . . . . .                       | <b>xi</b>   |
| <b>List of Tables</b> . . . . .                        | <b>xii</b>  |
| <b>Acknowledgements</b> . . . . .                      | <b>xiii</b> |
| <b>1 Introduction</b> . . . . .                        | <b>1</b>    |
| <b>2 Problem Description</b> . . . . .                 | <b>3</b>    |
| 2.1 Problem Background . . . . .                       | 3           |
| 2.2 Definitions and Assumption . . . . .               | 5           |
| 2.2.1 Notions . . . . .                                | 7           |
| <b>3 Regular network</b> . . . . .                     | <b>9</b>    |
| 3.1 With Front-end Scenario . . . . .                  | 9           |
| 3.1.1 Data Injection on The Corner Processor . . . . . | 10          |
| 3.2 Without Front End Scenario . . . . .               | 37          |
| 3.2.1 Regular network . . . . .                        | 37          |
| 3.2.2 Toroidal network . . . . .                       | 54          |
| 3.2.3 General Case . . . . .                           | 54          |

|          |                                                                                            |           |
|----------|--------------------------------------------------------------------------------------------|-----------|
| 3.3      | Comparison Result Between Front End Processor and Without<br>Front End Processor . . . . . | 56        |
| 3.3.1    | Regular network . . . . .                                                                  | 56        |
| 3.3.2    | Data Injection on Boundary Processor . . . . .                                             | 60        |
| 3.3.3    | Data Injection on Inner Grid Processor . . . . .                                           | 61        |
| 3.4      | Without Front End Scenario . . . . .                                                       | 62        |
| 3.4.1    | Data Injection on Corner Processor . . . . .                                               | 62        |
| 3.4.2    | Data Injection on Boundary Processor . . . . .                                             | 63        |
| 3.4.3    | Data Injection on Inner Grid Processor . . . . .                                           | 64        |
| 3.5      | Comparison Result With Corner Processor and Inner Grid Pro-<br>cessor . . . . .            | 65        |
| 3.6      | Comparison Result With Front End and Without Front End .                                   | 66        |
| 3.7      | Store and Froward Switching . . . . .                                                      | 67        |
| <b>4</b> | <b>Toroidal Network . . . . .</b>                                                          | <b>68</b> |
| 4.0.1    | Torus mesh . . . . .                                                                       | 68        |
| 4.0.2    | Toroidal . . . . .                                                                         | 73        |
| 4.1      | Front End Scenario . . . . .                                                               | 74        |
| 4.1.1    | Data Injection on Corner . . . . .                                                         | 74        |
| 4.1.2    | Toroidal . . . . .                                                                         | 76        |
| 4.2      | Comparison Result Between Regular Mesh and Toroidal . . .                                  | 77        |
| <b>5</b> | <b>Hypercube Network . . . . .</b>                                                         | <b>78</b> |
| 5.1      | Even Data Fraction . . . . .                                                               | 78        |
| 5.1.1    | Subgraph Data Injection . . . . .                                                          | 79        |

|          |                                               |           |
|----------|-----------------------------------------------|-----------|
| 5.1.2    | Individual Data Injection Positions . . . . . | 81        |
| 5.1.3    | Toroidal . . . . .                            | 85        |
| <b>6</b> | <b>General Network</b> . . . . .              | <b>93</b> |
| 6.0.1    | General Case . . . . .                        | 93        |
| <b>7</b> | <b>Conclusion and Future work</b> . . . . .   | <b>95</b> |
|          | <b>Bibliography</b> . . . . .                 | <b>96</b> |

## List of Figures

|      |                                                                                                          |    |
|------|----------------------------------------------------------------------------------------------------------|----|
| 2.1  | A m*n regular network(m = 5, n = 5) . . . . .                                                            | 4  |
| 2.2  | A toroidal rectangle network with grid unit cores . . . . .                                              | 5  |
| 2.3  | A hypercube network . . . . .                                                                            | 5  |
| 3.1  | The 2*2 regular network and the root processor is $P_0$ . . . . .                                        | 10 |
| 3.2  | The timing diagram for 2*2 regular network and the root processor is $P_0$ . . . . .                     | 11 |
| 3.3  | 2*2 regular network. $\alpha_0, \alpha_1, \alpha_2, \alpha_3$ value curve . . . . .                      | 13 |
| 3.4  | The 2*3 regular network and the data injection happens on corner processor $P_0$ . . . . .               | 14 |
| 3.5  | The timing diagram for a 2*3 regular network and the data injection happens on processor $P_0$ . . . . . | 15 |
| 3.6  | 2*2 regular network. $\alpha_0, \alpha_1, \alpha_3, \alpha_5$ data fraction value . . . . .              | 18 |
| 3.7  | The 2*n (n = 10) regular network and the workload happens on $P_0$ .                                     | 19 |
| 3.8  | The timing diagram for 2*10 regular network and the data injection happens on $P_0$ . . . . .            | 20 |
| 3.9  | 3*8 regular network. The data injection position is $P_0$ . . . . .                                      | 26 |
| 3.10 | 5*5 regular network. The data injection position is $P_0$ . . . . .                                      | 26 |

|      |                                                                                                                   |    |
|------|-------------------------------------------------------------------------------------------------------------------|----|
| 3.11 | The $3*3$ regular network and the root processor is $P_0$                                                         | 28 |
| 3.12 | The timing diagram for $3*3$ regular network and the data injection occurs on $P_0$                               | 29 |
| 3.13 | The data fraction simulation result of $3*3$ regular network and the data injection happens on the boundary $P_0$ | 32 |
| 3.14 | $3*3$ regular network. The data injection position is inner grid point $P_0$                                      | 33 |
| 3.15 | The timing diagram for $3*3$ regular network and the data injection is inner grid $P_0$                           | 34 |
| 3.16 | $3*3$ regular network. The data injection position is inner grid point $P_0$                                      | 36 |
| 3.17 | The timing diagram for $2*2$ regular network without front end.                                                   | 38 |
| 3.18 | The data fraction deployed based on the radius value                                                              | 40 |
| 3.19 | The timing diagram for $2*3$ regular network without front end.                                                   | 42 |
| 3.20 | The data fraction deployed based on the radius value                                                              | 44 |
| 3.21 | The timing diagram for $3*3$ boundary data injection on $P_0$                                                     | 49 |
| 3.22 | The fraction curve for $3*3$ boundary data injection on $P_0$                                                     | 51 |
| 3.23 | The timing diagram for $3*3$ inner grid injection $P_0$                                                           | 53 |
| 3.24 | The data fraction deployed based on the radius value                                                              | 55 |
| 3.25 | The comparing result between front-end processor with without front-end processor in $2*2$ regular network        | 57 |
| 3.26 | The comparing result between front-end processor with without front-end processor in $2*4$ regular network        | 58 |

|      |                                                                                                                                                      |    |
|------|------------------------------------------------------------------------------------------------------------------------------------------------------|----|
| 3.27 | The comparing result between front-end processor with without front-end processor in 3*3 regular network injection on boundary processor . . . . .   | 59 |
| 3.28 | The comparing result between front-end processor with without front-end processor in 3*3 regular network injection on inner grid processor . . . . . | 60 |
| 3.29 | Sensitivity analysis result of 3*8 regular network and the injection position on boundary processor $P_2$ . . . . .                                  | 61 |
| 3.30 | Sensitivity analysis result of data injection position on inner grid processor . . . . .                                                             | 62 |
| 3.31 | Sensitivity analysis result of 2*10 regular network result . . . . .                                                                                 | 63 |
| 3.32 | Sensitivity analysis result of 3*8 regular network result . . . . .                                                                                  | 63 |
| 3.33 | Sensitivity analysis result of data injection position on inner grid processor . . . . .                                                             | 64 |
| 3.34 | Speedup difference between corner injection and inner grid injection                                                                                 | 65 |
| 3.35 | Speedup difference between front end and without front end in 5*5 regular network . . . . .                                                          | 66 |
| 4.1  | The rectangular toroidal network . . . . .                                                                                                           | 69 |
| 4.2  | The rectangular toroidal network . . . . .                                                                                                           | 69 |
| 4.3  | The $m \times n$ toroidal network and the data injection is $P_{4,2}$ . . . . .                                                                      | 71 |
| 4.4  | The data fraction deployed based on the radius value . . . . .                                                                                       | 73 |
| 4.5  | The comparing result between front-end processor with without front-end processor in 6*6 regular mesh injection on inner grid processor . . . . .    | 74 |

|      |                                                                                          |    |
|------|------------------------------------------------------------------------------------------|----|
| 4.6  | Sensitivity analysis result of 2*10 regular mesh result . . . . .                        | 75 |
| 4.7  | Sensitivity analysis result of data injection position on inner grid processor . . . . . | 76 |
| 5.1  | Data injection consists of a subgraph of $G$ . . . . .                                   | 79 |
| 5.2  | Data injection consists of a subgraph of $G$ . . . . .                                   | 79 |
| 5.3  | Speedup vs $\sigma$ . . . . .                                                            | 81 |
| 5.4  | 10 Voronoi Cells . . . . .                                                               | 82 |
| 5.5  | 10 Voronoi Cells's speedup curves . . . . .                                              | 83 |
| 5.6  | 10 Voronoi Cells . . . . .                                                               | 85 |
| 5.7  | 10 Voronoi Cells's speedup curves . . . . .                                              | 86 |
| 5.8  | Reduced Voronoi Division Algorithm average processors' percentage                        | 87 |
| 5.9  | How to calculate torus Voronoi Diagram . . . . .                                         | 88 |
| 5.10 | Initial Voronoi Digram . . . . .                                                         | 89 |
| 5.11 | Torus Voronoi Diagram . . . . .                                                          | 90 |
| 5.12 | Voronoi Diagram Casting to the torus model . . . . .                                     | 90 |
| 5.13 | Voronoi Diagram Casting to the torus model . . . . .                                     | 91 |
| 5.14 | Torus Reduced Voronoi Diagram Casting to the Torus Model . .                             | 91 |
| 5.15 | Torus Reduced Voronoi Diagram . . . . .                                                  | 92 |

## **List of Tables**

|     |                                                 |    |
|-----|-------------------------------------------------|----|
| 4.1 | The processor number of various $D_i$ . . . . . | 72 |
| 4.2 | The processor number of various $D_i$ . . . . . | 76 |

## **Acknowledgements**

Thank you for Professor Thomas G. Robertazzi who gives me strong academic and emotional support in the past hard time period.

# **Chapter 1**

## **Introduction**

Four chapters are organized as follows :

- A problem introduction, notation and definitions are shown in Chapter II.
- Chapter III considers about the regular network. We take into account of the processor equivalence problem with front-end scenario and without front-end situation. The sensitivity analysis is another topic. Considering about the multi-source workload assignment problem, we propose a reduced Voronoi diagram assignment.
- Chapter IV presents the toroidal rectangle network situation. We consider the closed-form of processor equivalence in unique load injection, in different injection positions, corner processor, boundary processor and inner grid processor, respectively. The sensitivity analysis and multi-source assignment problem are also referred.
- Chapter V investigates the hypercube network situation. The front-end

and without front-end assumption are discussed. The sensitivity analysis is referred and the multi-source assignment in hypercube environment is also an interesting problem.

- Chapter VI figures out the conclusion and future work.

# **Chapter 2**

## **Problem Description**

### **2.1 Problem Background**

In large-scale data intensive problems with geographically distributed resources, load is generated from multiple sources[1] for a class of problems. It is assumed that the problem representation can be divided amongst the processors. Thus the problem representation is said to be “divisible”. The processing of massive amounts of data on distributed and parallel networks is becoming more and more common. The problem of minimizing the processing time of extensive loads originating from a multiplicity of sources and being processed on a multiplicity of nodes presents a challenge.

In this chapter, the closed-form processor equivalence[2][3] problem in the grid network of regular network and toroidal rectangle network is discussed. Also, the multi-source workload assignment is also taken into account.

In this thesis, we investigate two problems. One is the processor equivalence. The other one is scheduling divisible workloads from multiple sources in regular network Fig. 3.10, toroidal rectangle network Fig. 2.2 and hypercube

network Fig. 2.3.



Figure 2.1: A  $m \times n$  regular network ( $m = 5, n = 5$ )



Figure 2.2: A toroidal rectangle network with grid unit cores



Figure 2.3: A hypercube network

## 2.2 Definitions and Assumption

### Definition 2.2.1. Equal Computation

*Equal computation is a technique, which considers combining a cluster of processors as one whole processor to process the unit 1 workload.*

The following assumptions are used throughout the paper:

- The virtual cut through [4] switching is used to transmit the assigned workload between processors.
- For simplicity, we do not consider return communications.
- The communication delays are taken into consideration.
- The time costs of computation and communication are assumed to be linear function of the data size.
- The network environment is homogeneous, that is, all the processors have the same computation capacity. The link speeds between any two unit cores are identical.
- The number of outgoing ports in each processor is limited. In NOC(network on chip), the port number is fixed 4 or 5.
- The general graph's grid node's in-degree and out-degree is 4 or 5.

The optimization objective functions are as follows :

- Equal computation : the problem's objective function is how to partition and schedule the workloads amongst the processors to get the minimum finish time.
- Multi-source assignment : how to finish the unit 1 workload at the same time utilizing smaller processor.

To achieve the minimum solution is obtained by forcing the processors over a network to stop processing simultaneously. Intuitively, this is because the solution could be improved by transfer load from some busy processor to idle ones.

### 2.2.1 Notions

The following notations and definitions are utilized:

- $P_i$ : The  $i$ th processor.  $0 \leq i \leq m * n - 1$ .
- $L_i$ : The  $i$ th work load.  $1 \leq i \leq k$ .
- $D_i$ : The minimum number of hops from the processor  $P_i$  to the data load injection  $L$ .
- $level_i$ : The processors have  $i$  minimum Manhattan distance to the data injection.
- $\alpha_0$ : The load fraction assigned to the root processor.
- $\alpha_i$ : The load fraction assigned to the  $i$ th processor.
- $\omega_i$ : The inverse computing speed on the  $i$ th processor.
- $\omega_{eq}$ : The inverse computing speed on an equivalent node collapsed from a cluster of processors.
- $z_i$ : The inverse link speed on the  $i$ th link.

- $T_{cp}$ : Computing intensity constant. The entire load is processed in  $\omega_i T_{cp}$  on the  $i$ th processor.
- $T_{cm}$ : Communication intensity constant. The entire load is transmitted in  $z_i T_{cm}$  seconds over the  $i$ th link.
- $T_{f,n}$ : The finish time of the whole processor network. Here  $T_{f,n}$  is equal to  $\omega_{eq} T_{cp}$ .
- $T_{f,0}$ : The finish time for the entire divisible load solved on the root processor. Here  $T_{f,0}$  is equal to  $1 \times \omega_0 T_{cp}$ , that is  $\omega_0 T_{cp}$ .
- $\sigma = \frac{z T_{cm}}{\omega T_{cp}}$ : The ratio between the communication speed to the computation speed,  $0 < \sigma < 1$  [5] [6].
- In multi-source situation,  $\sum_{i=1}^k L_i = 1$
- $\sum_{i=0}^{m*n-1} \alpha_i = 1$
- $Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0}$

# **Chapter 3**

## **Regular network**

### **3.1 With Front-end Scenario**

In the front-end environment, the communication and the computation is executed simultaneously. That is, upon receiving their respective load fractions, the processors start processing their own workload and rely all the other fractions to the neighbor processors at the same time.

First we consider about the  $2 * 2$  regular network,  $2 * n$  regular network. After, we analyze a more general case  $m * n$  regular network and obtain a general closed-form matrix presentation. Finally, we give a key principle to address this type of question. In addition, different data injection position, such as the corner, boundary and inner grid are also discussed.

### 3.1.1 Data Injection on The Corner Processor

#### 2\*2 Regular Network

The  $L$  is assigned on the corner processor  $P_0$  Fig. 3.1. The whole task is tackled by four processors  $P_0, P_1, P_2, P_3$  together.



Figure 3.1: The 2\*2 regular network and the root processor is  $P_0$

The processor  $P_0, P_1$  and  $P_2$  start to process its respective fraction at the same time. The processor  $P_3$  starts to work until the  $\alpha_1$  and  $\alpha_2$  are completed transmission.

According to the divisible load theory[7], we obtain the timing diagram Fig. 3.2.

Based on the timing diagram, we get a group of equations to deploy the fraction workload:



Figure 3.2: The timing diagram for  $2^*2$  regular network and the root processor is  $P_0$

$$\left\{ \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \\ \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 = 1 \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1, \alpha_2, \alpha_3 < 1 \end{array} \right. \quad \begin{array}{l} (3.1) \\ (3.2) \\ (3.3) \\ (3.4) \\ (3.5) \\ (3.6) \\ (3.7) \\ (3.8) \\ (3.9) \end{array}$$

The group of equations are represented by the matrix form:

$$\begin{bmatrix} 1 & 2 & 1 \\ 1 & -1 & 0 \\ 0 & \sigma - 1 & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad (3.10)$$

The matrix is represented as  $A \times \alpha = b$ .  $A$  is named as ***flow matrix***.

Finally, the explicit solution is:

$$\sigma = \frac{zT_{cm}}{\omega T_{cp}} \quad (3.11)$$

$$\alpha_0 = \frac{1}{4 - \sigma} \quad (3.12)$$

$$\alpha_1 = \frac{1}{4 - \sigma} \quad (3.13)$$

$$\alpha_3 = \frac{1 - \sigma}{4 - \sigma} \quad (3.14)$$

The simulation result is illustrated:



Figure 3.3:  $2^*2$  regular network.  $\alpha_0, \alpha_1, \alpha_2, \alpha_3$  value curve

In Fig. 3.3,  $P_0, P_1, P_2$  three processors have the same data fraction workload, so the curve of  $\alpha_0$  and  $\alpha_1$  coincide. The figure says that  $\sigma$  grows and  $\alpha_3$  drops simultaneously. In other words, if the communication capacity drops down, there is less data workload assigned to  $P_3$ . Further, it means it will be economical to keep the load local nor distribute to other processors.

The speedup is:

$$Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = 4 - \sigma$$

## 2\*3 Regular Network

In Fig. 3.4 regular network,  $L$  happens on processor  $P_0$ . There are 6 processors to take the responsibility.



Figure 3.4: The 2\*3 regular network and the data injection happens on corner processor  $P_0$

$P_0$ ,  $P_1$  and  $P_2$  start processing at the same time. Processor  $P_3$  and  $P_4$  start to work until they receive the data from processor  $P_1$ ,  $P_2$ . That is,  $P_3$  and  $P_4$  have to wait the fraction of  $\alpha_1$  and  $\alpha_2$  are transmitted completely. The last processor  $P_5$  starts to execute until the work load fraction  $\alpha_0$ ,  $\alpha_1$ ,  $\alpha_2$ ,  $\alpha_3$ ,  $\alpha_4$  are transmitted completed. According to the divisible load theory[7], we obtain the timing diagram Fig. 3.5.



Figure 3.5: The timing diagram for a  $2^*3$  regular network and the data injection happens on processor  $P_0$

The equations as follows:

$$\left. \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \\ \alpha_2 z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_3) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \\ \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1 \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5 < 1 \end{array} \right\} \quad (3.15)$$

$$\alpha_1 \omega T_{cp} = T_{f,m} \quad (3.16)$$

$$\alpha_2 \omega T_{cp} = T_{f,m} \quad (3.17)$$

$$\alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \quad (3.18)$$

$$\alpha_2 z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \quad (3.19)$$

$$(\alpha_1 + \alpha_3) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \quad (3.20)$$

$$\alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1 \quad (3.21)$$

$$\sigma = \frac{z T_{cm}}{\omega T_{cp}} \quad (3.22)$$

$$0 < \sigma < 1 \quad (3.23)$$

$$0 < \alpha_0 \leq 1 \quad (3.24)$$

$$0 \leq \alpha_1, \alpha_2, \alpha_3, \alpha_4, \alpha_5 < 1 \quad (3.25)$$

The flow matrix closed-form formula is:

$$\begin{bmatrix} 1 & 2 & 2 & 1 \\ 1 & -1 & 0 & 0 \\ 0 & \sigma - 1 & 1 & 0 \\ 0 & \sigma - 1 & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.26)$$

The explicit solution is:

$$\left\{ \begin{array}{l} \sigma = \frac{zT_{cm}}{\omega T_{cp}} \end{array} \right. \quad (3.27)$$

$$\left\{ \begin{array}{l} \alpha_0 = \frac{1}{\sigma^2 - 4 \times \sigma + 6} \end{array} \right. \quad (3.28)$$

$$\left\{ \begin{array}{l} \alpha_1 = \frac{1}{\sigma^2 - 4 \times \sigma + 6} \end{array} \right. \quad (3.29)$$

$$\left\{ \begin{array}{l} \alpha_3 = \frac{1 - \sigma}{\sigma^2 - 4 \times \sigma + 6} \end{array} \right. \quad (3.30)$$

$$\left\{ \begin{array}{l} \alpha_5 = \frac{\sigma^2 - 2 \times \sigma + 1}{\sigma^2 - 4 \times \sigma + 6} \end{array} \right. \quad (3.31)$$

The simulation result are shown in Fig. 3.6.  $P_0$ ,  $P_1$  have the same fraction so the curve of  $\alpha_0$  and  $\alpha_1$  coincide.



Figure 3.6:  $2^*2$  regular network.  $\alpha_0, \alpha_1, \alpha_3, \alpha_5$  data fraction value

### 2\*N Regular Network

The  $2 * n$  Fig. 3.7 homogeneous regular network address  $L_1$  at the same time and  $L$  happens on  $P_0$ .



Figure 3.7: The  $2^*n$  ( $n = 10$ ) regular network and the workload happens on  $P_0$

Similarly to the analysis of Fig. 3.2 and Fig. 3.5, the timing diagram for

Fig. 3.7 is shown in Fig. 3.8



Figure 3.8: The timing diagram for  $2 \times 10$  regular network and the data injection happens on  $P_0$

The equations are presented as:

$$\left\{ \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \quad (3.32) \\ \alpha_1 \omega T_{cp} = T_{f,m} \quad (3.33) \\ \alpha_2 \omega T_{cp} = T_{f,m} \quad (3.34) \\ \alpha_1 z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \quad (3.35) \\ \alpha_2 z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \quad (3.36) \\ (\alpha_1 + \alpha_3) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \quad (3.37) \\ \vdots \quad (3.38) \\ (\alpha_1 + \alpha_3 + \cdots + \alpha_{2 \times n-1}) z T_{cm} + \alpha_{2 \times n-1} \omega T_{cp} = T_{f,m} \quad (3.39) \\ \alpha_0 + \cdots + \alpha_{2 \times n-1} = 1 \quad (3.40) \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \quad (3.41) \\ 0 < \sigma < 1 \quad (3.42) \\ 0 < \alpha_0 \leq 1 \quad (3.43) \\ 0 \leq \alpha_1 \alpha_2 \cdots \alpha_{2 \times n-1} < 1 \quad (3.44) \end{array} \right.$$

The flow matrix closed-form is shown:

$$\begin{bmatrix} 1 & 2 & 2 & \cdots & 2 & 2 & 1 \\ 1 & -1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_5 \\ \vdots \\ \alpha_{2 \times n-3} \\ \alpha_{2 \times n-1} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \\ 0 \end{bmatrix} \quad (3.45)$$

According to the *Cramer's rule*, the explicit solution for the group of equations is:

$$\left\{ \alpha_i = \frac{\det A_i^*}{\det A} \right\} \quad (3.46)$$

where  $A_i^*$  is the matrix formed by replacing the  $i$ -th column of  $A$  by the column vector  $b$ .

Especially,

$$A_0^* = \begin{bmatrix} 1 & 2 & 2 & \cdots & 2 & 2 & 1 \\ 0 & -1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \quad (3.47)$$

$$\alpha_0 = \left| \frac{\det A_0^*}{\det A} \right|$$

$$\det A_0^* = -1$$

Finally, the speedup is:

$$Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = |-\det A|$$

Further, we prove the matrix  $\det A \neq 0$ .

$$C = \begin{bmatrix} -1 & 0 & \cdots & 0 & 0 & 0 \\ \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots \\ \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \quad (3.48)$$

$C$  is a lower triangular matrix and the diagonal elements are not 0. So  $C$  is non-degenerate, that is, the matrix is column linear independence.

After a series of column reduction and row reduction actions, we get

$$A = \begin{bmatrix} 1 & 2 & 2 & \cdots & 2 & 2 & 1 \\ 1 & -1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \xrightarrow{\text{Column Reduction}} \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 1 & -3 & -2 & \cdots & -2 & -2 & -1 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \ddots \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix}$$

$$\xrightarrow{\text{Row Reduction}} \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 & 0 & 0 \\ 0 & -3 & -2 & \cdots & -2 & -2 & -1 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix}$$

Considering the matrix  $\hat{C}$

$$\hat{C} = \begin{bmatrix} -3 & -2 & \cdots & -2 & -2 & -1 \\ \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots \\ \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \quad (3.49)$$

, which is still column linear independence. Considering  $0 < \sigma < 1$ , the flow matrix is full rank. So  $\det A \neq 0$ .

After three user cases' investigation, we find a crucial rule:

$$\forall D_i = D_j, \quad \text{then} \quad \alpha_i = \alpha_j, \quad 0 \leq i, j \leq m * n - 1$$

## $m*n$ Regular Network

Considering a general  $m * n$  regular network, such as Fig. 3.9 Fig. 3.10.



Figure 3.9:  $3*8$  regular network. The data injection position is  $P_0$



Figure 3.10:  $5*5$  regular network. The data injection position is  $P_0$

Utilizing the rule, we obtain the closed-form flow matrix equations for

Fig. 3.9:

$$\begin{bmatrix}
 1 & 2 & 3 & 3 & 3 & 3 & 3 & 2 & 1 \\
 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & 1 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & 1 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & 1 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & \sigma & 1 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & 1
 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_6 \\ \alpha_9 \\ \alpha_{12} \\ \alpha_{15} \\ \alpha_{18} \\ \alpha_{21} \\ \alpha_{23} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (3.50)$$

Also, the flow matrix equations for Fig. 3.10:

$$\begin{bmatrix}
 1 & 2 & 3 & 4 & 5 & 4 & 3 & 2 & 1 \\
 1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & 1 & 0 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & 1 & 0 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & 1 & 0 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & \sigma & 1 & 0 \\
 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & \sigma & \sigma & 1
 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_6 \\ \alpha_{10} \\ \alpha_{15} \\ \alpha_{19} \\ \alpha_{22} \\ \alpha_{24} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (3.51)$$

We use the similar method to prove  $\det A \neq 0$ , so the speedup is:

$$\text{Speedup} = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = |-\det A|$$

### Data Injection On the Boundary Processor

After the corner scenario, we extend the rule to boundary processor condition.

If the single data injection roots on the boundary processor, for example Fig. 3.11.



Figure 3.11: The  $3 \times 3$  regular network and the root processor is  $P_0$

The timing diagram is Fig. 3.12:



Figure 3.12: The timing diagram for  $3 \times 3$  regular network and the data injection occurs on  $P_0$

The equations are:

$$\left\{ \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 \omega T_{cp} = T_{f,m} \\ \alpha_3 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \\ \alpha_2 z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \\ \alpha_3 z T_{cm} + \alpha_6 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_4) z T_{cm} + \alpha_7 \omega T_{cp} = T_{f,m} \\ (\alpha_2 + \alpha_5) z T_{cm} + \alpha_8 \omega T_{cp} = T_{f,m} \\ \alpha_0 + \cdots + \alpha_8 = 1 \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1 \quad \alpha_2 \quad \cdots \quad \alpha_8 < 1 \end{array} \right. \begin{array}{l} (3.52) \\ (3.53) \\ (3.54) \\ (3.55) \\ (3.56) \\ (3.57) \\ (3.58) \\ (3.59) \\ (3.60) \\ (3.61) \\ (3.62) \\ (3.63) \\ (3.64) \\ (3.65) \\ (3.66) \end{array}$$

And the flow matrix form is :

$$\begin{bmatrix} 1 & 3 & 3 & 2 \\ 1 & -1 & 0 & 0 \\ 0 & \sigma - 1 & 1 & 0 \\ 0 & \sigma - 1 & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_4 \\ \alpha_7 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.67)$$

The explicit solution is:

$$\left\{ \begin{array}{l} \alpha_0 = \frac{1}{9 - 7 \times \sigma + 2 \times \sigma^2} \end{array} \right. \quad (3.68)$$

$$\left\{ \begin{array}{l} \alpha_1 = \frac{1}{9 - 7 \times \sigma + 2 \times \sigma^2} \end{array} \right. \quad (3.69)$$

$$\left\{ \begin{array}{l} \alpha_4 = \frac{1 - \sigma}{9 - 7 \times \sigma + 2 \times \sigma^2} \end{array} \right. \quad (3.70)$$

$$\left\{ \begin{array}{l} \alpha_7 = \frac{(1 - \sigma)^2}{9 - 7 \times \sigma + 2 \times \sigma^2} \end{array} \right. \quad (3.71)$$

$$\left\{ \begin{array}{l} \alpha_0 = \frac{1}{9 - 7 \times \sigma + 2 \times \sigma^2} \\ \alpha_1 = \frac{1}{9 - 7 \times \sigma + 2 \times \sigma^2} \\ \alpha_4 = \frac{1 - \sigma}{9 - 7 \times \sigma + 2 \times \sigma^2} \\ \alpha_7 = \frac{(1 - \sigma)^2}{9 - 7 \times \sigma + 2 \times \sigma^2} \end{array} \right. \quad (3.72)$$

The simulation result is shown:  $P_0$  and  $P_1$  have the same  $\alpha$ , so the curve of  $\alpha_0$  and  $\alpha_1$  coincide.



Figure 3.13: The data fraction simulation result of  $3 \times 3$  regular network and the data injection happens on the boundary  $P_0$

### Data Injection On The Inner Grid Processor

Fig. 3.14 shows that  $L$  loads on the inner grid processor  $P_0$ ,



Figure 3.14:  $3 \times 3$  regular network. The data injection position is inner grid point  $P_0$

The timing diagram for this user case is illustrated as Fig. 3.15:



Figure 3.15: The timing diagram for  $3 \times 3$  regular network and the data injection is inner grid  $P_0$

The group of equations are:

$$\left\{ \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 \omega T_{cp} = T_{f,m} \\ \alpha_3 \omega T_{cp} = T_{f,m} \\ \alpha_4 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_6 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_7 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_8 \omega T_{cp} = T_{f,m} \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1 \quad \alpha_2 \quad \dots \quad \alpha_8 < 1 \end{array} \right. \begin{array}{l} (3.73) \\ (3.74) \\ (3.75) \\ (3.76) \\ (3.77) \\ (3.78) \\ (3.79) \\ (3.80) \\ (3.81) \\ (3.82) \\ (3.83) \\ (3.84) \\ (3.85) \end{array}$$

The flow matrix form is :

$$\begin{bmatrix} 1 & 4 & 4 \\ 1 & -1 & 0 \\ 0 & \sigma - 1 & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad (3.86)$$

The simulation result is Fig. 3.16:  $P_0$  and  $P_1$  have the same  $\alpha$  value, so the curve of  $\alpha_0$  and  $\alpha_1$  coincide.



Figure 3.16:  $3 \times 3$  regular network. The data injection position is inner grid point  $P_0$

## 3.2 Without Front End Scenario

### 3.2.1 Regular network

In the without front-end scenario, unlike the previous protocol, the processors now simultaneously receive the data and solely start to process it as soon as each processor receives its entire load assignment[8].

#### Data Injection on Corner Processor

##### 2\*2 Regular network

This subsection concerns the processors without front-end. Because of without front-end, the processors simultaneously receive the data and solely start to process it as soon as each processor receives its entire load assignment. We consider the timing diagram for Fig. 3.1, Fig. 3.4, Fig. 3.7 and so on. In addition, we also give the new closed-form matrix equations for the previous user cases.

Also, the rule also plays a dominate role in establish the mathematics model.

The timing diagram of Fig. 3.1 is shown:

$P_0$  starts to process the assigned workload and it starts to transfer the  $\alpha_1$ ,  $\alpha_2$  and  $\alpha_3$  fraction workload after it totally receive its  $\alpha_0$  task. That is,  $P_1$  and  $P_2$  are idle until the  $L_1$  finish its data injection to  $P_0$ . The similar situation happens to  $P_1$  and  $P_2$  and they both starts to transmit the  $\alpha_3$  after they totally receive the appropriate workload. In other words,  $P_3$  has to wait until the previous two level processors obtains their own data.



Figure 3.17: The timing diagram for  $2^*2$  regular network without front end.

The corresponding group of equations are as follows:

$$\left\{ \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 z T_{cm} + \alpha_2 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_3) z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 = 1 \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1, \alpha_2, \alpha_3 < 1 \end{array} \right. \quad \begin{array}{l} (3.87) \\ (3.88) \\ (3.89) \\ (3.90) \\ (3.91) \\ (3.92) \\ (3.93) \\ (3.94) \\ (3.95) \end{array}$$

The group of equations imply that

$$\alpha_1 = \alpha_2$$

Further, the equations test and verify the rule again.

The matrix closed-form is presented as:

$$\begin{bmatrix} 1 & 2 & 1 \\ 1 & -(\sigma + 1) & 0 \\ 1 & -\sigma & -(\sigma + 1) \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad (3.96)$$

The simulation result for Fig. 3.1 is provided in Fig. 3.18:



Figure 3.18: The data fraction deployed based on the radius value

Fig. 3.18 explains that as the  $\sigma$  value grows up, the fraction is assigned to  $P_0$  increases and the fraction are distributed to  $level_1$  and  $level_2$  reduce. In other words, if the communication capability decreases, there are more data processed locally, which is reasonable. If the ability of the link degrades asymptotically equal to the processor computation capacity, there is solely 10% data is deployed to the  $level_2$ . In addition, if the  $\sigma > 1$ , it means the link transmitting power is less than the processor's processing ability. In this scenario, keeping the data locally is more economical than transmitting it.

## **2\*3 Regular network**

$P_0$  starts to process the assigned workload and it starts to transfer the  $\alpha_1$ ,  $\alpha_2$ ,  $\alpha_3$ ,  $\alpha_5$ ,  $\alpha_5$  fraction workload after it totally receive its  $\alpha_0$  task. That is,  $P_1$  and  $P_2$  are idle until the  $L_1$  finish its data injection to  $P_0$ . According to the level 1 The similar situation happens to  $P_1$  and  $P_2$  and they both starts to transmit the  $\alpha_3$  after they totally receive the appropriate workload. In other words,  $P_3$  has to wait until the previous three levels,  $level_0$ ,  $level_1$  and  $level_2$  processors obtain their own data.



Figure 3.19: The timing diagram for  $2 \times 3$  regular network without front end.

In addition, the group of equations are as follows:

$$\left. \begin{array}{l} \alpha_0 \omega T_{cp} = T_{f,m} \\ \alpha_1 z T_{cm} + \alpha_1 \omega T_{cp} = T_{f,m} \\ \alpha_2 z T_{cm} + \alpha_2 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_3) z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_4) z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_3 + \alpha_5) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \\ \sigma = \frac{z T_{cm}}{\omega T_{cp}} \\ \alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1 \quad \alpha_2 \quad \alpha_3 \quad \alpha_4 \quad \alpha_5 < 1 \end{array} \right\} \quad (3.97)$$

$$\alpha_1 z T_{cm} + \alpha_1 \omega T_{cp} = T_{f,m} \quad (3.98)$$

$$\alpha_2 z T_{cm} + \alpha_2 \omega T_{cp} = T_{f,m} \quad (3.99)$$

$$(\alpha_1 + \alpha_3) z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \quad (3.100)$$

$$(\alpha_1 + \alpha_4) z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \quad (3.101)$$

$$(\alpha_1 + \alpha_3 + \alpha_5) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \quad (3.102)$$

$$\sigma = \frac{z T_{cm}}{\omega T_{cp}} \quad (3.103)$$

$$\alpha_0 + \alpha_1 + \alpha_2 + \alpha_3 + \alpha_4 + \alpha_5 = 1 \quad (3.104)$$

$$0 < \alpha_0 \leq 1 \quad (3.105)$$

$$0 \leq \alpha_1 \quad \alpha_2 \quad \alpha_3 \quad \alpha_4 \quad \alpha_5 < 1 \quad (3.106)$$

$$\begin{bmatrix} 1 & 2 & 2 & 1 \\ 1 & -(\sigma + 1) & 0 & 0 \\ 1 & -\sigma & -(\sigma + 1) & 0 \\ 1 & -\sigma & -\sigma & -(\sigma + 1) \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.107)$$

The speedup is

$$Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0}$$

The simulation result is:



Figure 3.20: The data fraction deployed based on the radius value

### $2^n$ Regular network

Considering Fig. 3.7, the equations are demonstrated as follows:

$$\left. \begin{array}{l}
\alpha_0 \omega T_{cp} = T_{f,m} \\
\alpha_1 z T_{cm} + \alpha_1 \omega T_{cp} = T_{f,m} \\
\alpha_2 z T_{cm} + \alpha_2 \omega T_{cp} = T_{f,m} \\
(\alpha_1 + \alpha_3) z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \\
(\alpha_1 + \alpha_4) z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \\
(\alpha_1 + \alpha_3 + \alpha_5) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \\
\vdots \\
(\alpha_1 + \alpha_3 + \cdots + \alpha_{2 \times n-1}) z T_{cm} + \alpha_{2 \times n-1} \omega T_{cp} = T_{f,m} \\
\sigma = \frac{z T_{cm}}{\omega T_{cp}} \\
0 < \sigma < 1 \\
0 < \alpha_0 \leq 1 \\
0 \leq \alpha_1 \alpha_3 \cdots \alpha_{2 \times n-1} < 1
\end{array} \right\} \quad \begin{array}{l} (3.108) \\ (3.109) \\ (3.110) \\ (3.111) \\ (3.112) \\ (3.113) \\ (3.114) \\ (3.115) \\ (3.116) \\ (3.117) \\ (3.118) \\ (3.119) \end{array}$$

The matrix form for the group of equations are:

$$\begin{bmatrix}
1 & 2 & 2 & \cdots & 2 & 2 & 1 \\
1 & -(\sigma+1) & 0 & \cdots & 0 & 0 & 0 \\
1 & -\sigma & -(\sigma+1) & \cdots & 0 & 0 & 0 \\
1 & -\sigma & -\sigma & -(\sigma+1) & 0 & \cdots & 0 \\
1 & -\sigma & -\sigma & -\sigma & -(\sigma+1) & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\
1 & -\sigma & -\sigma & \cdots & -\sigma & -\sigma & -(\sigma+1)
\end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_5 \\ \vdots \\ \alpha_{2 \times n-3} \\ \alpha_{2 \times n-1} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (3.120)$$

$$-\sigma - 2 = \epsilon$$

$$\sigma^* - 2 = \beta$$

$$A = \begin{bmatrix}
1 & 2 & 2 & \cdots & 2 & 2 & 1 \\
1 & \sigma^* & 0 & 0 & 0 & 0 & 0 \\
1 & -\sigma & \sigma^* & 0 & 0 & 0 & 0 \\
1 & -\sigma & -\sigma & \sigma^* & 0 & 0 & 0 \\
1 & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 \\
1 & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 \\
1 & -\sigma & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^*
\end{bmatrix} \xrightarrow{\text{Column Reduction}} \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
1 & \beta & -2 & \cdots & -2 & -2 & -1 \\
1 & \epsilon & \beta & \cdots & 0 & 0 & 0 \\
1 & \epsilon & \epsilon & \beta & 0 & \cdots & 0 \\
1 & \epsilon & \epsilon & \epsilon & \beta & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\
1 & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & \beta
\end{bmatrix}$$

$$\begin{array}{c}
\text{Row Reduction} \\
\left[ \begin{array}{ccccccc}
1 & 0 & 0 & \cdots & 0 & 0 & 0 \\
0 & \beta & -2 & \cdots & -2 & -2 & -1 \\
0 & \epsilon & \beta & \cdots & 0 & 0 & 0 \\
0 & \epsilon & \epsilon & \beta & 0 & \cdots & 0 \\
0 & \epsilon & \epsilon & \epsilon & \beta & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \ddots \\
0 & \epsilon & \epsilon & \cdots & \epsilon & \epsilon & \beta
\end{array} \right] \\
\\
C = \left[ \begin{array}{ccccccc}
\sigma^* & 0 & 0 & 0 & 0 & 0 & 0 \\
-\sigma & \sigma^* & 0 & 0 & 0 & 0 & 0 \\
-\sigma & -\sigma & \sigma^* & 0 & 0 & 0 & 0 \\
-\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 & 0 \\
-\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 \\
-\sigma & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0
\end{array} \right]
\end{array}$$

$0 < \sigma < 1$ , then  $-2 < \sigma^* < -1$ , which means  $C$  is column linear independent, after column and row reduction.  $C^*$  is still full rank. So the matrix is full rank, that is,  $\det A \neq 0$  and  $\det A^* \neq 0$ .

So the speedup is

$$\text{Speedup} = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = \frac{\det A}{\det A^*} = \left| \frac{\det A}{(\sigma^*)^{n-1}} \right|$$

## **m\*n Regular network**

Referring to Fig. 3.10, we utilize  $\sigma^*$  to present the  $-(\sigma + 1)$ . The matrix closed-form is:

$$\begin{bmatrix} 1 & 2 & 3 & 4 & 5 & 4 & 3 & 2 & 1 \\ 1 & \sigma^* & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & -\sigma & \sigma^* & 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & \sigma^* & 0 & 0 & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 \\ 1 & -\sigma & \sigma^* \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_3 \\ \alpha_6 \\ \alpha_{10} \\ \alpha_{15} \\ \alpha_{19} \\ \alpha_{22} \\ \alpha_{24} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.121)$$

## **Data Injection on Boundary Processor**

Fig. 3.11 shows an example of boundary processor  $P_0$  receiving the  $L_1$ . The timing diagram for Fig. 3.11 is Fig. 3.21.



Figure 3.21: The timing diagram for  $3 \times 3$  boundary data injection on  $P_0$

$$\alpha_0 \omega T_{cp} = T_{f,m} \quad (3.122)$$

$$\alpha_1 z T_{cm} + \alpha_1 \omega T_{cp} = T_{f,m} \quad (3.123)$$

$$\alpha_2 z T_{cm} + \alpha_2 \omega T_{cp} = T_{f,m} \quad (3.124)$$

$$\alpha_3 z T_{cm} + \alpha_3 \omega T_{cp} = T_{f,m} \quad (3.125)$$

$$(\alpha_1 + \alpha_4) z T_{cm} + \alpha_4 \omega T_{cp} = T_{f,m} \quad (3.126)$$

$$(\alpha_2 + \alpha_5) z T_{cm} + \alpha_5 \omega T_{cp} = T_{f,m} \quad (3.127)$$

$$(\alpha_3 + \alpha_6) z T_{cm} + \alpha_6 \omega T_{cp} = T_{f,m} \quad (3.128)$$

$$(\alpha_1 + \alpha_4 + \alpha_7) z T_{cm} + \alpha_7 \omega T_{cp} = T_{f,m} \quad (3.129)$$

$$(\alpha_1 + \alpha_4 + \alpha_8) z T_{cm} + \alpha_8 \omega T_{cp} = T_{f,m} \quad (3.130)$$

$$\begin{bmatrix} 1 & 3 & 3 & 2 \\ 1 & -(\sigma+1) & 0 & 0 \\ 1 & -\sigma & -(\sigma+1) & 0 \\ 1 & -\sigma & -\sigma & -(\sigma+1) \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_4 \\ \alpha_7 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.135)$$



Figure 3.22: The fraction curve for  $3 \times 3$  boundary data injection on  $P_0$

## Data Injection on The Inner Grid Processor

The equations are:

$$\left\{ \begin{array}{l} \alpha_0\omega T_{cp} = T_{f,m} \\ \alpha_1zT_{cm} + \alpha_1\omega T_{cp} = T_{f,m} \\ \alpha_2zT_{cm} + \alpha_2\omega T_{cp} = T_{f,m} \\ \alpha_3zT_{cm} + \alpha_3\omega T_{cp} = T_{f,m} \\ \alpha_4zT_{cm} + \alpha_4\omega T_{cp} = T_{f,m} \\ (\alpha_1 + \alpha_5)zT_{cm} + \alpha_5\omega T_{cp} = T_{f,m} \\ (\alpha_2 + \alpha_6)zT_{cm} + \alpha_6\omega T_{cp} = T_{f,m} \\ (\alpha_3 + \alpha_7)zT_{cm} + \alpha_7\omega T_{cp} = T_{f,m} \\ (\alpha_4 + \alpha_8)zT_{cm} + \alpha_8\omega T_{cp} = T_{f,m} \\ \sigma = \frac{zT_{cm}}{\omega T_{cp}} \\ 0 < \sigma < 1 \\ 0 < \alpha_0 \leq 1 \\ 0 \leq \alpha_1 \quad \alpha_3 \quad \alpha_4 \quad \alpha_5 \quad \alpha_6 \quad \alpha_7 \quad \alpha_8 < 1 \end{array} \right. \begin{array}{l} (3.136) \\ (3.137) \\ (3.138) \\ (3.139) \\ (3.140) \\ (3.141) \\ (3.142) \\ (3.143) \\ (3.144) \\ (3.145) \\ (3.146) \\ (3.147) \\ (3.148) \end{array}$$

The matrix closed-form is:

$$\begin{bmatrix} 1 & 4 & 4 \\ 1 & -(\sigma + 1) & 0 \\ 1 & -\sigma & -(\sigma + 1) \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix} \quad (3.149)$$

The simulation result shows:



Figure 3.23: The timing diagram for 3\*3 inner grid injection  $P_0$

### 3.2.2 Toroidal network

We utilize the  $\sigma^*$  to present  $-(\sigma + 1)$ . The matrix closed-form is:

$$\begin{bmatrix} 1 & 4 & 8 & 10 & 8 & 4 & 1 \\ 1 & \sigma^* & 0 & 0 & 0 & 0 & 0 \\ 1 & -\sigma & \sigma^* & 0 & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma^* & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & \sigma^* & 0 & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -\sigma & \sigma^* & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -\sigma & sigma & \sigma^* \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \\ \alpha_5 \\ \alpha_6 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (3.150)$$

The simulation result is :

### 3.2.3 General Case

$$\begin{bmatrix} 1 & m_1 & m_2 & \cdots & m_{n-2} & m_{n-1} & m_n \\ 1 & -(\sigma + 1) & 0 & \cdots & 0 & 0 & 0 \\ 1 & -\sigma & -(\sigma + 1) & \cdots & 0 & 0 & 0 \\ 1 & -\sigma & -\sigma & -(\sigma + 1) & 0 & \cdots & 0 \\ 1 & -\sigma & -\sigma & -\sigma & -(\sigma + 1) & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\ 1 & -\sigma & -\sigma & \cdots & -\sigma & -\sigma & -(\sigma + 1) \end{bmatrix} \times \begin{bmatrix} \alpha_{l_0} \\ \alpha_{l_1} \\ \alpha_{l_2} \\ \alpha_{l_3} \\ \vdots \\ \alpha_{l_{n-1}} \\ \alpha_{l_n} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (3.151)$$



Figure 3.24: The data fraction deployed based on the radius value

The  $m_1, m_2, \dots, m_n$  are the number of processors on the  $level_1, level_2, \dots, level_n$ . Also, the  $\alpha_{l0}, \alpha_{l1}, \dots, \alpha_{ln}$  are corresponding workload fraction.

### 3.3 Comparison Result Between Front End Processor and Without Front End Processor

In the legend of figures, we use  $F$  to present the front-end situation and use  $NF$  to presents the without front-end processors setting, for example  $F\alpha_0$  means the  $\alpha_0$  data fraction deployed to  $P_0$ , if the processor has front-end.  $NF\alpha_0$  means the  $\alpha_0$  data fraction deployed to  $P_0$ , if the processor is without front-end setting.

#### 3.3.1 Regular network

##### On the Corner Processor

Fig. 3.25 says that  $P_0$  takes more assigned task in without front-end scenario than front-end processor situation. As the  $\sigma$  value goes up, the fractions are deployed to the deeper levels dropping down. In the limit condition, for example,  $\sigma = 1$ , there is no data transmitted to  $P_3$  in the front-end assumption, yet in the without front-end situation, there is still about 10% data fraction are communicated to  $P_3$ .



Figure 3.25: The comparing result between front-end processor with without front-end processor in  $2^*2$  regular network

Fig. 3.26 says that  $P_0$  takes more assigned task in without front-end scenario than front-end processor situation. As the  $\sigma$  value goes up, the fractions are deployed to the deeper levels degrading. In the limit condition, for example,  $\sigma = 1$ , there is no data transmitted to *level*<sub>3</sub>, that is,  $P_5$  in the front-end assumption. Yet in the without front-end situation, there is still about 5% data fraction is communicated to  $P_5$ .

Fig. 3.27 says that  $P_0$  takes more assigned task in without front-end scenario than front-end processor situation. As the  $\sigma$  value goes up, the fractions



Figure 3.26: The comparing result between front-end processor with without front-end processor in  $2^*4$  regular network

are deployed to the deeper levels dropping down. In the limit condition, for example,  $\sigma = 1$ , there is no data transmitted to  $level_3$ , that is,  $P_7$  and  $P_8$  in the front-end assumption. Yet in the without front-end situation, there is still about 5% data fraction is communicated to  $P_7$  and  $P_8$ .

Fig. 3.28 says that  $P_0$  takes more assigned task in without front-end scenario than front-end processor situation. As the  $\sigma$  value goes up, the fractions are deployed to the deeper levels dropping down. In the limit condition, for example,  $\sigma = 1$ , there is no data transmitted to  $level_2$ , that is,  $P_5$ ,  $P_6$ ,  $P_7$  and  $P_8$ .



Figure 3.27: The comparing result between front-end processor with without front-end processor in  $3 \times 3$  regular network injection on boundary processor

$P_8$  in the front-end assumption. Yet in the without front-end situation, there is still about 5% data fraction is communicated to  $P_5$ ,  $P_6$ ,  $P_7$  and  $P_8$ .

Comparing with Fig. 3.27,  $P_0$  takes less workload in inner grid position than boundary data injection. The reason is there are 4 neighbor processors on the  $level_1$ , yet there is solely three processors on  $level_1$  in boundary plan. In other words, the  $level_1$  plays a critical role in data deployed.



Figure 3.28: The comparing result between front-end processor with without front-end processor in  $3 \times 3$  regular network injection on inner grid processor

### 3.3.2 Data Injection on Boundary Processor

Talking to Fig. 3.9, the data injection happens on boundary processor  $P_2$ .

Fig. 3.29 shows that  $\sigma > 0.2$  the speedup simulation effect is impacted obviously. If  $\sigma < 0.1$ , the number of cores has linear impact on the speedup performance. If the number of cores is bigger than 5, the bottom effect with front end, the cluster at least get about 4 times speedup. It is because, a boundary processor has 3  $level_1$  processors.



Figure 3.29: Sensitivity analysis result of  $3 \times 8$  regular network and the injection position on boundary processor  $P_2$

### 3.3.3 Data Injection on Inner Grid Processor

Scaling Fig. 3.14, to 25 processors, the simulation result says:

If the number of processor large than 5, the cluster equivalence computation ability is at least 5 time speedup. It is the inner grid has 4 *level<sub>1</sub>* neighbor processors.



Figure 3.30: Sensitivity analysis result of data injection position on inner grid processor

## 3.4 Without Front End Scenario

### 3.4.1 Data Injection on Corner Processor

The simulation result of sensitivity analysis of  $2 * n$  regular network Fig. 3.7 is as follows:

The figure illustrates that if  $\sigma < 0.1$ , the number of cores grow up, the speedup efficiency is likely linear increasing. Alternatively speaking, if  $\sigma < 0.1$ , the number of cores dominate the efficiency. If the  $\sigma > 0.2$ , the efficiency drops dramatically. That is, the  $\sigma$  value plays more critical role in the speedup simulation. This important investigation benefit the multi-source assignment problem. In addition, if the number of cores is bigger than 4, the bottom speedup effect is about 3 time.



Figure 3.31: Sensitivity analysis result of  $2 \times 10$  regular network result

### 3.4.2 Data Injection on Boundary Processor



Figure 3.32: Sensitivity analysis result of  $3 \times 8$  regular network result

Fig. 3.32 tells the simulation result for the data injection on the boundary.

### 3.4.3 Data Injection on Inner Grid Processor



Figure 3.33: Sensitivity analysis result of data injection position on inner grid processor

Fig. 3.33 displays the simulation result for the data injection position  $P_{12}$ . The best speedup is 24, which happens on the  $\sigma < 0.1$ . If  $\sigma < 0.1$ , the speedup linearly grows up.

### 3.5 Comparison Result With Corner Processor and Inner Grid Processor

For a  $5 * 5$  regular network, the inner grid position is  $P_{12}$  and the corner data injection position is  $P_0$ . The comparing result is Fig. 3.34.



Figure 3.34: Speedup difference between corner injection and inner grid injection

Generally speaking, Fig. 3.34 says the inner grid position scenario has better performance than the corner injection option. If the grid node is 25 and  $\sigma = 0.5$ , the speedup difference is largest, which is 4.

### 3.6 Comparison Result With Front End and Without Front End



Figure 3.35: Speedup difference between front end and without front end in 5\*5 regular network

Fig. 3.35 shows the speedup difference between the front end situation and without front end scenario.

### **3.7 Store and Froward Switching**

In this chapter, we mainly discuss the virtual cut through [4] switching. In addition, the store and froward [9] schema is a mature data processing technique. In future works, we will consider this switching.

# Chapter 4

## Toroidal Network

### 4.0.1 Torus mesh

According to the paper[10], there are three different torus mesh. In this paper, our intent is not to propose one model to "fit all" problems but rather to indicate one normal case. Other situations can be extended using the same rule. The toroidal network is also a regular complex in that each polygonal face has the same number of links and each node is connected to the same number of links.

We consider the toroidal network in Fig. 4.1 and Fig. 4.2.

Based on the rule, this question is transferred to that given a  $D_i$  and I find the number of instances.

- We assume the toroidal processor number is  $m * n$ , in other words, the width is  $m$  and the height is  $n$ .
- $L$  to present the load injection.
- $L_x$  is the  $L$ 's  $X$  coordinate.



Figure 4.1: The rectangular toroidal network



Figure 4.2: The rectangular toroidal network

- $L_y$  is the  $L$ 's  $Y$  coordinate.
- $D_i$  to present the shortest Manhattan distance between the grid node  $i$  and  $L$ .
- $D_x$  is the  $X$  coordinate shortest Manhattan distance.
- $D_y$  is the  $Y$  coordinate shortest Manhattan distance.
- $N_x$  is the node  $i$ 's  $X$  coordinate.

- $N_y$  is the node  $i$ 's  $Y$  coordinate.

$$\left\{ \begin{array}{l} D = D_x + D_y \\ D_x = \min\{\|N_x - L_x\|, m - \|N_x - L_x\|\} \\ D_y = \min\{\|N_y - L_y\|, n - \|N_y - L_y\|\} \end{array} \right. \quad (4.1)$$

$$\left\{ \begin{array}{l} D = D_x + D_y \\ D_x = \min\{\|N_x - L_x\|, m - \|N_x - L_x\|\} \\ D_y = \min\{\|N_y - L_y\|, n - \|N_y - L_y\|\} \end{array} \right. \quad (4.2)$$

$$\left\{ \begin{array}{l} D = D_x + D_y \\ D_x = \min\{\|N_x - L_x\|, m - \|N_x - L_x\|\} \\ D_y = \min\{\|N_y - L_y\|, n - \|N_y - L_y\|\} \end{array} \right. \quad (4.3)$$

In the  $m*n$  ( $m = 6, n = 6$ ) toroidal networks,  $L$  happens on grid position (4, 2). We calculate the  $D_i$  matrix Table 4.1 by breadth first search(**BFS**) algorithm.

|  | (1,1) | (1,2) | (1,3) | (1,4) | (1,5) | (1,6) |  |
|--|-------|-------|-------|-------|-------|-------|--|
|  | (2,1) | (2,2) | (2,3) | (2,4) | (2,5) | (2,6) |  |
|  | (3,1) | (3,2) | (3,3) | (3,4) | (3,5) | (3,6) |  |
|  | (4,1) | (4,2) | (4,3) | (4,4) | (4,5) | (4,6) |  |
|  | (5,1) | (5,2) | (5,3) | (5,4) | (5,5) | (5,6) |  |
|  | (6,1) | (6,2) | (6,3) | (6,4) | (6,5) | (6,6) |  |

Figure 4.3: The  $m*n$  toroidal network and the data injection is  $P_{4,2}$

| $D_i$ | Number |
|-------|--------|
| 0     | 1      |
| 1     | 4      |
| 2     | 8      |
| 3     | 10     |
| 4     | 8      |
| 5     | 4      |
| 6     | 1      |

Table 4.1: The processor number of various  $D_i$

And the  $D_i$  matrix as follow in table Table 4.1

The matrix closed-form is

$$\begin{bmatrix} 1 & 4 & 8 & 10 & 8 & 4 & 1 \\ 1 & -1 & 0 & 0 & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & \sigma & 1 & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & \sigma & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \\ \alpha_5 \\ \alpha_6 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (4.4)$$

The simulation result is :



Figure 4.4: The data fraction deployed based on the radius value

From Fig. 4.4, we see as the  $\sigma$  value grows, more and more workload are assigned to the  $P_{4,2}$  and its one hop neighbors. That is, as the communication ability goes down, the economical method is to process the data locally.

#### 4.0.2 Toroidal

This figure says the simulation difference result happens between front-end processors and without front-end processors.



Figure 4.5: The comparing result between front-end processor with without front-end processor in  $6 \times 6$  regular mesh injection on inner grid processor

## 4.1 Front End Scenario

### 4.1.1 Data Injection on Corner

From Chapter 2, we know the speedup is

$$Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = -\det A$$

The simulation result of sensitivity analysis of  $2 * n$  regular mesh Fig. 3.7 is as follows:



Figure 4.6: Sensitivity analysis result of  $2 \times 10$  regular mesh result

The figure illustrates that if  $\sigma < 0.1$ , the number of cores grow up, the speedup efficiency is likely linear increasing. Alternatively speaking, if  $\sigma < 0.1$ , the number of cores dominate the efficiency. If the  $\sigma > 0.2$ , the efficiency drops dramatically. That is, the  $\sigma$  value plays more critical role in the speedup simulation. This important investigation benefit the multi-source assignment problem. In addition, if the number of cores is bigger than 4, the bottom speedup effect is about 3 time.

| $D_i$ | Number |
|-------|--------|
| 0     | 1      |
| 1     | 4      |
| 2     | 8      |
| 3     | 8      |
| 4     | 4      |

Table 4.2: The processor number of various  $D_i$

#### 4.1.2 Toroidal

The  $level_i$  table shows Table 4.2:

So the simulation result illustrates in Fig. 4.7



Figure 4.7: Sensitivity analysis result of data injection position on inner grid processor

## 4.2 Comparison Result Between Regular Mesh and Toroidal

Considering a regular mesh Fig. 3.10, the best position for inner grid data injection is  $P_{12}$ . Other data injection position, for example  $P_8$ ,  $P_{13}$  they don't have the same high speedup impact. Yet, for toroidal  $5 \times 5$  regular mesh, each position is equal, which means each position hit the best choice for regular mesh. Fig. 3.34 explores the comparison result between the toroidal and corner scenario difference.

## **Chapter 5**

### **Hypercube Network**

Chapter 2 and Chapter 3 address the equal computation capability problem. Further, we extend the single source assignment to multi-source assignment problem[11].

According to each processor, we concentrate on the processors' geographical location, data fraction assigned.

- The processors are front end.
- The processors execute the cut through switching.
- The network is homogeneous.

#### **5.1 Even Data Fraction**

We assume the data fraction is even. For example, the workload is unit 1 and there are  $n$  different data injection options. So each data injection is assigned  $\frac{1}{n}$  workload.

### 5.1.1 Subgraph Data Injection

If the data injection consists of a subgraph of the whole graph, we use  $G_{data}$  to present it.



Figure 5.1: Data injection consists of a subgraph of  $G$



Figure 5.2: Data injection consists of a subgraph of  $G$

Fig. 5.1 and Fig. 5.2 illustrate two situations that the data injections consist of subgraph  $G_{data}$  of the whole graph  $G$ .

The heuristic algorithm tackling this scenario is divided into two stages,  $global_s$  and  $local_s$  stages.

- $global_s$  : Considering the cluster of data injection is a whole data source.  
Utilize the equal computation technique in Chapter 2.
- $local_s$  : Re-distributing workload between the data injection processors.

For example, Fig. 5.1's ***flow matrix*** is :

$$\begin{bmatrix} 4 & 8 & 12 & 10 & 6 & 2 \\ 1 & -1 & 0 & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_0 \\ \alpha_1 \\ \alpha_2 \\ \alpha_3 \\ \alpha_4 \\ \alpha_5 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} \quad (5.1)$$

The simulation result illustrates as follows:

This simulation says the best performance happens on  $\sigma \leq 0.05$ , it hits about 40 times speedup. If  $\sigma \approx 1$ , the next work achieve almost 12 times performance.



Figure 5.3: Speedup vs  $\sigma$

### 5.1.2 Individual Data Injection Positions

Considering another situation, the data injection doesn't consist of a subgraph of  $G$ . The job scheduling is considered in BlueGene [12].

Jia's paper [13] inspire me to utilize Voronoi tool to tackle this problem. In the dividable load application, for example, big file transmission, GPU scientific computation or Hadoop job, the total job time depends on the last fraction finish time. So in this section, our goal is to use less processors to obtain the same or similar finish time. The critical idea comes from the investigation from Chapter 2, which means, if  $\sigma > 0.3$ , the speedup performance

drops quickly.



Figure 5.4: 10 Voronoi Cells

Fig. 5.4 shows 10 voronoi cells division and Fig. 5.5 figures out the speedup of each cells.



Figure 5.5: 10 Voronoi Cells's speedup curves

The heuristic algorithm is named as ***Reduced Voronoi Division Algorithm:***

Fig. 5.6 and Fig. 5.7 show the algorithm's result and speedup curve. That is, we obtain the same running time yet occupy 30% processors.

Fig. 5.5 shows the ratio  $\frac{\text{maxspeedup}}{\text{minspeedup}} = \frac{500}{100} = 5$  and Fig. 5.7 shows the ratio is  $\frac{\text{maxspeedup}}{\text{minspeedup}} = \frac{270}{100} = 2.7$ . It displays the computation power is more balanced, further, the cluster finish the task with the same time by less processors.

After 1000 round random sampling the data injection position, we obtain the average saved processors result.

---

**Algorithm 1** Reduced Voronoi Division Algorithm(RVDA)

---

Calculate  $n$  Voronoi cells with Manhattan distance

Calculate  $n$  radius  $R_i$  in  $n$  Voronoi cells

Set  $R_{min} = \min R_i$

**while**  $0 < i \leq n$  **do**

    Calculate the Reduced Voronoi cells by setting the  $R_i = R_{min}$  in each cell.

    Calculate Voronoi cell's flow matrix  $A_i$ .

$i = i + 1$

**end while**

Display each Reduced Voronoi Cells

Illustrate each Reduced Voronoi Cells' Speedup curves

---

From Fig. 5.8, it shows the average percentage of saved processor is about 35%.



Figure 5.6: 10 Voronoi Cells

In real case, it usually combine the subgraph situation and sparse scenario. We will discuss it in future work.

### 5.1.3 Toroidal



Figure 5.7: 10 Voronoi Cells's speedup curves

Fig. 5.9 [14] provides a torus Voronoi method, which extends the original domain to 8 copy and calculate the Voronoi Diagram as planner algorithm [15]. Then the corner part is the torus Voronoi diagram.



Figure 5.8: Reduced Voronoi Division Algorithm average processors' percentage

In the reduced model, reduced toroidal Voronoi diagram save 27% processors hit the same processing capacity. The ratio of original method is about  $\frac{490}{98} = 5$ , after the reduced action, the ratio is  $\frac{290}{98} \approx 2.96$ . That is the reduced heuristic algorithm obtaining more balanced computation capacity distribution.



Figure 5.9: How to calculate torus Voronoi Diagram



Figure 5.10: Initial Voronoi Diagram



Figure 5.11: Torus Voronoi Diagram



Figure 5.12: Voronoi Diagram Casting to the torus model



Figure 5.13: Voronoi Diagram Casting to the torus model



Figure 5.14: Torus Reduced Voronoi Diagram Casting to the Torus Model



Figure 5.15: Torus Reduced Voronoi Diagram

# Chapter 6

## General Network

### 6.0.1 General Case

$$\begin{bmatrix} 1 & m_1 & m_2 & \cdots & m_{n-2} & m_{n-1} & m_n \\ 1 & -1 & 0 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & 1 & \cdots & 0 & 0 & 0 \\ 0 & \sigma - 1 & \sigma & 1 & 0 & \cdots & 0 \\ 0 & \sigma - 1 & \sigma & \sigma & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & \ddots & \ddots & \\ 0 & \sigma - 1 & \sigma & \cdots & \sigma & \sigma & 1 \end{bmatrix} \times \begin{bmatrix} \alpha_{l_0} \\ \alpha_{l_1} \\ \alpha_{l_2} \\ \alpha_{l_3} \\ \vdots \\ \alpha_{l_{n-1}} \\ \alpha_{l_n} \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix} \quad (6.1)$$

The  $m_1, m_2, \dots, m_n$  are the number of processors on the  $level_1, level_2, \dots, level_n$ . Also, the  $\alpha_{l_0}, \alpha_{l_1}, \dots, \alpha_{l_n}$  are corresponding workload fraction.

Finally, the speedup is:

$$Speedup = \frac{T_{f,0}}{T_{f,n}} = \frac{\omega T_{cp}}{\alpha_0 \omega T_{cp}} = \frac{1}{\alpha_0} = -\det A$$



## **Chapter 7**

### **Conclusion and Future work**

- Slice OMT
- Different Data Fraction Voronoi
- Different Data Fraction OMT

## Bibliography

- [1] M. Moges, D. Yu, and T. G. Robertazzi, “Grid scheduling divisible loads from two sources,” *Computers & Mathematics with Applications*, vol. 58, no. 6, pp. 1081–1092, 2009.
- [2] T. G. Robertazzi, “Processor equivalence for daisy chain load sharing processors,” *IEEE Transactions on Aerospace and Electronic Systems*, vol. 29, no. 4, pp. 1216–1221, 1993.
- [3] X. Liu, H. Zhao, and X. Li, “Scheduling divisible workloads from multiple sources in linear daisy chain networks.”
- [4] P. Kermani and L. Kleinrock, “Virtual cut-through: A new computer communication switching technique,” *Computer Networks (1976)*, vol. 3, no. 4, pp. 267–286, 1979.
- [5] V. Bharadwaj, *Scheduling divisible loads in parallel and distributed systems*, vol. 8. John Wiley & Sons, 1996.
- [6] J. T. Hung and T. G. Robertazzi, “Switching in sequential tree networks,” *IEEE Transactions on Aerospace and Electronic Systems*, vol. 40, no. 3, pp. 968–982, 2004.
- [7] V. Bharadwaj, D. Ghose, and T. G. Robertazzi, “Divisible load theory: A new paradigm for load scheduling in distributed systems,” *Cluster Computing*, vol. 6, no. 1, pp. 7–17, 2003.
- [8] C. F. Gamboa and T. Robertazzi, “Simple performance bounds for multicore and parallel channel systems,” *Parallel Processing Letters*, vol. 21, no. 04, pp. 439–460, 2011.
- [9] G. R. Kanthraj, C. R. Srinivas, *et al.*, “Store and forward teledermatology,” *Indian Journal of Dermatology, Venereology, and Leprology*, vol. 73, no. 1, p. 5, 2007.

- [10] T. G. Robertazzi, “Toroidal networks,” *IEEE Communications Magazine*, vol. 26, no. 6, pp. 45–50, 1988.
- [11] J. Jia, B. Veeravalli, and J. Weissman, “Scheduling multisource divisible loads on arbitrary networks,” *IEEE Transactions on Parallel and Distributed Systems*, vol. 21, no. 4, pp. 520–531, 2010.
- [12] E. Krevat, J. G. Castaños, and J. E. Moreira, “Job scheduling for the bluegene/l system,” in *Workshop on Job Scheduling Strategies for Parallel Processing*, pp. 38–54, Springer, 2002.
- [13] S. Fortune, “Voronoi diagrams and delaunay triangulations,” in *Computing in Euclidean geometry*, pp. 225–265, World Scientific, 1995.
- [14] C. I. Grima and A. Márquez, *Computational Geometry on Surfaces: Performing Computational Geometry on the Cylinder, the Sphere, the Torus, and the Cone*. Springer Science & Business Media, 2013.
- [15] S. Fortune, “A sweepline algorithm for voronoi diagrams,” *Algorithmica*, vol. 2, no. 1-4, p. 153, 1987.