
DisCostiC: Distributed Cost in Cluster



Description

A cross-architecture resource-based parallel simulation framework that can efficiently predict the performance of real or hypothetical massively parallel MPI programs on current and future highly parallel heterogeneous systems.✨

The runtime predictions are carried out in a controlled environment through straightforward, portable, scalable, and efficient simulations.⭐

The simulation framework automates analytical first-principles performance models at full system scale, covering cores, chips, nodes, networks, and clusters, their mutual interactions, and inherent bottlenecks such as memory bandwidth.🌟

The application blueprint is written in a Domain-Specific Embedded Language (DSEL) built on modern C++. It keeps the key concepts and elements readable for domain experts and accurately encodes inter-process dependencies, without importing unknown side effects from the actual systems.

Framework workflow

- Application model

    * Domain knowledge of the application expressed in the DisCostiC DSEL language provides information on call dependencies and properties for computation (data access pattern, flop count) and communication (volume, mode and protocol)

- Analytic first-principle performance models
    
    * Computation models: Fundamental analytic Roofline and refined ECM models including memory bandwidth bottleneck concept at the system level
    * Communication models: Fundamental latency-bandwidth and refined LogGP models

- System models
    
    * Cluster, network and node models for the full topology of the target system, including multiple chips, nodes, network interfaces and clusters
 
- DisCostiC combines all of these models to simulate the runtime cost of distributed applications; the basic closed forms of the underlying models are sketched below.
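For orientation, the basic closed forms of these textbook models are listed below (W: flop count, V: transferred data volume in bytes, n: message size in bytes, P_peak: peak floating-point performance, b_mem: memory bandwidth). DisCostiC automates refined variants of these models, so treat the formulas only as a sketch of the underlying ideas:

    T_Roofline   = max( W / P_peak , V / b_mem )
    T_ECM        = max( T_OL , T_nOL + T_L1L2 + T_L2L3 + T_L3Mem )    (simple non-overlapping form)
    T_Hockney(n) = latency + n * inverse_bandwidth
    T_LogGP(n)   = L + 2*o + (n - 1)*G                                (single point-to-point message)

The T_* terms of the ECM formula are the in-core and data-transfer cycle counts that also appear in the COMP kernel mode described later.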

Advantages over existing tools

- DSEL approach for creating the application blueprint

    * Accurately encoding inter-process dependencies without introducing unknown consequences from the actual systems 

- A full-scale first-principle-model-based simulator 

    * Automating analytical first-principle performance models on the entire hierarchy of parallel systems

- Efficient and fast

    * No intermediate trace files are required, unlike offline, trace-driven tools
    * No high memory requirement, unlike online tools that execute code on the host architecture

- Last but not least, an open-source, lightweight, low-entry-cost simulator enabling model-based design-space exploration

:octocat: Compilation and build

First, use the following command to clone the git repository:

git clone git@github.com:RRZE-HPC/DisCostiC-Sim.git && cd DisCostiC-Sim

Installation

Before proceeding, make sure the environment is prepared for the compilation. The installation steps are listed below:

module load python git intel intelmpi cmake itac vampir
conda create --name XYZ
conda activate XYZ
conda install pip 
cmake -DCMAKE_INSTALL_PREFIX=~/.local . && make all install

One way to check the installation is to print the version and help of DisCostiC using ./discostic --version and ./discostic --help.

Configuration settings

The test folder of DisCostiC offers multiple MPI-parallelized programs (benchmark_kernel), with separately configurable functionality for computation (kernel_mode) and communication (exchange_mode). For illustration, a few examples are given below:

| benchmark_kernel | Description |
|---|---|
| HEAT | Two-dimensional five-point Jacobi kernel with communication |
| SOR | Gauss-Seidel Successive Over-Relaxation solver with communication |
| DMMM | Dense Matrix Matrix Multiplication kernel with communication |
| DMVM | Dense Matrix Vector Multiplication kernel with communication |
| DMVM-TRANSPOSE | Dense Matrix Transpose Vector Multiplication kernel with communication |
| HEATHEAT | Back-to-back two HEAT kernels with communication |
| HEATSOR | Back-to-back HEAT and SOR kernels with communication |
| HEATDIVIDE | Back-to-back HEAT and DIVIDE kernels with communication |
| HPCG | High Performance Conjugate Gradient |
| STENCIL-3D-7PT | Three-dimensional seven-point stencil kernel with communication |
| STENCIL-3D-27PT | Three-dimensional twenty-seven-point stencil kernel with communication |
| STENCIL-UXX | UXX stencil kernel with communication |
| STENCIL-3D-LONGRANGE | Three-dimensional long-range stencil kernel with communication |
| STENCIL-1D-3PT | One-dimensional three-point stencil kernel with communication |
| STREAM | STREAM Triad kernel with communication |
| SCHOENAUER | Schoenauer Triad kernel with communication |
| SCHOENAUER-DIV | Schoenauer divide Triad kernel with communication |
| WAXPY | WAXPY kernel with communication |
| DAXPY | DAXPY kernel with communication |
| SUM | SUM kernel with communication |
| VECTOR-SUM | Vector SUM kernel with communication |
| ADD | ADD kernel with communication |
| DIVIDE | DIVIDE kernel with communication |
| SCALE | SCALE kernel with communication |
| COPY | COPY kernel with communication |
| KAHAN-DOT | Kahan dot-product kernel with communication |
| SCALAR-PRODUCT | Scalar product kernel with communication |

The kernel_mode setting below is offered purely for the flexibility of the framework; the simulator's prediction is unaffected by which kernel_mode is selected. Further explanation of this mode can be found in the Essential DisCostiC routines section.

| kernel_mode | Integrated tool | Description |
|---|---|---|
| COMP | no external tool integrated | Directly embeds the pre-recorded single-core ECM performance model data of the computational kernel into the simulator. |
| LBL | no external tool integrated | Reads the pre-recorded ECM performance model data of the computational kernel from an external file in the nodelevel/configs folder. |
| SRC | Kerncraft integrated | Directly embeds the source code of the computational kernel into the simulator. |
| FILE | Kerncraft integrated | Reads the source code of the computational kernel from an external file in the nodelevel/kernels folder. |

The following exchange_mode settings are provided in LBL mode to enable model-based exploration by experimenting with various communication patterns in MPI applications.

| exchange_mode | Description |
|---|---|
| message_size | Size of the message to be exchanged, in bytes. |
| step_size | Step size of the message exchange as an int (1: distance-one communication, 2: distance-two communication, ...). |
| direction | Direction of the message exchange as an int (1: uni-directional upward shift in the positive direction only, 2: bi-directional upward and downward shifts in both the positive and negative directions). |
| periodic | Enables or disables periodic communication as a bool (0: false, 1: true). |

The config.cfg file can be edited to select the benchmark_kernel and kernel_mode. More information about this config.cfg file is available in the DisCostiC help documentation.
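As an orientation only, such a selection might look like the following sketch. The exact syntax, section layout and complete set of keys are defined by the config.cfg shipped with the repository (see the DisCostiC help section below), so treat this snippet as an assumption rather than a template:

    benchmark_kernel = HEAT     # any kernel from the test folder
    kernel_mode      = FILE     # FILE, SRC, LBL or COMP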

Compilation

Compilation offers the following choices for the data format of the output report and for tracing the simulator's own implementation. An executable is generated after compilation.

| Command | Description |
|---|---|
| make | JSON data format, without ITAC profiling of the simulator implementation. |
| make otf2 | Both JSON and OTF2 data formats, without ITAC profiling of the simulator implementation. |
| make trace_MPI | JSON data format plus the standard ITAC tracing mode of the simulator implementation, with information about MPI calls (enabled flag: -trace). |
| make trace_all | JSON data format plus the verbose ITAC tracing mode of the simulator implementation, with information about both user-defined and MPI calls (enabled flags: -trace -tcollect). |

Run

In the batch script, the number of simulator processes is configured as the number of simulated processes plus one. When running the batch script on any system, ITAC profiling can be enabled or disabled, which determines whether the simulator's own trace is dumped to ITAC:

| Command | Description |
|---|---|
| sbatch Run_Simulation.sh | Runs the simulation without tracing the simulator's own implementation with ITAC. |
| sbatch Run_Simulation_ITAC.sh | Dumps the simulator's own trace to ITAC to investigate the implementation of the simulator. |

The Run_Simulation_ITAC.sh script only generates a single STF file (simulation.single.stf) in the main directory due to the export of the following variables:

export VT_LOGFILE_FORMAT=SINGLESTF
export VT_LOGFILE_NAME=simulation
export VT_LOGFILE_PREFIX=$SLURM_SUBMIT_DIR
export VT_FLUSH_PREFIX=/tmp

The formats, names and locations of the output files can be adjusted as desired through these environment variables.

Clean and uninstall

| Command | Description |
|---|---|
| make clear | Cleans up the working directory by removing all unnecessary DisCostiC files, such as *.dms, *.otf and *.csv files. |
| make uninstall | Uninstalls the DisCostiC framework, including installed files and CMake-specific files. |

DisCostiC output

1. Web interface with the Google Chrome browser (DMS file format)

Upon completion of the run, DisCostiC generates a report referred to as DisCostiC.dms. DisCostiC.dms is a plain JSON object data format file that can be viewed with the Google Chrome browser. Open the following URL in Chrome to load the generated JSON file:

chrome://tracing

2. Graphical user interface with Vampir (OTF2 file format)

Upon completion of a run that was compiled with make otf2, DisCostiC generates a report called DisCostiC/traces.otf2. This file is in the OTF2 object data format and can be viewed with third-party tools such as ITAC and Vampir. It is produced by the ChromeTrace2Otf2 converter, which translates the JSON object data format file (DisCostiC.dms) into the OTF2 file (DisCostiC/traces.otf2).

Use the following command to open the OTF2 trace file (DisCostiC/traces.otf2) in Vampir:

vampir DisCostiC/traces.otf2

3. Graphical user interface with ITAC (STF file format)

To convert an OTF2 file to a single Structured Trace File (STF) and open it in ITAC, invoke the Intel Trace Analyzer GUI and follow the steps below:

(1) traceanalyzer &
(2) Go to File > Open; in the Files of type field, select Open Trace Format, navigate to the `DisCostiC/traces.otf2` file, and double-click to open it
(3) The OTF2-to-STF conversion dialog appears. Review the available fields and checkboxes, and click Start to begin the conversion. The OTF2 file is converted to STF (DisCostiC/traces.otf2.single.stf) and can then be viewed in the Intel Trace Analyzer.

4. Statistical or log data

    ----------------------------------------------------------------
    DisCostiC
    ----------------------------------------------------------------
    Full form: Distributed Cost in Cluster
    Version: v1.0.0
    Timestamp: Tue Jan 20 11:16:42 2024
    Author: Ayesha Afzal <ayesha.afzal@fau.de>
    Copyright: Β© 2024 HPC, FAU Erlangen-Nuremberg. All rights reserved

    HEAT program in FILE mode of kernel is simulated for 100000 iterations and 72 processes

    ----------------------------------------------------------------
    COMPUTATIONAL PERFORMANCE MODEL
    ----------------------------------------------------------------
    Code simulating on IcelakeSP_Platinum-8360Y chip of cores: {1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 , 11 , 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 , 20 , 21 , 22 , 23 , 24 , 25 , 26 , 27 , 28 , 29 , 30 , 31 , 32 , 33 , 34 , 35 , 36 }
    Single-precision maximum chip performance [GF/s]: {614.4 , 1228.8 , 1843.2 , 2457.6 , 3072 , 3686.4 , 4300.8 , 4915.2 , 5529.6 , 6144 , 6758.4 , 7372.8 , 7987.2 , 8601.6 , 9216 , 9830.4 , 10444.8 , 11059.2 , 11673.6 , 12288 , 12902.4 , 13516.8 , 14131.2 , 14745.6 , 15360 , 15974.4 , 16588.8 , 17203.2 , 17817.6 , 18432 , 19046.4 , 19660.8 , 20275.2 , 20889.6 , 21504 , 22118.4 }
    Double-precision maximum chip performance [GF/s]: {307.2 , 614.4 , 921.6 , 1228.8 , 1536 , 1843.2 , 2150.4 , 2457.6 , 2764.8 , 3072 , 3379.2 , 3686.4 , 3993.6 , 4300.8 , 4608 , 4915.2 , 5222.4 , 5529.6 , 5836.8 , 6144 , 6451.2 , 6758.4 , 7065.6 , 7372.8 , 7680 , 7987.2 , 8294.4 , 8601.6 , 8908.8 , 9216 , 9523.2 , 9830.4 , 10137.6 , 10444.8 , 10752 , 11059.2 }

    ----------------------------------------------------------------
    COMMUNICATION PERFORMANCE MODEL
    ----------------------------------------------------------------
    Code simulating on InfiniBand with IntelMPI library
    Hockney model parameters (intra chip): 41185.7 [ns] latency, 0.079065 [s/GB] inverse bandwidth 
    Hockney model parameters (inter chip): 64567.6 [ns] latency, 0.124122 [s/GB] inverse bandwidth 
    Hockney model parameters (inter node): 47628.7 [ns] latency, 0.091516 [s/GB] inverse bandwidth 
    Hockney model parameters (inter cluster): 47628.7 [ns] latency, 0.091516 [s/GB] inverse bandwidth 
    LogGP model additional parameters (intra chip): 24429.9 [ns] o, 199.008 [ns] g, 0.065 [ns] O
    LogGP model additional parameters (inter chip): 39748.3 [ns] o, 338.249 [ns] g, 0.085 [ns] O
    LogGP model additional parameters (inter node): 47938.9 [ns] o, 4980.51 [ns] g, 0.084 [ns] O
    LogGP model additional parameters (inter cluster): 47938.9 [ns] o, 4980.51 [ns] g, 0.084 [ns] O
    Eager threshold: 65535 bytes

    ----------------------------------------------------------------
    Simulation runtime in DisCosTiC: 15.7239 [s]
    ----------------------------------------------------------------

    ----------------------------------------------------------------
    FULL APPLICATION PERFORMANCE (for all MPI processes)
    rank                         runtime [s]
    ----------------------------------------------------------------
    0                              2154.83
    1                              2154.83
    .                              .......
    .                              .......
    .                              .......
    71                             2154.67
    ----------------------------------------------------------------
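As a quick plausibility check of the communication parameters above: 0.091516 s/GB is numerically the same as 0.091516 ns/byte, so the pure Hockney estimate for a single 8 KiB inter-node message (well below the eager threshold) is

    T(8192 bytes) = 47628.7 ns + 8192 * 0.091516 ns = 48378.4 ns, i.e. about 48.4 microseconds

This back-of-the-envelope number ignores protocol effects and the additional LogGP overhead terms.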

System model

The system parameters, either hypothetical or actual, can be tuned by editing the config.cfg file in the current directory. For detailed documentation of the system model, see the config.cfg file and the YAML files in the machine files and network files directories.


Cluster model

The cluster model covers resource allocation, inter-cluster characteristics, and runtime modalities.

| Metadata information | Description |
|---|---|
| number_of_nodes | Number of nodes used on the cluster |
| task_per_node | Number of cores used per node of the cluster |
| number_of_processes | Number of processes running on the cluster |
| inter_cluster or heterogeneous | Communication across clusters (0: inter-cluster disabled, 1: inter-cluster enabled) |
| number_of_iterations | Number of iterations for the program run |
| dim_x, dim_y, dim_z | Problem size for the program run; higher-dimensional parameters are disregarded for lower-dimensional problems |


Interconnect model

The interconnect model covers the YAML-formatted network files as well as the choice of communication performance model and mode of communication.

| Metadata information | Description |
|---|---|
| name | Name of the YAML-formatted interconnect file |
| intra_chip | Communication inside a chip |
| inter_chip | Communication across chips |
| inter_node | Communication across nodes |
| latency | Latency in seconds for the various kinds of interconnect |
| bandwidth | Bandwidth in GB/s for the various kinds of interconnect |
| eager_limit | Data size in bytes at which the communication mode changes from the eager to the rendezvous protocol (see the sketch after this table) |
| waitio_mode | Mode of the WaitIO MPI library (socket, file or hybrid) |
| comm_model | Performance model for communication (0: LogGP, 1: HOCKNEY) |
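To illustrate what the eager_limit entry expresses (this is only a sketch of the concept, not DisCostiC's internal code), the protocol choice for a given message boils down to a threshold test:

#include <cstddef>

enum class Protocol { Eager, Rendezvous };

// Messages up to the eager limit are modelled with the eager protocol;
// larger messages switch to the rendezvous protocol.
Protocol select_protocol(std::size_t message_bytes, std::size_t eager_limit)
{
    return (message_bytes <= eager_limit) ? Protocol::Eager : Protocol::Rendezvous;
}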

For illustration, a few examples of available network files are given below:

| Network file (YAML format) | Description |
|---|---|
| InfiniBand_WaitIO_intercluster | Communication across clusters for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_internode | Communication across nodes for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_interchip | Communication across chips for the WaitIO MPI library and InfiniBand interconnect |
| InfiniBand_WaitIO_intrachip | Communication inside the chip for the WaitIO MPI library and InfiniBand interconnect |
| Tofu-D_WaitIO_internode | Communication across nodes for the WaitIO MPI library and Tofu-D interconnect |
| Tofu-D_WaitIO_interchip | Communication across chips for the WaitIO MPI library and Tofu-D interconnect |
| Tofu-D_WaitIO_intrachip | Communication inside the chip for the WaitIO MPI library and Tofu-D interconnect |
| InfiniBand_IntelMPI_internode | Communication across nodes for the Intel MPI library and InfiniBand interconnect |
| InfiniBand_IntelMPI_interchip | Communication across chips for the Intel MPI library and InfiniBand interconnect |
| InfiniBand_IntelMPI_intrachip | Communication inside the chip for the Intel MPI library and InfiniBand interconnect |
| OmniPath_IntelMPI_internode | Communication across nodes for the Intel MPI library and Omni-Path interconnect |
| OmniPath_IntelMPI_interchip | Communication across chips for the Intel MPI library and Omni-Path interconnect |
| OmniPath_IntelMPI_intrachip | Communication inside the chip for the Intel MPI library and Omni-Path interconnect |


Node model

The node model covers the YAML-formatted machine files, the compiler settings, and the computation performance model.

| Metadata information | Description |
|---|---|
| name | Name of the YAML-formatted processor file |
| sockets_per_node | Number of sockets in one node |
| ccNUMA_domains_per_socket | Number of ccNUMA domains in one socket |
| cores_per_ccNUMA_domain | Number of cores in one ccNUMA domain |
| FP_instructions_per_cycle | Floating-point instructions (ADD, MUL) per cycle |
| FP_operations_per_instruction_(SP/DP) | Single- or double-precision floating-point operations per instruction |
| clock_frequency | Clock frequency in GHz |
| memory_bandwidth | Memory bandwidth in GB/s |
| compiler-flags | STD and SIMD optimizations of the compiler |
| pmodel | Performance model for computation (Roofline, ECM) |

For illustration, a few examples of available machine files are given below:

| Machine file (YAML format) | Description |
|---|---|
| A64FX | 48-core Fujitsu A64FX FX1000 CPU @ 1.8 GHz |
| IcelakeSP_Platinum-8360Y | 36-core Intel(R) Xeon(R) Icelake SP Platinum 8360Y CPU @ 2.4 GHz |
| CascadelakeSP_Gold-6248_SNC | 20-core Intel(R) Xeon(R) Cascadelake SP Gold 6248 CPU @ 2.5 GHz |
| SkylakeSP_Gold-5122 | 4-core Intel(R) Xeon(R) Skylake SP Gold 5122 CPU @ 3.6 GHz |
| SkylakeSP_Gold-6148 | 20-core Intel(R) Xeon(R) Skylake SP Gold 6148 CPU @ 2.4 GHz |
| SkylakeSP_Gold-6148_SNC | 20-core Intel(R) Xeon(R) Skylake SP Gold 6148 CPU with SNC enabled @ 2.4 GHz |
| SkylakeSP_Platinum-8147 | 24-core Intel(R) Xeon(R) Skylake SP Platinum 8174 CPU @ 3.1 GHz (2.7 GHz used) |
| BroadwellEP_E5-2630v4 | 18-core Intel(R) Xeon(R) Broadwell EN/EP/EX E5-2697 v4 CPU @ 2.3 GHz |
| BroadwellEP_E5-2697v4_CoD | 18-core Intel(R) Xeon(R) Broadwell EN/EP/EX E5-2697 v4 CPU with CoD enabled @ 2.3 GHz |
| HaswellEP_E5-2695v3 | 14-core Intel(R) Xeon(R) Haswell EN/EP/EX E5-2695 v3 CPU @ 2.3 GHz |
| HaswellEP_E5-2695v3_CoD | 14-core Intel(R) Xeon(R) Haswell EN/EP/EX E5-2695 v3 CPU with CoD enabled @ 2.3 GHz |
| IvyBridgeEP_E5-2660v2 | 10-core Intel(R) Xeon(R) IvyBridge EN/EP/EX E5-2660 v2 CPU @ 2.2 GHz |
| IvyBridgeEP_E5-2690v2 | 10-core Intel(R) Xeon(R) IvyBridge EN/EP/EX E5-2690 v2 CPU @ 3.0 GHz |
| SandyBridgeEP_E5-2680 | 8-core Intel(R) Xeon(R) SandyBridge EN/EP E5-2680 CPU @ 2.7 GHz |


DisCostiC help

The DisCostiC help (./discostic --help) reads as follows:

Distributed Cost in Cluster (DisCostiC)
Version : v1.0.0
Author : Ayesha Afzal <ayesha.afzal@fau.de>
Copyright : 2024 HPC, FAU Erlangen-Nuremberg. All rights reserved

------------------------------------
 Arguments for ./discostic
------------------------------------
     --version, -v          show simulator's version information and exit
     --help, -h             show this help message and exit

------------------------------------
 Application model for config.cfg
------------------------------------
     benchmark_kernel       name of the kernel used in the program
     kernel_mode            mode of the kernel (FILE, SRC, LBL, COMP)

------------------------------------
 Cluster model for config.cfg
------------------------------------
     heterogeneous          a bool flag to enable or disable the second system (1: enabled, 0: disabled)
     number_of_iterations   number of iterations of the program
     dim_x, dim_y, dim_z    problem size; high-dimensional parameters will be disregarded for low-dimensional problems.
     task_per_node          number of running processes on one node
     number_of_processes    total number of running processes

------------------------------------
 Interconnect model for config.cfg
------------------------------------
     interconnect           name of the interconnect
     MPI_library            name of the MPI library for the first system (IntelMPI, WaitIO)
     comm_model             performance model of communication (0: LogGP, 1: HOCKNEY)
     waitio_mode            mode of the WaitIO MPI library (socket, file or hybrid)

------------------------------------
 Node model for config.cfg
------------------------------------
     micro_architecture     name of the YAML machine file
     compiler-flags         STD and SIMD optimizations (-O3 -xCORE-AVX512 -qopt-zmm-usage=high, -O3 -xHost -xAVX, -Kfast -DTOFU); if not set, flags are taken from the YAML formatted machine file.
     pmodel                 performance model of computation (ECM, Roofline)
     vvv                    a bool flag to enable or disable the verbose node output (0: disabled, 1: enabled)
     cache-predictor        cache prediction with layer conditions or cache simulation with pycachesim (LC, SIM)
     penalty                runtime penalty for the computation model in nanoseconds, used only in LBL or COMP mode

------------------------------------
 Delay injection mode for config.cfg
------------------------------------
     delay                  a bool flag to enable or disable the delay injection (0: disabled, 1: enabled)
     delay_intensity        intensity of delay as a multiple of computation time of one iteration
     delay_rank             process rank of the injected delay
     delay_timestep         iteration for the occurrence of the injected delay

------------------------------------
 Noise injection mode for config.cfg
------------------------------------
     noise                  a bool flag to enable or disable the noise injection (0: disabled, 1: enabled)
     noise_intensity        intensity of random noise, i.e., rand() % noise_intensity 
     noise_start_timestep   starting iteration for the injected noise
     noise_stop_timestep    ending iteration for the injected noise

------------------------------------
 Output for config.cfg
------------------------------------
     filename               output file name that contains details for the debug purpose
     chromevizfilename      output file name that contains all time-rank tracing data for visualization with Chrome tracing browser
     Verbose                a bool flag to enable or disable the verbose output (0: disabled, 1: enabled)

Automating DSEL generation through static analysis

To perform the static analysis, the following procedure must be followed before running the DisCostiC batch script:

pip install -r staticanalysis/requirements.txt
python staticanalysis/Convert-<benchmark_kernel>.py
| File | Description |
|---|---|
| Convert-<benchmark_kernel>.py | A helper script that takes the original code, locates the code loops and communication through annotation, identifies user-defined variables (such as dim_x and dim_y in the Cartesian stencil heat.c code), and finally generates the DSEL code. |
| requirements.txt | Each (sub)dependency is listed and pinned with "==" to a specific package version. The project uses the lightweight Python tree data structure anytree==2.8.0 and the type-hint backports typing_extensions==4.4.0 for Python 3.7+. These dependencies are installed (normally in a virtual environment) via pip install -r staticanalysis/requirements.txt. The generated tree follows the same syntax as the original C/C++ code. |

For the specified program, this will produce the following files:

| File | Location | Description |
|---|---|---|
| <benchmark_kernel>.c | nodelevel/kernels | Contains only the generated computational loop kernel of the MPI-parallel program. |
| <benchmark_kernel>_<kernel_mode>.hpp | test | Contains the entire generated code expressed in the DisCostiC DSEL language. |

Essential DisCostiC routines

The goal is to offer convenient, compact and practically usable application programming interfaces (APIs) with appropriate abstractions. These APIs provide information about the call tree and the attributes for communication (volume, mode, and protocol) and computation (data access pattern, flop count).

1. DisCostiC->Rank_Init(DisCostiC::Indextype rank);

2. DisCostiC->SetNumRanks(DisCosTiC::Datatype numrank);

3. DisCostiC->Exec("LBL:STREAM_TRIAD", 
					DisCostiC::Event depending_operations);
   DisCostiC->Exec("COMP:TOL=2.0||TnOL=1.0|TL1L2=3.0|TL2L3=6.0|TL3Mem=14.2", 
   					DisCostiC::Event depending_operations);
   DisCostiC->Exec("FILE:copy.c//BREAK:COPY//./BroadwellEP_E5-2697v4_CoD.yml//18//-D N 100000", 
   					DisCostiC::Event depending_operations);
   DisCostiC->Exec("SRC:double s, a[N],b[N],c[N];\n\nfor(int i=0; i<N; i++)\n\ta[i]=b[i]+s*c[i];\n//BREAK:DAXPY//./BroadwellEP_E5-2697v4_CoD.yml//18//-D N 100000",
   					DisCostiC::Event depending_operations);

4. DisCostiC->Send(const void *message, 
                    DisCosTiC::Datatype sending_message_size_in_bytes, 
                    DisCostiC::Datatype sending_message_datatype, 
                    DisCostiC::Indextype destination_rank,
                    DisCostiC::Commtype communicator, 
                    DisCosTiC::Event depending_operations);

5. DisCostiC->Isend(const void *message, 
                    DisCosTiC::Datatype sending_message_size_in_bytes, 
                    DisCostiC::Datatype sending_message_datatype, 
                    DisCostiC::Indextype destination_rank,
                    DisCostiC::Commtype communicator,
                    DisCostiC::Request *request,
                    DisCosTiC::Event depending_operations);

6. DisCostiC->Recv(const void *message, 
                    DisCosTiC::Datatype receiving_message_size_in_bytes, 
                    DisCostiC::Datatype receiving_message_datatype, 
                    DisCostiC::Indextype source_rank,
                    DisCostiC::Commtype communicator, 
                    DisCosTiC::Event depending_operations);

7. DisCostiC->Irecv(const void *message,
                    DisCosTiC::Datatype receiving_message_size_in_bytes, 
                    DisCostiC::Datatype receiving_message_datatype, 
                    DisCostiC::Indextype source_rank,
                    DisCostiC::Commtype communicator,
                    DisCostiC::Request *request,
                    DisCosTiC::Event depending_operations);
					
8. DisCostiC->Rank_Finalize(); 

Code blueprint as DisCostiC input

The following DSEL code example shows the simplest illustration of how the domain knowledge of the applications is expressed in the DisCostiC language:

DisCostiC::Event recv, send, comp;
for (auto rank : DisCostiC::getRange(systemsize))
{
   DisCostiC->Rank_Init(rank);
   DisCostiC->SetNumRanks(systemsize);
   /* initialization */
   comp = DisCostiC->Exec("STREAM_TRIAD", recv);
   send = DisCostiC->Send(res, 8, MPI_DOUBLE, ((rank + 1) % systemsize), MPI_COMM_WORLD, comp);
   if (rank != 0)
   {
      recv = DisCostiC->Recv(res, 8, MPI_DOUBLE, rank - 1, MPI_COMM_WORLD, comp);
   }
   else
   {
      recv = DisCostiC->Recv(res, 8, MPI_DOUBLE, systemsize - 1, MPI_COMM_WORLD, comp);
   }
   DisCostiC->Rank_Finalize();
}


Code documentation: suite of C++ data structures and enumerated types

The HTML and LaTeX documentation can be generated with doxygen. To build the documentation, run the following command in the top-level directory, where the Doxyfile resides:

doxygen Doxyfile

To read the documentation, point a web browser at doc/html/index.html (HTML) or open doc/DisCostiC.pdf (LaTeX/PDF).

  • Single Operation: accessors for local operations and their individual information at a certain grid point
| Metadata information | Description |
|---|---|
| bufSize | Number of bytes (data size) transmitted in the communication operation (no real buffer size for comp, added only for completeness) |
| DepOperations | Dependencies for blocking routines, i.e., other operations that depend on this operation |
| IdepOperations | Dependencies for non-blocking routines, i.e., other operations that depend on this operation |
| depCount | Number of dependencies of this operation |
| label | Index/identifier of this operation for each rank |
| target | Rank of the target/partner (source for recv, destination for send; no real target for comp, added only for completeness) |
| rank | Owning rank of this operation |
| tag | Tag of this operation (no real tag for comp, added only for completeness) |
| node | Node or processing element of this operation |
| network | Type of network for this operation |
| time | End time of this operation |
| starttime | Start time of this operation |
| type | Type of this operation |
enum Operation_t {        // Operation_t defines the different operation types of entities
    SEND = 1,             // Send operation type
    RECV = 2,             // Recv operation type
    COMP = 3,             // Computation operation type
    MSG  = 4              // Message operation type
};
| Metadata information | Description |
|---|---|
| mode | Mode of this current calling operation |
enum Mode_t {             // Mode_t defines the operation type of communication
    NONBLOCKING,          // Routines return when the operation starts, so the next operation does not execute before the previous one has started; e.g., Isend, Irecv.
    BLOCKING              // Routines return only when the operation completes, so the next operation does not execute before the previous one has finished; e.g., Send, Recv.
};
  • Performance model
| Metadata information | Description |
|---|---|
| model | Analytic first-principle performance model for computation and communication |
enum Model_t {            // Model_t defines the performance model used
    Roofline,             // Fundamental computation model type
    ECM,                  // Refined computation model type
    LOGGP,                // Refined communication model type
    HOCKNEY               // Fundamental communication model type
};
  • Custom data types, keywords and high-level class functionality

    • accessors for local vectors and matrices and their individual components without the need for index accesses
    • abstract base classes for the AST-generated grid to access the operations and their associated features
    • solver-specific data types, e.g., time steps, operations, identifiers, etc., are declared globally (or part of a special global declaration block)
    * DisCostiC::Timetype
    * DisCostiC::Datatype
    * DisCostiC::Indextype
    * DisCostiC::Networktype

MPI-parallelized DisCostiC implementation

The underlying principle of the parallel simulation is that each operation's data is transmitted in its entirety, via blocking or non-blocking MPI routines, to the processes it communicates with; a minimal sketch of this idea follows the table below.

| Term | Description |
|---|---|
| simulated processes P_i | Processes from the application point of view |
| simulator processes Q_i | Processes from the parallel simulation framework point of view |
| master simulator process Q_0 | Master process from the parallel simulation framework point of view |
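As a minimal sketch of this principle (the struct layout and field order below are hypothetical; the actual code packs each operation into a 13-element double array, as the excerpts that follow show), the metadata of one simulated operation is flattened into a fixed-size buffer and shipped to the simulator process of the partner rank with a single MPI call:

#include <mpi.h>

// Hypothetical, simplified operation record; DisCostiC's real operation
// object carries more fields than shown here.
struct SimOp { double endtime; int rank, target, tag, type, node; };

// Pack the metadata into a flat buffer of doubles ...
void pack(const SimOp &op, double buf[6])
{
    buf[0] = op.endtime; buf[1] = op.rank;  buf[2] = op.target;
    buf[3] = op.tag;     buf[4] = op.type;  buf[5] = op.node;
}

// ... and hand it to the simulator process that handles the partner rank.
void ship(const SimOp &op, int dest, MPI_Comm comm)
{
    double buf[6];
    pack(op, buf);
    MPI_Send(buf, 6, MPI_DOUBLE, dest, op.tag, comm);
}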

Initialization: In the MPI implementation, only the master process of the default communicator makes print calls for debug and verbose output reporting, and all other processes of the new communicator initialize root operations only once per run.

MPI-parallelized vs. serialized implementation (code excerpt):
int ierr, process_Rank, size_Of_Cluster, sucesss, N_process_Rank, N_size_Of_Cluster;
MPI_Comm newcomm;
ierr = MPI_Init(NULL, NULL);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);

MPI_Comm_split(MPI_COMM_WORLD, (process_Rank) < 1 ? 1 : 2, 1, &newcomm);
ierr = MPI_Comm_rank(newcomm, &N_process_Rank);
ierr = MPI_Comm_size(newcomm, &N_size_Of_Cluster);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
ierr = MPI_Comm_size(MPI_COMM_WORLD, &size_Of_Cluster);
std::queue<int> test;
MPI_Status status;
MPI_Request request;
    
DisCostiC_Timetype starttimeProcess;
DisCostiC::DisCostiC_OP operation, operation_og;		
int iter = 0;

iter = ((rankLocalIt->getNumOps()) / 3);

if (rank == N_process_Rank)
{
 operation_og = copy(op, operation_og);
 operation = op;
 starttimeProcess = operation.starttime;
}
op.numOpsInQueue = numOpsInQueue++;
queue.emplace(op);

Start of simulation: Loop over the number of iterations and the number of steps per iteration. Only processes attached to the new communicator ("newcomm") take part in the main simulation.

MPI-parallelized vs. serialized implementation (code excerpt):
rank = N_process_Rank;
bool cycle = true;
int iter_num = 1;
do{ //!< over the number of iterations
  do{ //!< over the number of steps per iteration
do{
  queue.pop();
Within each ccNUMA domain, runtime corrections are made by monitoring the number of processes that are concurrently engaged in each computation and communication step.

SEND operation: The currently active process sends the operation object, serialized as an array, to the process listed as operation.target; the target is specific to the communication pattern of the simulated program.

MPI-parallelized vs. serialized implementation (code excerpt):
operation.type = DisCostiC::Operation_t::RECV;
operation.starttime = operation.time;

int temp = oSuccessor[operation.rank][operation.node];
                        
double arr[13];                   
*serialize(operation, temp, arr);
if (operation.mode == DisCostiC::Mode_t::NONBLOCKING){
  MPI_Isend(&arr, 13, MPI_DOUBLE, operation.target, operation.tag, newcomm, &request);
}else{
  MPI_Send(&arr, 13, MPI_DOUBLE, operation.target, operation.tag, newcomm);
}

finishedRankList.push_back(operation.rank);
operation.type = DisCostiC::Operation_t::RECV;
cycle = true;
std::swap(operation.rank,operation.target);
queue.emplace(operation);

RECV operation: The currently running process receives the operation object, serialized as an array, from the process listed as operation.target.

MPI-parallelized vs. serialized implementation (code excerpt):
double brr[13];
if (operation.mode == DisCostiC::Mode_t::NONBLOCKING){
  MPI_Irecv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, operation.tag, newcomm, &request);
}else{
  MPI_Recv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, operation.tag, newcomm, &status);
}
*deserialize(operation, temp, brr);                   
oSuccessor[operation.rank][operation.node] = brr[12];
cycle=true;
operation.numOpsInQueue = numOpsInQueue++;
finishedRankList.push_back(operation.rank);

operation.type = DisCostiC::Operation_t::MSG;
finishedRankList.push_back(operation.rank);
DisCostiC::DisCostiC_queueOP matchedOP;
if(M.listmatch(operation, &UMQ[operation.rank], &matchedOP))
{
 ...
}
else {
 DisCostiC::DisCostiC_queueOP nOp;
 nOp.bufSize = operation.bufSize;
 nOp.src = operation.target;
 nOp.tag = operation.tag;
 nOp.label = operation.label;
 recvQueue[operation.rank].push_back(nOp);
}

MSG operation: If the MSG is not found, the cycle continues (cycle = true) until it is found.

MPI-parallelized vs. serialized implementation (code excerpt):
if ((earliestFinish = operation.time) <= ...){
 ...
 operation.numOpsInQueue = numOpsInQueue++;
 ...
 iter_num++;		
 cycle = false;		        		
}else{
 ...
 cycle = true;
}
if ((earliestFinish = operation.time) <= ...){
 ...
 DisCostiC::DisCostiC_queueOP matchedOP;
 if(M.listmatch(operation, &recvQueue[operation.rank], &matchedOP)) { 
 ...
 }else { 
 #ifdef USE_VERBOSE 
 std::cout << "[MSG ==> NOT FOUND] in receive queue - add to unexpected queue"<< std::endl;
 #endif
 DisCostiC::DisCostiC_queueOP nOp;
 nOp.bufSize = operation.bufSize;
 nOp.src = operation.target;
 nOp.tag = operation.tag;
 nOp.label = operation.label;
 nOp.starttime = operation.time; // when it was started
 UMQ[operation.rank].push_back(nOp);
 }
}else{
 ...
 queue.emplace(operation);
}

Adding new operations to the queue in sorted order

MPI-parallelized vs. serialized implementation (code excerpt):
newOps=false;

newOps = true;
op.numOpsInQueue=numOpsInQueue++; 

queue.emplace(op);

End of the while statements: The do loops over the number of iterations and the number of steps per iteration end here. The necessary information gathered from all processes of the new communicator is sent to the master process of the default communicator.

MPI-parallelized vs. serialized implementation (code excerpt):
while (cycle);
 cycle = true;
 operation_og.time = oSuccessor[operation.rank][operation.node];
 operation = copy(operation_og, operation);
 } while (iter_num <= iter);
 double arr[13];
 double brr[13];
 *serialize(operation, oSuccessor[rank][operation.node], arr);
 *finalize(oSuccessor[rank][operation.node], numRanks, numOpsInQueue, rank, operation.node, brr);
 MPI_Send(&brr, 13, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
 }
 else if (process_Rank == 0)
 {
 bool cycle = true;
 double brr[13];
 while (test.size() < numRanks)
 {
   MPI_Recv(&brr, 13, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
   if (status.MPI_TAG == 1)
   {
     test.emplace(status.MPI_SOURCE);
     oSuccessor[brr[3]][brr[4]] = brr[0];
     numOpsInQueue = brr[2];
   }
 }
}
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &process_Rank);
MPI_Barrier(MPI_COMM_WORLD);
while(!queue.empty() || newOps)

Planned features for further development

  • Threading model beyond message passing
  • Networking-level contention model
  • Energy consumption model

References about the theory of potential application scenarios for DisCostiC

[1]: A. Afzal et al.: Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study. DOI:10.1109/CLUSTER.2019.8890995

[2]: A. Afzal et al.: Desynchronization and Wave Pattern Formation in MPI-Parallel and Hybrid Memory-Bound Programs. DOI:10.1007/978-3-030-50743-5_20

[3]: A. Afzal et al.: Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact. DOI:10.1007/978-3-030-78713-4_19

[4]: A. Afzal et al.: The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs. DOI:10.1109/TPDS.2022.3221085

[5]: A. Afzal et al.: Analytic performance model for parallel overlapping memory-bound kernels. DOI:10.1002/cpe.6816

[6]: A. Afzal et al.: Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications. DOI:10.1007/978-3-031-30442-2_12

[7]: A. Afzal et al.: Making applications faster by asynchronous execution: Slowing down processes or relaxing MPI collectives. DOI:10.1016/j.future.2023.06.017

License

AGPL-3.0

Disclaimer

Note: Please report any bugs to the issue tracker or contact ayesha.afzal@fau.de.

Acknowledgement

This work is funded by the KONWHIR project OMI4PAPPS.

Contact


@AyeshaAfzal91


Ayesha Afzal, Erlangen National High Performance Computing Center (NHR@FAU)

mailto: ayesha.afzal@fau.de
