
Commit 9b7607d ("Last revision!")
1 parent 5a91e75

File tree

5 files changed: +117 −21 lines


0_Artifact_Testing/.DS_Store

0 Bytes
Binary file not shown.

0_Artifact_Testing/README.md

Lines changed: 3 additions & 11 deletions
@@ -1,9 +1,9 @@
 # Case Study Artifacts for ICSE 2025 Artifact Evaluation
 
 ## Overview
-This repository contains the necessary artifacts to reproduce the case study results presented in the associated research paper (Pre-print: https://arxiv.org/abs/2501.12560). The artifacts provided include data files, pre-trained models, and scripts used for the evaluation.
+This repository contains the necessary artifacts to reproduce the case study results presented in the associated research paper (Pre-print: https://arxiv.org/abs/2501.12560). The artifacts provided include data (extracted features), pre-trained models, and scripts used for the evaluation.
 
-**Note:** This package focuses on reproducing the case study results (included in the paper) due to the computational and time-intensive nature of processing the entire dataset.
+**Note:** This package focuses on reproducing the case study results (included in the paper) due to the computationally intensive and time-consuming nature of processing the entire dataset. To reduce computational overhead, we provide extracted dynamic and static features from the sample PixelCNN model. These pre-processed features allow the scripts to run efficiently while producing the same results as those discussed in the paper.
 
 ---
 
@@ -106,13 +106,5 @@ Upon running the provided scripts, the following outputs will be generated:
 
 ---
 
-## Known Limitations
-- Full experiment replication requires access to high-performance computing resources (e.g., Compute Canada).
-- To reduce computational overhead, we provided extracted dynamic and static features from the sample PixelCNN model. These pre-processed features allow the scripts to run efficiently while producing the same results as those discussed in the paper.
-
----
-
 ## Verifying the Results
-The provided artifacts enable reviewers to reproduce and verify the results of the case study without referring to the paper. Simply run the scripts as instructed above, and compare the generated outputs (fault detection, fault categorization, and root cause analysis) with the expected results outlined in this document.
-
-By following the steps and using the provided artifacts, reviewers can confirm the validity of the methodologies and results presented in the research paper.
+The provided artifacts enable reviewers to reproduce and verify the results of the case study without referring to the paper. Simply run the scripts as instructed above, and compare the generated outputs (fault detection, fault categorization, and root cause analysis) with the expected results outlined in this document. By following the steps and using the provided artifacts, reviewers can confirm the validity of the methodologies and results presented in the research paper.

README.md

Lines changed: 13 additions & 10 deletions
@@ -1,9 +1,11 @@
 # **Replication Package for DEFault**
 
-Welcome to the replication package for **DEFault**, a framework designed to improve the detection and diagnosis of faults in Deep Neural Networks (DNNs). This repository provides all the necessary code and data to reproduce the experiments from our paper, which has been accepted in ICSE - Research Track 2025. Pre-print: https://arxiv.org/abs/2501.12560
+Welcome to the replication package for **DEFault**, a framework designed to improve the detection and diagnosis of faults in Deep Neural Networks (DNNs). This repository provides all the necessary code and data to reproduce the experiments from our paper, which has been accepted to the ICSE 2025 Research Track.
 
 **"Improved Detection and Diagnosis of Faults in Deep Neural Networks using Hierarchical and Explainable Classification."**
 
+A pre-print of the paper is available at: https://arxiv.org/abs/2501.12560
+
 ---
 ## **How DEFault Works**
 
@@ -69,9 +71,10 @@ The figure below illustrates the workflow of DEFault, showing its fault detectio
 - **C_RootCauseAnalysis/**: Root cause analysis scripts.
 - **`e_Evaluation/`**: Scripts to evaluate DEFault on real-world and seeded faults.
 - **`f_Figures/`**: Figures used in the paper.
-- **`g_Dataset/`**: Labeled datasets for training and evaluation.
+- **`g_Dataset/`**: Labeled datasets for training and testing.
 - **`h_CohenKappaAnalysis/`**: Scripts for dataset consistency validation using Cohen's Kappa.
 - **`i_CaseStudy/`**: Scripts for case studies on real-world models (e.g., PixelCNN).
+- **`j_HPC_Slurm/`**: Example SLURM job script for Compute Canada, with all required configuration.
 
 ---
 
@@ -80,7 +83,7 @@ The figure below illustrates the workflow of DEFault, showing its fault detectio
 ### **Operating System**
 Tested on:
 - Ubuntu 20.04 LTS or later
-- CentOS 7+ (for HPC environments such as Compute Canada)
+- HPC environments such as Compute Canada (Graham cluster)
 
 Compatible with:
 - Windows 10/11 (via Windows Subsystem for Linux - WSL2)
@@ -95,11 +98,11 @@ Compatible with:
 
 **Recommended:**
 - GPU: NVIDIA with CUDA support
-- HPC access (e.g., Compute Canada) for large-scale execution
+- HPC access (e.g., Compute Canada) for the complete experiment
 
 ### **Software Requirements**
 
-- **Python Version:** 3.8 or later
+- **Python Version:** 3.10 or later
 - **Dependencies:** Install via `requirements.txt`:
 ```bash
 pip install -r requirements.txt
@@ -114,15 +117,15 @@ default_env\Scripts\activate # On Windows
 
 ---
 
-## **Usage: Whole Dataset vs. Sample Data**
+## **Usage: Complete Experiment vs. Lightweight Verification**
 
 **Important:**
-- Running the **whole dataset** requires significant computational resources and time.
-- Running the **sample data** is **recommended**, as it provides a quick and effective way to verify the framework's functionality.
+- Running the **Complete Experiment** on the whole dataset requires significant computational resources and time.
+- Running the **Lightweight Verification** on a sample DNN program is **recommended**, as it provides a quick and effective way to verify the framework's functionality.
 
 ---
 
-## **Usage: Sample Data**
+## **Usage: Lightweight Verification**
 
 The 0_Artifact_Testing directory provides all necessary artifacts to reproduce case study results with minimal computational overhead. It includes:
 
@@ -137,7 +140,7 @@ The expected result for the sample data is provided inside the directory.
 
 ---
 
-## **Usage: Whole Dataset**
+## **Usage: Complete Experiment**
 
 ### **1. Data Collection**
 

j_HPC_Slurm/README.md

Lines changed: 60 additions & 0 deletions
# **SLURM Job Script for HPC Execution**

## **Overview**
This repository includes a **SLURM job script** (`run_script.slurm`) designed to execute scripts efficiently on **Compute Canada’s HPC clusters** (e.g., **Graham, Narval, Beluga**). The script automates **job scheduling, dependency setup, execution, and runtime tracking**, making it suitable for running our workflow.

---

## **Usage Instructions**

### **1. Preparing the Environment**
Before submitting the job, ensure that:
- You have access to a **Compute Canada account**.
- You are working within a **Linux-based HPC environment** that supports **SLURM job scheduling**.
- Your scripts and dependencies are ready to run.

---

### **2. Creating the SLURM Script**
We provide `run_script.slurm` as an example of the script used in our experiments. Modify it to match your specific requirements:
- **Job Name**: Update the `--job-name` value (`deepcrime_task` in the example).
- **Email Notifications**: Replace `your_email@domain.com` with your email address.
- **Account Name**: Update `--account=your_account_name` to your Compute Canada account.
- **Script Execution**: Replace the `python run_deepcrime.py FNN_72328867_correct.py change_dropout_rate` line with your own script and arguments.
- **Dependencies**: Modify the `pip install` command as needed.

---

### **3. Submitting the Job**
Once configured, submit the SLURM script to the cluster:
```bash
sbatch run_script.slurm
```

To check job status:
```bash
squeue --me
```

To cancel a running job:
```bash
scancel JOB_ID
```

---

### **4. Viewing Job Output**
SLURM automatically generates a log file for each job. It can be viewed with:
```bash
cat slurm-<job_id>.out
```
Replace `<job_id>` with the actual job number assigned by SLURM.

---

### **5. Expected Runtime**
- **Full-scale replication** (complete dataset): can take several days, depending on workload and hardware availability. For optimal performance, **running on Compute Canada’s GPU-enabled nodes (P100, V100, or T4) is recommended**.

---

For further assistance, refer to Compute Canada’s **[SLURM Job Submission Guide](https://docs.computecanada.ca/wiki/Running_jobs)**.
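Taken together, the submission and log-inspection steps above can be run as a single session. The sketch below is cluster-only and illustrative: capturing the job ID via `sbatch --parsable` is a standard SLURM option that the guide above does not itself use, shown here purely as a convenience.

```shell
# Cluster-only sketch: submit the job, then inspect its log after completion.
# "sbatch --parsable" prints just the numeric job ID (a standard SLURM option).
job_id=$(sbatch --parsable run_script.slurm)

# Confirm the job is queued or running
squeue --me

# After the job finishes (SLURM emails you if --mail-type=ALL is set),
# read the captured stdout/stderr
cat "slurm-${job_id}.out"
```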

j_HPC_Slurm/run_script.slurm

Lines changed: 41 additions & 0 deletions
```bash
#!/bin/bash

#SBATCH --time=72:00:00                     # Maximum execution time (hh:mm:ss)
#SBATCH --gpus-per-node=1                   # Number of GPUs per node
#SBATCH --cpus-per-task=1                   # Number of CPU cores per task
#SBATCH --constraint=[p100|v100|t4]         # GPU types (P100, V100, or T4)
#SBATCH --mem-per-cpu=80G                   # Memory per CPU core
#SBATCH --job-name=deepcrime_task           # Job name
#SBATCH --mail-user=your_email@domain.com   # Email for notifications
#SBATCH --mail-type=ALL                     # Notify on job start, end, and failure
#SBATCH --account=your_account_name         # Compute Canada account

# Record the start time
start_time=$(date +%s)

# Load the required Python module
module load python/3.10

# Create and activate a virtual environment in the job's temporary directory
virtualenv --no-download "$SLURM_TMPDIR/env"
source "$SLURM_TMPDIR/env/bin/activate"

# Install dependencies
pip install numpy tensorflow==2.15.1 matplotlib progressbar scikit-learn \
    termcolor h5py pandas statsmodels networkx patsy scipy psutil --no-index

# Run the workload (e.g., the DeepCrime script from Fault Seeding)
python run_deepcrime.py FNN_72328867_correct.py change_dropout_rate

# Record the end time and compute the elapsed duration
end_time=$(date +%s)
duration=$((end_time - start_time))
hours=$((duration / 3600))
minutes=$(( (duration % 3600) / 60 ))
seconds=$((duration % 60))

# Display runtime
echo "The script took $hours hours, $minutes minutes, and $seconds seconds to run."

# Deactivate the virtual environment
deactivate
```
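The runtime-reporting arithmetic at the end of the script can be sanity-checked in isolation. The sketch below uses a fixed, hypothetical duration of 7523 seconds rather than a measured one:

```shell
# Check the hours/minutes/seconds split used at the end of run_script.slurm,
# using a fixed, hypothetical duration of 7523 seconds.
duration=7523
hours=$((duration / 3600))
minutes=$(( (duration % 3600) / 60 ))
seconds=$((duration % 60))
echo "The script took $hours hours, $minutes minutes, and $seconds seconds to run."
```

This prints "The script took 2 hours, 5 minutes, and 23 seconds to run.", confirming that the integer arithmetic partitions the total correctly.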
