Self-Supervised Deep Learning based Surrogate Models for Fault-Tolerant Edge Computing

imperial-qore/DeepFT

DeepFT

The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute and communication capacities and faulty application behavior in the presence of overload conditions. Although the large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. As a result, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can lose accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. To cope with this, we propose a novel modeling approach, called DeepFT, to proactively avoid system overloads and their adverse effects by optimizing the task scheduling and migration decisions. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system, and co-simulation based self-supervised learning to dynamically adapt the model in volatile settings. It offers a highly scalable solution, as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts, respectively. Extensive experimentation on a Raspberry-Pi based edge cluster with DeFog benchmarks shows that DeepFT can outperform state-of-the-art baseline methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault-detection, reducing service deadline violations by up to 37% while also improving response time by up to 9%.

Quick Test

Clone the repository.

git clone https://github.com/imperial-qore/DeepFT.git
cd DeepFT/

Install dependencies.

sudo apt -y update
python3 -m pip install --upgrade pip
python3 -m pip install matplotlib scikit-learn
python3 -m pip install -r requirements.txt
python3 -m pip install torch==1.7.1+cpu torchvision==0.8.2+cpu -f https://download.pytorch.org/whl/torch_stable.html
export PATH=$PATH:~/.local/bin
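
After installing, it can be useful to confirm that the packages named in the commands above are actually visible to the interpreter. The following is a minimal sketch, not part of the repository: the helper name `installed_versions` is hypothetical, and it assumes Python 3.8 or newer so that `importlib.metadata` is available. It does not check the packages listed in requirements.txt.

```python
from importlib import metadata

# Distribution names taken from the install commands above.
REQUIRED = ["matplotlib", "scikit-learn", "torch", "torchvision"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing.

    Hypothetical helper for illustration; not part of DeepFT.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(REQUIRED).items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If torch or torchvision is reported as missing, or with a version other than the pinned 1.7.1+cpu and 0.8.2+cpu, rerunning the corresponding install command is a reasonable first step.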

Change line 118 in main.py to use one of the implemented fault-tolerance techniques (DeepFTRecovery, PCFTRecovery, DFTMRecovery, ECLBRecovery, AWGGRecovery or TopoMADRecovery), then run the code using the following command.

python3 main.py
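
Since the instruction above refers to a fixed line number, and line numbers can shift between commits, it may help to look at line 118 and its neighbours before editing. The sketch below is an assumption-labeled convenience, not part of the repository: `show_lines` is a hypothetical helper that only reads the file and makes no claim about what main.py contains.

```python
from pathlib import Path


def show_lines(path, center, context=2):
    """Return (line_number, text) pairs around a 1-indexed line.

    Hypothetical helper for illustration; it only reads the file.
    """
    lines = Path(path).read_text().splitlines()
    start = max(center - context, 1)
    end = min(center + context, len(lines))
    return [(n, lines[n - 1]) for n in range(start, end + 1)]


if __name__ == "__main__":
    # Print line 118 of main.py with two lines of context on each side.
    for number, text in show_lines("main.py", 118):
        marker = ">" if number == 118 else " "
        print(f"{marker} {number:4d}  {text}")
```

If the marked line does not mention one of the recovery class names listed above, the file has probably changed since this README was written, and searching main.py for `Recovery` is likely a safer way to locate the line to edit.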

External Links

| Items | Contents |
|---|---|
| Pre-print | (coming soon) |
| Contact | Shreshth Tuli (@shreshthtuli) |
| Funding | Imperial President's scholarship |

Cite this work

Our work has been accepted at the IEEE Conference on Computer Communications (INFOCOM) 2023. Cite it using the BibTeX entry below.

@inproceedings{tuli2022deepft,
  title={{DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model}},
  author={Tuli, Shreshth and Casale, Giuliano and Cherkasova, Ludmila and Jennings, Nicholas R},
  booktitle={IEEE Conference on Computer Communications (INFOCOM)},
  year={2023},
  organization={IEEE}
}

License

BSD-3-Clause. Copyright (c) 2022, Shreshth Tuli. All rights reserved.

See the License file for more details.