Impelon/log-summarization

A thesis investigating the use of large language models for summarizing application logs.
About

This project investigates how well modern language models can summarize text from application logs. Compared to the previous work LogSummary, which used a complex pipeline based on TextRank, language models outperformed LogSummary on its own dataset after only a little fine-tuning. Skim through the accompanying thesis for more details on the datasets, model architecture and results.

Installation

  1. clone the repository via
    git clone https://github.com/Impelon/log-rca-summarization.git
    
  2. optionally use virtualenv to create a virtual environment for the Python packages
  3. open the code-folder in a terminal
  4. make sure the system dependencies are installed; the required packages can be found in code/apt_requirements.txt and code/apt_optional_requirements.txt; if you use the apt package manager you can install these directly via
    sudo apt-get update
    xargs -o sudo apt-get install < apt_requirements.txt
    xargs -o sudo apt-get install < apt_optional_requirements.txt
    
  5. install dependencies via
    pip3 install -r requirements.txt
    pip3 install -r optional_requirements.txt
    
  6. if you want to compare with LogSummary, you also need to install it; you can initialize it as a submodule:
    git submodule init
    git submodule update
    pip3 install -r LogSummary/requirements.txt
    
    LogSummary's dependencies are not well maintained, and its requirements.txt is missing some of them. Furthermore, some code changes are necessary to include missing functions. For more details, check LogSummary's project page. Additionally, a word2vec model is required, which can be trained on log data with their other framework, Log2Vec. Note: paths longer than 100 characters cause buffer overflows when training Log2Vec, if the limit is not changed manually. (LogSummary is only used for comparison and for accessing its dataset; it is not needed otherwise.)

Prepare datasets

  1. install the repository
  2. download a suitable dataset, e.g. the Hadoop dataset from its DOI
  3. open the code-folder in a terminal
  4. run dataset pre-processing via
    python3 -m "preprocess_dataset" <dataset-type> "<original-dataset-path>" "<destination-path>"
    
    e.g.
    python3 -m "preprocess_dataset" hadoop "../data/Hadoop/raw" "../data/Hadoop/processed"
    
    This also works for preprocessing LogSummary's dataset:
    python3 -m "preprocess_dataset" logsummary "../LogSummary/data/summary/logs" "../data/LogSummary/processed"
    

Analyze logs

  1. install the repository and prepare the datasets
  2. open the code-folder in a terminal
  3. run log-analysis via
    python3 -m "util.loganalysis"
    
    e.g.
    python3 -m "util.loganalysis" ../data/Hadoop/processed --category "Disk full" --group-by-columns "File" --partition-minimum-size 200 --output-type common-events
    
  4. optionally prepare a preset for the log-analysis, see the different configurations used in the code/log_analysis_configs-folder; the analysis can then be run directly with such a configuration (the @-prefix mechanism is sketched below), e.g.
    python3 -m "util.loganalysis" ../data/Hadoop/processed @../data/Hadoop/loganalysis_configs/disk-full.json --output-type common-events
    

Pre-train models

  1. install the repository and prepare the datasets
  2. open the code-folder in a terminal
  3. run pre-training via
    python3 -m "pretrain_model"
    
    e.g.
    python3 -m "pretrain_model" --model_class "BartForConditionalGeneration" --model_name_or_path "facebook/bart-base" --masking_algorithm_preset "text-infilling" --csv_paths "../data/Hadoop/processed/log-instances"/* --input_field_name "SimplifiedMessage" --output_dir "../models/bart-base.hadoop/trained" --do_train
    
  4. optionally prepare a preset for the pre-training configuration, see pretraining_config.py used in the model-folder; the pre-training can then be run directly with such a configuration, e.g.
    python3 -m "pretrain_model" @../models/bart-base.hadoop/pretraining_config.py --do_train
    
  5. in an environment with multiple GPUs you may want to use DDP (PyTorch's DistributedDataParallel) for training; in that case the pre-training should be started with torchrun
    torchrun --nproc_per_node <number_of_gpus_you_have> -m "pretrain_model" [arguments_for_pretrain_model]...
    
    For example like this:
    torchrun --nproc_per_node 2 -m "pretrain_model" @../models/bart-base.hadoop/pretraining_config.py --do_train
    

Fine-tune models

  1. install the repository, prepare the datasets and optionally pre-train a model
  2. open the code-folder in a terminal
  3. run fine-tuning via
    python3 -m "finetune_model"
    
    e.g.
    python3 -m "finetune_model" --model_name_or_path "facebook/bart-base" --dataset_path "../data/Hadoop/processed" --loganalysis_arguments="@../data/Hadoop/loganalysis_configs/disk-full.json" --excluded_events_loganalysis_arguments="@../data/Hadoop/loganalysis_configs/normal.json" --input_field_name "SimplifiedMessage" --output_dir "../models/bart-base.hadoop/finetuned" --do_train
    
  4. optionally prepare a preset for the fine-tuning configuration, see finetuning_config.py used in the model-folder; the fine-tuning can then be run directly with such a configuration, e.g.
    python3 -m "finetune_model" @../models/bart-base.hadoop/finetuning_config.py --do_train
    
  5. in an environment with multiple GPUs you may want to use DDP (PyTorch's DistributedDataParallel) for training; in that case the fine-tuning should be started with torchrun
    torchrun --nproc_per_node <number_of_gpus_you_have> -m "finetune_model" [arguments_for_finetune_model]...
    
    For example like this:
    torchrun --nproc_per_node 2 -m "finetune_model" @../models/bart-base.hadoop/finetuning_config.py --do_train
    

Monitoring training

Given that a suitable training-monitoring platform compatible with 🤗 Transformers is installed, the training scripts will automatically create logs for it. The default location is the runs directory inside the output_dir, but it can be controlled with the --logging_dir option of the training scripts.

For example, with tensorboard installed, visualizing previous training runs could look like this:

tensorboard --logdir models/bart-base.hadoop/trained/runs
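
The logging behaviour described above matches what the 🤗 Transformers Trainer does out of the box; assuming the training scripts build on its TrainingArguments, the relevant settings look roughly like this (a sketch, not the scripts' actual code):

from transformers import TrainingArguments

# report_to selects the monitoring platform(s); logging_dir overrides the
# default "<output_dir>/runs" location mentioned above.
args = TrainingArguments(
    output_dir="../models/bart-base.hadoop/trained",
    logging_dir="../models/bart-base.hadoop/trained/runs",
    report_to=["tensorboard"],
    logging_steps=50,
)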

Use models

Trained models can conveniently be used for making predictions using 🤗 Transformers pipelines.

To use a pre-trained model for mask-filling (the pipeline is limited to a mask-length of 1):

>>> import transformers
>>> mask_filler = transformers.pipeline("fill-mask", "models/bart-base.hadoop/trained")
>>> sentence = "Scheduled snapshot <mask> at 10 second(s)."
>>> for prediction in mask_filler(sentence):
...     print("{:6.2%} {}".format(prediction["score"], prediction["sequence"]))

 9.89% Scheduled snapshot count at 10 second(s).
 5.18% Scheduled snapshot snapshot at 10 second(s).
 4.78% Scheduled snapshot size at 10 second(s).
 2.88% Scheduled snapshot update at 10 second(s).
 2.74% Scheduled snapshot start at 10 second(s).

To use a fine-tuned model for summarization:

>>> import transformers
>>> summarizer = transformers.pipeline("summarization", "models/bart-base.hadoop/finetuned")
>>> summary = summarizer("...")
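
The summarization pipeline returns a list with one dictionary per input; the generated text is stored under the "summary_text" key, so the result can be printed like this:

>>> print(summary[0]["summary_text"])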

Run the trace-visualization tool

  1. install the repository
  2. open the code-folder in a terminal
  3. run via
    python3 -m "tracesviz"
    
  4. open any CSV file that contains structured log data with traces; logs can be structured into CSV directly via
    python3 -m "util.logparsing" structure
    
