count-vector-clustering

The algorithm in this repository evaluates the semi-supervised count-vector-clustering approach described in [1] on the HDFS log data set [2]. In short, the approach creates a count vector for each event sequence in the training data set. A new sequence from the test data set is predicted as anomalous when its count vector is not similar enough to any of the training count vectors, where the similarity metric is based on the l1-norm. In addition, all new sequences that contain event types not seen during training are predicted as anomalous.
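The approach described above can be illustrated with a minimal sketch. It is not the code from count_vector_clustering.py: the function names are hypothetical, the normalization of the l1-norm (division by the total number of events in both vectors) is an assumption chosen so that distances fall between 0 and 1, and so is the strict comparison against the threshold.

```python
from collections import Counter


def count_vector(sequence):
    """Map an event sequence to a count vector (event type -> occurrences)."""
    return Counter(sequence)


def normalized_l1_distance(a, b):
    """L1 distance between two count vectors, scaled to the range [0, 1].

    Assumption: dividing by the total event count of both vectors is one
    possible normalization; the repository script may scale differently.
    """
    keys = set(a) | set(b)
    l1 = sum(abs(a[k] - b[k]) for k in keys)  # Counter returns 0 for missing keys
    total = sum(a.values()) + sum(b.values())
    return l1 / total if total else 0.0


def min_distance(train_vectors, known_events, test_vector):
    """Distance from a test vector to its closest training vector."""
    # Sequences with event types never seen in training get the maximum distance.
    if any(event not in known_events for event in test_vector):
        return 1.0
    return min(
        (normalized_l1_distance(tv, test_vector) for tv in train_vectors),
        default=1.0,
    )


def is_anomalous(train_vectors, known_events, test_sequence, threshold):
    """Assumption: a sequence is anomalous if its distance exceeds the threshold."""
    distance = min_distance(train_vectors, known_events, count_vector(test_sequence))
    return distance > threshold


# Toy data with made-up event identifiers.
train_sequences = [["E1", "E2", "E2"], ["E1", "E3"]]
train_vectors = [count_vector(s) for s in train_sequences]
known_events = {event for s in train_sequences for event in s}

print(is_anomalous(train_vectors, known_events, ["E1", "E2", "E2"], 0.14))  # False (distance 0)
print(is_anomalous(train_vectors, known_events, ["E1", "E2"], 0.14))        # True (distance 0.2)
print(is_anomalous(train_vectors, known_events, ["E1", "E9"], 0.14))        # True (unseen event type)
```

In this sketch the order of events within a sequence is ignored; only the number of occurrences per event type matters.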

Run the algorithm using the following command:

ubuntu@user-1:~/count-vector-clustering$ python3 count_vector_clustering.py
Threshold=0.14
TP=16826
FP=376
TN=552990
FN=12
TPR=R=0.9992873262857822
FPR=0.000679477958530159
TNR=0.9993205220414698
P=0.9781420765027322
F1=0.9886016451233842
ACC=0.9993195417780303
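The reported rates follow from the four confusion-matrix counts using the standard definitions. The sketch below recomputes them; the function name detection_metrics is hypothetical and not part of the script.

```python
def detection_metrics(tp, fp, tn, fn):
    """Compute standard detection metrics from confusion-matrix counts."""
    tpr = tp / (tp + fn)          # true positive rate, also called recall (R)
    fpr = fp / (fp + tn)          # false positive rate
    tnr = tn / (tn + fp)          # true negative rate
    precision = tp / (tp + fp)    # precision (P)
    f1 = 2 * precision * tpr / (precision + tpr)
    acc = (tp + tn) / (tp + fp + tn + fn)
    return {"TPR": tpr, "FPR": fpr, "TNR": tnr, "P": precision, "F1": f1, "ACC": acc}


# Counts taken from the output shown above.
metrics = detection_metrics(tp=16826, fp=376, tn=552990, fn=12)
for name, value in metrics.items():
    print(f"{name}={value}")
```

The printed values should agree with the output above up to floating-point rounding.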

The algorithm achieves an F1-Score of 98.86% on the HDFS data set. This exceeds the detection performance achieved by n-gram-based detection (F1-Score of 95.14%) as well as that of many deep learning approaches. The following figure shows that most anomalous test sequences (lower half of the plot) have a high distance to the training sequences, while most normal test sequences (upper half of the plot) have a low distance. Note that the vertical axis, which shows the number of sequences for each class, is scaled logarithmically. The vertical dashed line indicates the threshold used for classification; sequences on the left side of the line are predicted as normal, while sequences on the right side of the line are predicted as anomalous. Sequences involving event types that did not occur in the training data receive a distance of 1 and are thus always predicted as anomalous.

Feel free to change the algorithm parameters (see python3 count_vector_clustering.py --help). We also tested the approach on the HDFS log data set used in the deep-loglizer, where we achieve an F1-Score of 98.77%. Use the following command to run the script on that data set (note that the threshold is set for best detection performance).

ubuntu@user-1:~/count-vector-clustering$ python3 count_vector_clustering.py --data_dir "data/hdfs_loglizer/" --threshold 0.03
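One way a threshold with the best detection performance could be found is a simple sweep over candidate values. The sketch below is an illustration only and does not reflect how the thresholds in this repository were chosen: the function names, the toy distances, and the candidate list are all hypothetical, and the sweep assumes that the minimum distance of every labeled test sequence has already been computed.

```python
def f1_at_threshold(normal_distances, anomalous_distances, threshold):
    """F1-Score when sequences with a distance above the threshold are flagged."""
    tp = sum(d > threshold for d in anomalous_distances)
    fn = len(anomalous_distances) - tp
    fp = sum(d > threshold for d in normal_distances)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(normal_distances, anomalous_distances, candidates):
    """Return the candidate threshold with the highest F1-Score (first one on ties)."""
    return max(
        candidates,
        key=lambda t: f1_at_threshold(normal_distances, anomalous_distances, t),
    )


# Made-up distances for illustration, not taken from the HDFS data.
normal = [0.0, 0.01, 0.02, 0.2]
anomalous = [0.1, 0.5, 1.0, 1.0]
candidates = [0.0, 0.03, 0.14, 0.3]

print(best_threshold(normal, anomalous, candidates))  # 0.03
```

A threshold selected this way depends on labeled data, so the resulting F1-Score is an upper bound for that set of candidates rather than an estimate for unseen data.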

If you use any of the scripts provided in this repository, please cite the following publication:

[1] Landauer M., Skopik F., Höld G., Wurzenberger M. (2022): A User and Entity Behavior Analytics Log Data Set for Anomaly Detection in Cloud Computing. 2022 IEEE International Conference on Big Data - 6th International Workshop on Big Data Analytics for Cyber Intelligence and Defense (BDA4CID 2022), December 17-20, 2022, Osaka, Japan. IEEE. [PDF]

[2] HDFS log data set taken without changes from the DeepLog implementation by wuyifan18
