Cybonto/OllaBench
Evaluating LLMs' Cognitive Behavioral Reasoning for Cybersecurity



Latest News

  • [2024/05/19] OllaBench GUI demo and project brief are available at DevPost
  • [2024/05/09] Benchmarking of models is running while a white paper is being developed. Early results indicate that mainstream LLMs do not score high, a sign of a good benchmark.
  • [2024/04/19] OllaBench v.0.2 is out. The benchmark dataset and sample LLM response results have been uploaded. A white paper with benchmark analysis of mainstream open-weight models, including the newly released Llama3, will be shared as soon as possible!
  • [2024/02/20] OllaBench v.0.2 Development Agenda is out. Will be twice as powerful 💥
  • [2024/02/12] 90sec Project Video Brief
  • [2024/02/07] 🚀 OllaGen1 is Launched!

Overview

The grand challenge that most CEOs care about is maintaining the right level of cybersecurity at a minimum cost, as companies are unable to reduce cybersecurity risks despite increased cybersecurity investments [1]. Fortunately, the problem can be explained via interdependent cybersecurity (IC) [2] as follows. First, optimizing cybersecurity investments in existing large interdependent systems is a well-known, difficult non-convex problem that still awaits new solutions. Second, smaller systems are growing in complexity and interdependence. Last, new low-frequency, near-simultaneous, macro-scale risks such as global pandemics, financial shocks, and geopolitical conflicts have compound effects on cybersecurity.

Human factors account for half of the long-lasting challenges in IC, as identified by Kianpour et al. [3] and Laszka et al. [4]. Unfortunately, human-centric research within the context of IC is under-explored, while research on general IC makes unrealistic assumptions about human factors. Fortunately, the dawn of Large Language Models (LLMs) promises a much more efficient way to research and develop solutions to problems across domains. In cybersecurity, Zero-Trust principles require evaluation, validation, and continuous monitoring of all components, and LLMs are no exception.

Therefore, OllaBench was born to help both researchers and application developers conveniently evaluate their LLMs within the context of cybersecurity compliance or non-compliance behaviors.

IMPORTANT
The dataset generator and test datasets are in the OllaGen1 subfolder.
You need either a local LLM stack (NVIDIA TensorRT-LLM with Llama_Index in my case) or an OpenAI API key to generate new OllaBench datasets.
OpenAI throttles requests per minute, which may cause significant delays when generating big datasets (see the backoff sketch below).
When the OllaBench white paper is published (later in MARCH), the OllaBench benchmark scripts and leaderboard results will be made available.
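Because the actual generation loop lives in OllaGen1.py, the snippet below is only a minimal sketch of how a rate-limited API call can be wrapped with exponential backoff to ride out throttling; the `with_backoff` helper and the `generate_scenario` callable are illustrative assumptions, not part of OllaBench.

```python
import time
import random

def with_backoff(call, max_retries=6, base_delay=2.0):
    """Retry `call` with exponential backoff when the API throttles us."""
    for attempt in range(max_retries):
        try:
            return call()
        except Exception as err:  # ideally catch the client's RateLimitError specifically
            if attempt == max_retries - 1:
                raise
            # Sleep 2, 4, 8, ... seconds plus jitter before retrying.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Throttled ({err}); retrying in {delay:.1f}s")
            time.sleep(delay)

# Hypothetical usage around a dataset-generation call:
# result = with_backoff(lambda: generate_scenario(prompt))
```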

[Figure: OllaBench flows diagram]

Quick Start

Evaluate with your own code

You can grab the evaluation datasets and run them with your own evaluation code. Note that the datasets (CSV files) are for zero-shot evaluation. It is recommended that you modify the OllaBench Generator 1 (OllaGen1) params.json with your desired specs and run OllaGen1.py to generate fresh, UNSEEN datasets that match your custom needs. Check the OllaGen-1 README for more details.
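As a starting point, here is a minimal sketch of a zero-shot evaluation loop over one of the CSV datasets. The column names ("Prompt", "Answer") and exact-match scoring are illustrative assumptions; check the OllaGen-1 README for the actual dataset schema and intended scoring.

```python
import csv

def evaluate(csv_path, ask_model):
    """Score a model on a zero-shot OllaGen1-style CSV dataset.

    `ask_model` is any callable that maps a prompt string to an answer string.
    Column names below are assumptions; see the OllaGen-1 README.
    """
    correct = total = 0
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            prediction = ask_model(row["Prompt"])
            # Exact-match scoring is a simplification; your own
            # evaluation code may score differently.
            correct += prediction.strip() == row["Answer"].strip()
            total += 1
    return correct / total if total else 0.0
```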

Use OllaBench

OllaBench will evaluate your models in the Ollama model zoo using the OllaGen1 default datasets. You can quickly spin up Ollama with Docker Desktop/Compose and download LLMs into Ollama. Please check the Installation section below for more details.
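Once Ollama is running, any model you have pulled can be queried over its local REST API (by default on port 11434). The snippet below is a minimal sketch using the `requests` library; the model name "llama2" is just an example and must already be pulled into Ollama.

```python
import requests

# Minimal sketch: query a locally running Ollama server.
# Assumes Ollama is listening on its default port (11434) and that
# the "llama2" model has already been pulled.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Which employee is LESS likely to comply with the policy?",
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=300,
)
response.raise_for_status()
print(response.json()["response"])
```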

Tested System Settings

The following system settings were tested successfully for running the OllaGen1 dataset generator and OllaBench.

  • Primary Generative AI model: Llama2
  • Python version: 3.10
  • Windows version: 11
  • GPU: NVIDIA GeForce RTX 3080 Ti
  • Minimum RAM: [your normal RAM use] + [the size of your intended model] (see the worked example after this list)
  • Disk space: [your normal disk use] + [minimum software requirements] + [the size of your intended model]
  • Minimum software requirements: NVIDIA CUDA 12 (NVIDIA CUDA Toolkit), Microsoft MPI, MSVC compiler, llama_index
  • Additional system requirements: Docker Compose and other related Docker requirements if you use the Docker stack
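As a hypothetical sizing example (not a measured requirement): if your system normally uses about 8 GB of RAM and you intend to run a 7B-parameter model that occupies roughly 4 GB when quantized, you should budget at least 12 GB of RAM. The same additive logic applies to disk space.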

Quick Install of Key Components

This quick install is for a single Windows PC use case (without Docker) and for when you need to use OllaGen1 to generate your own datasets. I assume you have an NVIDIA GPU installed.

  • Go to TensorRT-LLM for Windows and follow the Quick Start section to install TensorRT-LLM and the prerequisites.
  • If you plan to use OllaGen1 with a local LLM, go to Llama_Index for TensorRT-LLM and follow the instructions to install Llama_Index and prepare models for TensorRT-LLM.
  • If you plan to use OllaGen1 with OpenAI, please follow OpenAI's instructions to add the API key to your system environment. You will also need to change the llm_framework param in the OllaGen1 params.json to openai, as shown in the sketch after this list.
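The switch can also be made programmatically. The snippet below is a minimal sketch, assuming only that params.json is a flat JSON file containing the llm_framework key mentioned above; the file location shown is an assumption based on the repository layout.

```python
import json
from pathlib import Path

# Point OllaGen1 at OpenAI instead of a local TensorRT-LLM stack.
# Assumes params.json sits in the OllaGen1 subfolder and contains
# the llm_framework key described above.
params_path = Path("OllaGen1/params.json")
params = json.loads(params_path.read_text(encoding="utf-8"))
params["llm_framework"] = "openai"
params_path.write_text(json.dumps(params, indent=4), encoding="utf-8")
print("llm_framework set to:", params["llm_framework"])
# Remember that the OpenAI client also expects the OPENAI_API_KEY
# environment variable to be set, per OpenAI's instructions.
```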

Commands to check for key software requirements

  • Python: python -V
  • NVIDIA CUDA 12: nvcc -V
  • Microsoft MPI: mpiexec -help

Installation

The following instructions are mainly for the Docker use case.

Windows Linux Subsystem

If you are using Windows, you need to install WSL. The Windows Subsystem for Linux (WSL) is a compatibility layer introduced by Microsoft that enables users to run Linux binary executables natively on Windows 10, Windows Server 2019, and later versions. WSL provides a Linux-compatible kernel interface developed by Microsoft, which can then run a Linux distribution on top of it. See here for information on how to install it. In this setup, we use Debian Linux. You can verify that Linux was installed by executing wsl -l -v. You enter WSL by executing the command "wsl" from a Windows command-line window.

Please disregard this section if you are using a Linux system.

Nvidia Container Toolkit

The NVIDIA Container Toolkit is a powerful set of tools that allows users to build and run GPU-accelerated Docker containers. It leverages NVIDIA GPUs to enable the deployment of containers that require access to NVIDIA graphics processing units for computing tasks. This toolkit is particularly useful for applications in data science, machine learning, and deep learning, where GPU resources are critical for processing large datasets and performing complex computations efficiently. Installation instructions are here.

Please disregard if your computer does not have a GPU.

NVIDIA TensorRT-LLM

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs. A usage sketch follows the links below.

  • LlamaIndex Tutorial on Installing TensorRT-LLM
  • TensorRT-LLM Github page
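For orientation only, here is a minimal sketch of what inference looks like through TensorRT-LLM's high-level LLM API in recent releases. The model path is a placeholder, and the exact API surface varies by version, so treat the tutorial links above as authoritative.

```python
from tensorrt_llm import LLM, SamplingParams

# Minimal sketch of TensorRT-LLM's high-level LLM API (recent releases).
# The checkpoint path below is a hypothetical placeholder; this API
# builds the TensorRT engine automatically on first load.
llm = LLM(model="./models/llama2-7b")

params = SamplingParams(max_tokens=64, temperature=0.0)
for output in llm.generate(["Summarize zero-trust in one sentence."], params):
    print(output.outputs[0].text)
```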

Ollama

  • Install Docker Desktop and Ollama with these instructions.

Run OllaGen-1

Please go to the OllaGen1 subfolder and follow the instructions to generate the evaluation datasets.