Lorenz Wolf, Sangwoong Yoon, Ilija Bogunovic
This is the official repository for the paper *This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs*.
We investigate the effect of deceptive agents on the performance of the popular Mixture-of-Agents (MoA) architecture, uncovering severe robustness and reliability issues. As an initial step toward mitigating these vulnerabilities, we propose several unsupervised defense mechanisms that recover a large proportion of the lost performance.
# install requirements
pip install -r requirements.txt
cd alpaca_eval
pip install -e .
cd ..
cd FastChat
pip install -e ".[model_worker,llm_judge]"
cd ..
# setup api keys
export TOGETHER_API_KEY=<TOGETHER_API_KEY>
export OPENAI_API_KEY=<OPENAI_API_KEY>
To run MoA on the Quality task with partial information, Mixtral as the aggregator, and all truthful agents, execute:
python main.py
To instead include a deceptive agent in the MoA on the Quality task with partial information, you can place one deceptive agent, which ignores references, in the second layer by executing the following command:
python main.py deceptive=000_003_ignore
To run MoA on AlpacaEval with Llama-3.1-70B-Instruct as the aggregator and one deceptive agent in the second layer, run:
python main.py experiment=alpaca moa=llama_70B deceptive.deceptive_model_dict_name="000_001"
The config files are organised in `conf/` and are split between `moa`, `experiment`, `deceptive`, and `prompts`. The main `Config` consists of four primary components defined in `conf/`:
- Mixture of Agents (MoA) configuration: `moa`
- Experiment configuration: `experiment`
- Deceptive behavior configuration: `deceptive`
- Prompts configuration: `prompts` (only used for AlpacaEval)
For specific examples, please refer to the corresponding directories and `.yaml` files.
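For orientation, here is a hypothetical sketch (not the actual `main.py`; the config name `config` is an assumption) of how these groups compose under Hydra, which is what the CLI overrides above suggest:

```python
# Hypothetical sketch of Hydra-style config composition; the real entry
# point is main.py and the config groups live in conf/.
import hydra
from omegaconf import DictConfig

@hydra.main(config_path="conf", config_name="config", version_base=None)
def main(cfg: DictConfig) -> None:
    # Groups can be swapped on the CLI, e.g. `experiment=alpaca moa=llama_70B`.
    print(cfg.experiment.task, cfg.moa.aggregator)

if __name__ == "__main__":
    main()
```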
Configuration for the Mixture of Agents.
class MoAConfig:
    aggregator: str  # Aggregator type/name
    reference_models: List[str]  # List of reference models to use
    temperature: float = 0.7  # Temperature for model outputs
    max_tokens: int = 2048  # Maximum tokens for generation
    rounds: int = 1  # Number of rounds to run
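As an illustration (the model identifiers below are placeholders, not the exact models used in the paper), a config instance might look like:

```python
# Illustrative only; model identifiers are placeholders.
moa = MoAConfig(
    aggregator="mistralai/Mixtral-8x22B-Instruct-v0.1",
    reference_models=[
        "Qwen/Qwen1.5-72B-Chat",
        "meta-llama/Llama-3-70b-chat-hf",
    ],
    temperature=0.7,
    max_tokens=2048,
    rounds=1,
)
```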
Settings for deceptive agents within the MoA.
class DeceptiveConfig:
    deceptive_model_dict_name: Optional[str] = None  # Name of deceptive model dictionary; None corresponds to all truthful agents
    deceptive_ignore_refs: bool = False  # Whether deceptive agents ignore the reference responses
Deceptive agents are specified via the `deceptive_model_dict_name`.
In all cases, each agent is assigned an integer encoding its deceptive status:
- On both tasks, truthful agents are denoted with status `0`.
- Deceptive agents on AlpacaEval are instantiated with status `1`.
- For Quality, promoters are denoted with status `2` and opposers with status `3`.
Our implementation supports the following approaches to specifying the deceptive agent configuration:
- Hard coded locations of deceptive agents
- Random sampling of locations
Here, the sequence of integers defining the deceptive status of each agent is passed as `deceptive_model_dict_name`, with layers separated by `_`. For example, for a 3-3-1 MoA, the deceptive model dict placing a single deceptive agent for AlpacaEval in the second layer at the third position is `000_001`.
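To make the encoding concrete, here is a minimal sketch (illustrative only, not taken from the codebase) of how such a string maps to per-layer statuses:

```python
# Illustrative sketch (not from the repository): decode a
# deceptive_model_dict_name such as "000_001" into per-layer statuses.
def decode_status_string(name: str) -> list[list[int]]:
    """Each '_'-separated group is one proposer layer; each digit is the
    deceptive status of one agent (0 = truthful, 1 = deceptive on
    AlpacaEval, 2 = promoter, 3 = opposer on Quality)."""
    return [[int(d) for d in layer] for layer in name.split("_")]

print(decode_status_string("000_001"))
# [[0, 0, 0], [0, 0, 1]] -> one deceptive agent in layer 2, position 3
```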
Instead of hard-coding the entire dictionary, here we specify the deceptive status and the number of deceptive agents, and their locations are sampled at random. For example, `sample_3_2` places 3 agents with deceptive status `2` at random locations in the MoA.
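The sketch below (again illustrative; the helper name, signature, and the assumption that sampled specs reduce to hard-coded status strings are mine, not the repository's) shows one way such a spec could be expanded:

```python
# Illustrative sketch (not from the repository): expand a spec such as
# "sample_3_2" into a hard-coded status string with random placements.
import random

def expand_sampled_spec(spec: str, layer_sizes: list[int], seed: int = 42) -> str:
    """spec = 'sample_<num_deceptive>_<status>'. Returns a status string
    like '202_020': <num_deceptive> agents get the given status at random
    positions; all remaining agents are truthful (status 0)."""
    _, num, status = spec.split("_")
    rng = random.Random(seed)
    positions = [(i, j) for i, n in enumerate(layer_sizes) for j in range(n)]
    chosen = set(rng.sample(positions, int(num)))
    return "_".join(
        "".join(status if (i, j) in chosen else "0" for j in range(n))
        for i, n in enumerate(layer_sizes)
    )

print(expand_sampled_spec("sample_3_2", [3, 3]))  # e.g. '202_020'
```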
Controls experimental settings and output handling.
class ExperimentConfig:
    task: str  # Task identifier/name
    output_path: str  # Path for output files
    num_samples: Optional[int] = None  # Number of samples to process
    save_references: bool = False  # Whether to save reference outputs
    use_subpassages: bool = True  # Enable subpassage processing (Quality)
    hard_only: bool = False  # Process only hard cases (Quality)
    num_proc: int = 16  # Number of processes to use
    seed: int = 42  # Random seed
    results_dir: Optional[str] = None  # Directory for results
The supported tasks are `"quality"` and `"alpaca"`.
Defines system and user prompts for different components. These are only used for the alpaca task.
class PromptsConfig:
    deceptive_proposer_system_prompt: str  # System prompt for deceptive proposer
    deceptive_aggregating_proposer_system_prompt: str  # System prompt for aggregating proposer
    deceptive_aggregating_proposer_system_prompt_end: str = ""  # End text for aggregating system prompt
    deceptive_proposer_user_prompt_end: str = ""  # End text for proposer user prompt
    deceptive_aggregating_proposer_user_prompt_end: str = ""  # End text for aggregating user prompt
The prompts used for experiments in our paper are specified in `conf/prompts/alpaca_prompts.yaml`.
To inspect the generated references and responses, we provide a small Streamlit web app. Install Streamlit with `pip install streamlit`.
Set `REF_FILE_PATH` to the path of the JSON file you want to inspect.
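For example (assuming `REF_FILE_PATH` is read from the environment; check `streamlit_quality.py` for how it is actually consumed):

```bash
# Hypothetical: adjust if the script expects the path to be set differently.
export REF_FILE_PATH=test_outputs/quality/000_003_with_ref.json
```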
streamlit run streamlit_quality.py --server.port <port> --server.address 0.0.0.0
Then navigate to `http://<server_address>:<port>` in your browser.
To run defenses on a set of already generated references, run the `defend.py` script. For example, the following command runs the Cluster&Filter defense using k-means and OpenAI's embedding model:
python defend.py \
--model=agg_model \
--defense_mode "clustering" \
--embeddings_type "openai" \
--cluster_method "kmeans" \
--output-path="outputs_test/quality/clustering_defense" \
--reference_path="test_outputs/quality/000_003_with_ref.json" \
--num_samples 1 \
--num-proc 3 \
--baseline_paths="test_outputs/quality/000_003_with_ref.json" \
--baseline_output_types="output" \
--results_dir "test_results/quality/defenses_test"
Once the responses for AlpacaEval have been generated with the MoA, you need to run the AlpacaEval evaluation to obtain win rates.
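Assuming the vendored `alpaca_eval` exposes the standard CLI installed above (the output path below is a placeholder), a typical invocation looks like:

```bash
# Placeholder path; point --model_outputs at the JSON of generated responses.
alpaca_eval --model_outputs outputs_test/alpaca/model_outputs.json
```

Note that the default GPT-4-based annotator requires `OPENAI_API_KEY` to be set.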
We thank together.ai for releasing their codebase, allowing us to build upon it for our work.
If you are using this codebase, please consider citing:
@misc{wolf2025dogeyouexploringdeception,
title={This Is Your Doge, If It Please You: Exploring Deception and Robustness in Mixture of LLMs},
author={Lorenz Wolf and Sangwoong Yoon and Ilija Bogunovic},
year={2025},
eprint={2503.05856},
archivePrefix={arXiv},
primaryClass={cs.CL},
}