ai-alignment

Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023

economics ai-safety gametheory experimental-economics behavioral-economics prisoners-dilemma ai-alignment experimental-psychology social-dilemmas gpt-3 gpt-4 llm principal-agent-problem

Updated Mar 1, 2024
Python

agencyenterprise / PromptInject

Star

PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022

machine-learning agi language-models ai-safety adversarial-attacks ai-alignment ml-safety gpt-3 large-language-models prompt-engineering chain-of-thought agi-alignment

Updated Feb 26, 2024
Python

tomekkorbak / pretraining-with-human-feedback

Star

Code accompanying the paper Pretraining Language Models with Human Preferences

reinforcement-learning gpt language-models ai-safety ai-alignment pretraining decision-transformers rlhf

Updated Feb 13, 2024
Python

wesg52 / sparse-probing-paper

Star

Sparse probing paper full code.

ai-safety interpretability ai-alignment mechanistic-interpretability

Updated Dec 17, 2023
Jupyter Notebook

Giskard-AI / awesome-ai-safety

Sponsor

Star

📚 A curated list of papers & technical articles on AI Quality & Safety

Updated Oct 13, 2023

EzgiKorkmaz / adversarial-reinforcement-learning

Star

Reading list for adversarial perspective and robustness in deep reinforcement learning.

Updated Sep 18, 2023

UCSC-VLAA / Sight-Beyond-Text

Star

This repository includes the official implementation of our paper "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"

alignment vlm ai-alignment vision-language vicuna llm mllm llava llama2

Updated Sep 15, 2023
Python

EveryOneIsGross / sinewCHAT

Star

sinewCHAT uses instanced chatbots to emulate neural nodes to enrich and generate positive weighted responses.

ml ai-alignment rrn openai-api

Updated Jul 16, 2023
Python

EveryOneIsGross / areteCHAT

Star

A persona chat based on the VIA Character Strengths. Reads emotional tone and summons appropriate virtue to respond.

sentiment-analysis chatbot via chatbot-framework ai-alignment virtues

Updated Jul 14, 2023
Python

dit7ya / awesome-ai-alignment

Star

A curated list of awesome resources for getting-started-with and staying-in-touch-with Artificial Intelligence Alignment research.

awesome awesome-list ai-safety ai-alignment

Updated Jul 14, 2023

EveryOneIsGross / bbBOT

Star

bbBOT is a felixble persona based branching binary sentiment chatbot.

openai tree-structure chatbot-framework ai-alignment python-ai openai-chatgpt

Updated Jul 13, 2023
Python

liondw / Signal-Alignment

Star

An initiative to create concise and widely shareable educational resources, infographics, and animated explainers on the latest contributions to the community AI alignment effort. Boosting the signal and moving the community towards finding and building solutions.

education design ai ai-alignment

Updated Jul 9, 2023

lets-make-safe-ai / make-safe-ai

Star

How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚

ai agi artificial-intelligence artificial-general-intelligence ai-safety ai-alignment

Updated Mar 29, 2023

ai-fail-safe / safe-reward

Star

a prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation

failsafe ai-safety anomaly-detection ai-alignment fail-safe

Updated Nov 8, 2022
Python

ai-fail-safe / honeypot

Star

a project to detect environment tampering on the part of an agent

failsafe ai-safety anomaly-detection ai-alignment fail-safe

Updated Oct 31, 2022

ai-fail-safe / mulligan

Star

a library designed to shut down an agent exhibiting unexpected behavior providing a potential "mulligan" to human civilization; IN CASE OF FAILURE, DO NOT JUST REMOVE THIS CONSTRAINT AND START IT BACK UP AGAIN

failsafe ai-safety anomaly-detection ai-alignment fail-safe

Updated Oct 30, 2022

Improve this page

Add a description, image, and links to the ai-alignment topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the ai-alignment topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ai-alignment

Here are 26 public repositories matching this topic...

IQTLabs / daisybell

RLHFlow / Directional-Preference-Alignment

MinghuiChen43 / awesome-trustworthy-deep-learning

riceissa / aiwatch

phelps-sg / llm-cooperation

agencyenterprise / PromptInject

tomekkorbak / pretraining-with-human-feedback

wesg52 / sparse-probing-paper

Giskard-AI / awesome-ai-safety

EzgiKorkmaz / adversarial-reinforcement-learning

UCSC-VLAA / Sight-Beyond-Text

EveryOneIsGross / sinewCHAT

EveryOneIsGross / areteCHAT

dit7ya / awesome-ai-alignment

EveryOneIsGross / bbBOT

liondw / Signal-Alignment

lets-make-safe-ai / make-safe-ai

ai-fail-safe / safe-reward

ai-fail-safe / honeypot

ai-fail-safe / mulligan

Improve this page

Add this topic to your repo