A curated list of trustworthy deep learning papers. Daily updating...
Updated May 17, 2024
PromptInject is a framework that assembles prompts in a modular fashion to provide a quantitative analysis of the robustness of LLMs to adversarial prompt attacks. 🏆 Best Paper Awards @ NeurIPS ML Safety Workshop 2022
Code accompanying the paper Pretraining Language Models with Human Preferences
📚 A curated list of papers & technical articles on AI Quality & Safety
Full code for the sparse probing paper.
A curated list of awesome resources for getting-started-with and staying-in-touch-with Artificial Intelligence Alignment research.
Scan your AI/ML models for problems before you put them into production.
How to Make Safe AI? Let's Discuss! 💡|💬|🙌|📚
Website to track people, organizations, and products (tools, websites, etc.) in AI safety
Reading list for adversarial perspective and robustness in deep reinforcement learning.
This repository includes the official implementation of our paper "Sight Beyond Text: Multi-Modal Training Enhances LLMs in Truthfulness and Ethics"
An implementation of iterated distillation and amplification
Code and materials for the paper S. Phelps and Y. I. Russell, Investigating Emergent Goal-Like Behaviour in Large Language Models Using Experimental Economics, working paper, arXiv:2305.07970, May 2023
A project to detect environment tampering on the part of an agent
Q&A system with reflection and automation, similar to Patchwork, Affable, Mosaic
A prototype for an AI safety library that allows an agent to maximize its reward by solving a puzzle in order to prevent the worst-case outcomes of perverse instantiation
bbBOT is a flexible, persona-based, branching binary sentiment chatbot.