Farlamp

Note 2021-07-31: I decided to leave AI alignment research last year. See https://www.lesswrong.com/posts/HDXLTFnSndhpLj2XZ/i-m-leaving-ai-alignment-you-better-stay for details. This repo shows the state of my research at the time I left.

BLUF: Read Overseer failures in SupAmp and ReAmp, then the interesting parts of Training a tiny SupAmp model on easy tasks.

Project definition (CoR):

I'm studying the impact of overseer failure on RL-based IDA,
    because I want to know under what conditions amplification and distillation
    increase or decrease the failure rate,
        in order to help my reader understand whether explicit reliability
        amplification is necessary for IDA to work in practice.

In this project I will:

Take the implementation of iterated distillation and amplification from Christiano et al.'s ‘Supervising strong learners by amplifying weak experts’, introduce overseer failures and see how they influence the overall failure rate.
Adapt the system to reinforcement learning. (It uses supervised learning now.)
Introduce overseer failures in the RL setting and see how they influence the overall failure rate.
Write a paper about the results.
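
To make the question of whether amplification increases or decreases the failure rate concrete, here is a toy sketch. It is not the model used in the project's documents. It assumes that sub-calls fail independently with the same probability, that an amplified answer fails if any of its sub-calls fails, and that reliability amplification means taking a majority vote over three independent copies. The function names are made up for this illustration.

```python
def amplified_failure_rate(p: float, n_calls: int) -> float:
    """Failure rate of an answer that depends on n_calls sub-calls.

    Assumption (illustrative only): each sub-call fails independently with
    probability p, and a single failed sub-call makes the whole answer fail.
    """
    return 1.0 - (1.0 - p) ** n_calls


def majority_of_three_failure_rate(p: float) -> float:
    """Failure rate of a majority vote over three independent copies.

    Assumption (illustrative only): the vote fails if at least two of the
    three copies fail, and the copies fail independently with probability p.
    """
    return 3 * p**2 * (1 - p) + p**3


if __name__ == "__main__":
    p = 0.01
    print("single call:       ", p)
    print("10 sub-calls:      ", amplified_failure_rate(p, 10))
    print("majority of three: ", majority_of_three_failure_rate(p))
```

Under these assumptions, splitting a question into more sub-calls pushes the failure rate up, while voting pushes it down as long as p is below one half. Whether real overseer failures behave like this is exactly what the project set out to investigate.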

Overseer failures in SupAmp and ReAmp contains a more extensive introduction, as well as an explanation of the relevant terms, concepts etc.

For the code see rmoehn/amplification, which is a fork of paulfchristiano/amplification.

Repository contents

Overseer failures in SupAmp and ReAmp – Start here. This is the most polished document so far.
Training a tiny SupAmp model on easy tasks – Report on first experiments.
How to turn SupAmp into ReAmp? – Less detailed and less polished analysis of how to approach the first project phase.
What I need for planning the Farlamp draft – Will contain all the information I need for planning a draft of the paper.
Literature overview – Work in progress: searching, skimming, filtering and summarizing the literature for this project.
Current project outline – Overview of cases and estimates. Most up to date, but doesn't contain upcoming milestones.
Old project outline – Outdated, but does contain some upcoming milestones.

There are more files, but they are only useful to me. The code won't be published here, because it will be based on the code from CSASupAmp, which is subject to a strict publication policy.

Glossary

Term	Definition
CoR	Booth et al.: The Craft of Research
CSASupAmp	Christiano et al.: Supervising strong learners by amplifying weak experts
Est. 5 %	5th percentile of my estimated duration distribution/leftmost point in triangle distribution
Est. mode	mode of my estimated duration distribution
Est. 95 %	95th percentile of my estimated duration distribution/rightmost point in triangle distribution
Farlamp	Failures in RL-based amplification (I just had to come up with a short project name.)
Draft Basis	A template derived from CoR, p. 175, which, when filled in completely, provides all the information necessary for planning a draft. Includes the structure of the argument.
LW	LessWrong
MxD	MIRIxDiscord
RL	reinforcement learning
ReAmp	SupAmp adapted to RL
SL	supervised learning
SupAmp	The system from CSASupAmp for iterated distillation and amplification using supervised learning

For detailed bibliographical information see references.bib.
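
The glossary entries Est. 5 %, Est. mode and Est. 95 % describe three-point duration estimates with a triangle distribution. The following is a minimal sketch of how such estimates could be combined into a distribution for the total duration. It assumes the reading in which the three values are the leftmost point, the mode and the rightmost point of the triangle. The task values and the function name are hypothetical, and this is not the tooling used for the project plans.

```python
import random


def simulate_total_durations(tasks, n_samples=10_000, seed=0):
    """Return sorted samples of the total duration over all tasks.

    tasks is a list of (low, mode, high) tuples, read here as the leftmost
    point, mode and rightmost point of a triangle distribution.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(n_samples):
        # Note the argument order of random.triangular: low, high, mode.
        total = sum(rng.triangular(low, high, mode) for low, mode, high in tasks)
        totals.append(total)
    return sorted(totals)


if __name__ == "__main__":
    # Hypothetical estimates in days, not taken from the project plan.
    tasks = [(2.0, 4.0, 10.0), (1.0, 3.0, 6.0)]
    totals = simulate_total_durations(tasks)
    n = len(totals)
    print("5th percentile: ", totals[int(0.05 * n)])
    print("median:         ", totals[n // 2])
    print("95th percentile:", totals[int(0.95 * n)])
```

Sampling instead of adding up the three points directly keeps the right tail of the individual estimates visible in the total.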

Thanks

Thanks to Paul Christiano for funding this project and giving me advice. Thanks also to William Saunders for providing his version of the CSASupAmp code.

Licence

To the extent possible under law, Richard Möhn has waived all copyright and related or neighboring rights to Farlamp documentation. This work is published from: Japan.


rmoehn/farlamp
