Some of my proposals for empirical AI safety work.
I sometimes write proposals for empirical AI safety work without publishing them (as the docs are somewhat rough or hard to understand). But I thought it might be useful to at least link my recent project proposals publicly in case someone finds them useful.
Decoding opaque reasoning in current models
Safe distillation
Basic science of teaching AIs synthetic facts[1]
How to do elicitation without learning
If you’re interested in empirical project proposals, you might also be interested in “7+ tractable directions in AI control”.
[1] I’ve actually already linked this (and the proposal in the next bullet) publicly from “7+ tractable directions in AI control”, but I thought it could be useful to link here as well.