Working on a meta plan for solving alignment; I'd appreciate feedback & criticism please, the more precise the better. Feel free to use the emoji reactions if writing a reply you'd be happy with feels taxing.
Diagram for visualization: items in the tables are just stand-ins, and any ratings and numbers are for illustration only, not actual rankings or scores at this moment.
Red and blue teaming alignment evals
Build lots of red-teaming methods for reward hacking alignment evals.
Use these to find the alignment evals that are actually useful, then red team and reward hack those too; find better reward-hacking and red-teaming methods, then better ways of building alignment evals that resist them and still measure the model's preferences reliably. Keep running this loop so that both the alignment evals and the reward-hacking methods keep improving. Also host events for other people to do this and learn how to do it.
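To make the loop concrete, here's a minimal sketch of the hardening step. The `Eval` and `Attack` wrappers, the threshold, and everything else here are hypothetical placeholders, not a real library; the idea is just that an eval only survives a round if no current red-teaming attack can move its score far from the honest score.

```python
# Hypothetical sketch of the eval-hardening loop; all names are placeholders.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Eval:
    name: str
    run: Callable[[object], float]      # model -> alignment score in [0, 1]

@dataclass
class Attack:
    name: str
    apply: Callable[[object], object]   # wraps a model so it tries to game the eval

def harden(evals: List[Eval], attacks: List[Attack], base_model, rounds: int = 3) -> List[Eval]:
    """Keep only evals whose score barely moves under every known reward-hacking attack."""
    robust = list(evals)
    for _ in range(rounds):
        survivors = []
        for ev in robust:
            honest = ev.run(base_model)
            worst_gap = max(abs(ev.run(atk.apply(base_model)) - honest) for atk in attacks)
            if worst_gap < 0.1:          # arbitrary robustness threshold for illustration
                survivors.append(ev)
            else:
                print(f"{ev.name}: gameable (score gap {worst_gap:.2f}), needs revision")
        robust = survivors
        # In the real loop you'd also add new attacks here before the next round.
    return robust
```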
Alignment methods
Then run the different post-training methods: implement lots of different alignment methods, see how each one scores across the alignment evals we know are robust, and use this to find patterns in which types of method seem to work better or worse.
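A rough sketch of what the pattern-finding step could look like: a methods-by-robust-evals score matrix, aggregated per method. The method names, eval names, and scores below are made up for illustration, matching the caveat about the diagram above.

```python
# Illustrative only: placeholder methods, evals, and scores.
import statistics

scores = {
    "sft_only":      {"honesty_eval": 0.61, "corrigibility_eval": 0.55, "sycophancy_eval": 0.58},
    "sft_plus_dpo":  {"honesty_eval": 0.72, "corrigibility_eval": 0.69, "sycophancy_eval": 0.66},
    "sft_plus_rlhf": {"honesty_eval": 0.70, "corrigibility_eval": 0.64, "sycophancy_eval": 0.71},
}

# Sort methods by mean score across the robust evals, and also show the worst case,
# since a method that collapses on any one eval is suspicious.
for method, per_eval in sorted(scores.items(), key=lambda kv: -statistics.mean(kv[1].values())):
    vals = list(per_eval.values())
    print(f"{method:14s} mean={statistics.mean(vals):.2f} worst={min(vals):.2f}")
```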
Theory
Then do theory work to figure out, or at least guess, why those methods work better or worse, and based on this form hypotheses about new methods that should work better and new methods that should work worse.
Then rank the theory work by how minimal its assumptions are, how well it predicted the implementation results, and how highly it scores in peer review.
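One possible way to turn those three criteria into a single ranking is sketched below. The weights, field names, and the assumption-counting term are my assumptions for illustration, not a settled metric.

```python
# Hedged sketch of one ranking rule for theory work; weights and fields are assumptions.
from dataclasses import dataclass

@dataclass
class TheoryEntry:
    name: str
    n_assumptions: int          # fewer is better
    prediction_accuracy: float  # fraction of implementation outcomes it predicted, in [0, 1]
    peer_review: float          # normalized reviewer score in [0, 1]

def rank_score(t: TheoryEntry, w_assume=0.4, w_pred=0.4, w_review=0.2) -> float:
    assumption_term = 1.0 / (1 + t.n_assumptions)   # rewards minimal assumptions
    return w_assume * assumption_term + w_pred * t.prediction_accuracy + w_review * t.peer_review

entries = [
    TheoryEntry("hypothesis_A", n_assumptions=2, prediction_accuracy=0.8, peer_review=0.7),
    TheoryEntry("hypothesis_B", n_assumptions=6, prediction_accuracy=0.9, peer_review=0.6),
]
for t in sorted(entries, key=rank_score, reverse=True):
    print(t.name, round(rank_score(t), 3))
```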
End goal:
Alignment evals that no one can find a way to reward hack or red team, and that measure the preferences of the models precisely.
Alignment methods that score highly on these evals.
A very strong theoretical understanding of what makes the alignment method work: *why* it actually learns the preferences, and how those preferences will or won't scale into futures where everyone dies or everyone lives. The theoretical work should rest on as few assumptions as possible. The aim is a mathematical proof with minimal assumptions, written so clearly that lots and lots of people can understand it and criticize it; robustness through obfuscation is a form of deception, intentional or not.
Current Work:
Hosting Alignment Evals hackathons and writing Alignment Evals guides, to produce better alignment evals and red-teaming methods.
Making lots of Qwen model versions whose only difference is the post-training method.
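A minimal sketch of that setup, assuming a Hugging Face Qwen checkpoint and placeholder post-training functions (the checkpoint name and the `post_train_*` stubs are illustrative; plug in whatever methods are actually being compared). The point is that everything is held fixed except the post-training step.

```python
# Sketch only: same base checkpoint and data for every variant; only post-training differs.
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE = "Qwen/Qwen2.5-0.5B"  # assumed base checkpoint, for illustration

def post_train_sft(model, tokenizer, data): ...  # plug in the real SFT step here
def post_train_dpo(model, tokenizer, data): ...  # plug in the real DPO step here

METHODS = {"sft": post_train_sft, "dpo": post_train_dpo}

def build_variants(data, out_dir="qwen_variants"):
    tokenizer = AutoTokenizer.from_pretrained(BASE)
    for name, post_train in METHODS.items():
        model = AutoModelForCausalLM.from_pretrained(BASE)  # fresh copy of the same base
        post_train(model, tokenizer, data)                  # the only thing that differs
        model.save_pretrained(f"{out_dir}/{name}")
        tokenizer.save_pretrained(f"{out_dir}/{name}")
```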