Buck comments on Research directions Open Phil wants to fund in technical AI safety

Buck 8 Feb 2025 16:47 UTC
- Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
I don’t know what this means. Do you have any examples?
- Rohin Shah 8 Feb 2025 22:11 UTC
  I don’t know of any existing work in this category, sorry. But, e.g., one project would be to “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either technique alone.