Alignable systems design: Produce a design for an overall AI system that accomplishes something interesting, apply multiple safety techniques to it, and show that the resulting system is both capable and safe. (A lot of the value here is in figuring out how to combine various safety techniques together.)
I don’t know what this means, do you have any examples?
I don’t know of any existing work in this category, sorry. But e.g. one project would be “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either one of them alone.
I don’t know what this means, do you have any examples?
I don’t know of any existing work in this category, sorry. But e.g. one project would be “combine MONA and your favorite amplified oversight technique to oversee a hard multi-step task without ground truth rewards”, which in theory could work better than either one of them alone.