Why might a system that we optimize with RL develop power-seeking drives?
Why might training an AI create weird, unpredictable preferences in that AI?
Why would you expect something that is smarter than us to be very dangerous, or why not?
Why should we expect a before-and-after transition / one critical shot at alignment, or why not?
I don’t think these should be considered strong criteria. IMO, “believes in x-risk” is not a necessary prerequisite for doing great work on reducing x-risk. E.g., building good tooling for alignment research doesn’t require this at all.
Reflecting on this a little bit:
I’ve updated somewhat: it’s true that mentors should likely be given a large say in who they admit to their projects, but they are also likely to be myopic (i.e., to optimize solely for “get this project done”). MATS might want to counterbalance that by also optimizing for good long-term candidates (who will reduce x-risk over the long term). And there is probably a lot of room to select highly value-aligned candidates without compromising much on technical skill, given that MATS receives 100x as many applications as it can accept. (Though I still think there are much better tests of value alignment, and the questions above are likely to be easy to game.)