A superintelligent AI will be better at breaking evals than humans, so I expect there is a big gap between “our top researchers have tried and failed to find any loopholes in our alignment evals” and “a superintelligence will not be able to find any loopholes.”
Agreed. This is why this plan is ultimately just a tool to get good/useful theory work done faster, more efficiently, etc.