At some point doesn’t the model just get good enough that you can’t tell its bad actions from its normal ones no matter how much you sample? Does retrying vs resampling still matter then, or is that a different problem than this paper is about?
Different problem! AI control in general is a stopgap measure, only applicable to models up to some yet-unknown capability level where it becomes intractable and we have to get our safety another way (such as alignment). But we hope retrying and resampling (and better control techniques in general) can increase the maximum capability level where control works.
At some point doesn’t the model just get good enough that you can’t tell its bad actions from its normal ones no matter how much you sample? Does retrying vs resampling still matter then, or is that a different problem than this paper is about?
Different problem! AI control in general is a stopgap measure, only applicable to models up to some yet-unknown capability level where it becomes intractable and we have to get our safety another way (such as alignment). But we hope retrying and resampling (and better control techniques in general) can increase the maximum capability level where control works.