Link should go to arxiv.
Still reading the paper, but it seems like your main point is that if oversight is “meaningful,” then it should be able to stop bad behavior before it actually gets executed (it might fail, but failures should be somewhat rare). And that we don’t have “meaningful oversight” of high-profile models in this sense (and especially not of the systems built on top of these models, considered as a whole) because they don’t catch bad behavior before it happens.
Instead we have some weaker category of thing that lets the bad stuff happen, waits for the public to bring it to the attention of the AI company, and then tries to stop it.
Is this about right?
Thanks—link fixed.
On your summary, that is not quite the main point. The paper makes a few different points, and they aren't limited to frontier models. Overall, based on the definitions we give, basically no one is doing oversight in a way the paper would call sufficient, for almost any application of AI. Where oversight is being done, it's not made clear how it works, which failure modes it is meant to address, or what is done to mitigate the known issues with the different methods. As I said at the end of the post, if they are doing oversight, they should be able to explain how.
For frontier models, we can't even clearly list the failure modes we should be protecting against in a way that would let a human watching the behavior be sure whether something qualifies. And that's not even getting into the fact that there is no attempt at human oversight; at best they are doing automated oversight of the vague set of things they want the model to refuse. But yes, as you pointed out, even their post-hoc reviews, to the extent those count as oversight, are nontransparent, if they occur at all, and the remediation when the public shows them egregious failures, like sycophancy or deciding to be MechaHitler, is largely further ad-hoc adjustment.