An argument I find somewhat compelling for why the overhang would be increasing is: Our current elicitation methods would probably fail for sufficiently advanced systems, so we should expect the effectiveness of those methods to degrade at some point(s). This might happen suddenly, but it could also happen more gradually and continuously. A priori, the continuous version seems more natural to expect, if you expect most AI metrics to be continuous.
Normally, I think people would say, “Well, you’re only going to see that kind of a gap emerging if the model is scheming (and we think it’s basically not yet)”, but I’m not convinced it makes sense to treat scheming as such a binary thing, or to assume we’ll be able to catch the model doing it (since you can’t know what you’re missing, and even if you find it, it won’t necessarily show up with signatures of scheming rather than looking like some more mundane-seeming elicitation failure).
This is a novel argument I’m elaborating, so I don’t stand by it strongly, but I haven’t seen it addressed, and it seems like a big weakness of a lot of safety cases that they treat underelicitation as either 1) benign or 2) due to “scheming”, which is treated as binary and ~observable. “Roughly bound[ing]” an elicitation gap may be useful, but it is not sufficient for high-quality assurance: we’re in the realm of unknown unknowns (I acknowledge there are some subtleties around unknown unknowns I’m eliding, and that the OP is arguing that this is “likely”, which is a higher standard than “likely enough that we can’t dismiss it”).
But, GPT-5 reasoning on high is already decently expensive and this is probably not a terrible use of inference compute, so we don’t have a ton of headroom here.
As written, this is clearly false/overconfident. “Probably not a terrible use of inference compute” is a very weak and defensible claim. But drawing the conclusion that “we don’t have a ton of headroom” from it is completely unwarranted and would require making a stronger, less defensible claim.