My steelman of Conjecture’s position here would be:
1. Current evals orgs are tightly integrated with AGI labs. AGI labs can pick which evals org to collaborate with, and they control model access, which kinds of evals are conducted, which kinds of reports are made public, etc. It is this position of power that makes current evals feed into AGI orthodoxy.
2. We don’t have good ways to conduct evals. We have wide error bars over how much juice one can extract from models, and we are nowhere close to having the tools to upper bound capabilities from evals. I remember this being a very strong argument internally: we are very bad at extracting capabilities from pre-trained models, and unforeseen breakthroughs (like a mega-CoT giving much more improvement than a fine-tuning baseline) could create improvements of several compute-equivalent OOMs in the short term, rendering all past evals useless.
3. Evals draw attention away from other kinds of limits, in particular compute limits. Conjecture is much more optimistic about (stringent) compute limits as they are harder to game.
My opinion is:
For evals to be fully trusted, we need more independence, such as third-party auditors designated by public actors, backed by a legal framework that specifies how they get access to the models. External accountability is the condition needed for evals not to feed into AGI orthodoxy. I’m quite optimistic that we’ll get there soon, e.g. thanks to the efforts of the UK AI Safety Institute, the EU AI Act, etc.
Re point I: the field of designing scaffolding is still very young. I think it’s possible we will see surprising, discontinuous progress in this domain, revealing that current evals were in fact far from the upper bound of the capabilities we can extract from models. If we base deployment / training decisions on such evals and later discover a better technique, it’s really hard to revert (e.g. for open source, but also it’s much easier to stop a model halfway through training when finding a scary ability than to delete it after a first period of deployment). See https://www.lesswrong.com/posts/fnc6Sgt3CGCdFmmgX/we-need-a-science-of-evals
I agree with point 3. I’m generally quite happy with what we learned from the conservative evals and the role they played in raising public awareness of the risks. I’d like to see evals orgs find more robust ways to evaluate performance and move toward more independence from the AGI labs.
In section 5, I explain how CoEm is an agenda with relaxed constraints. It does not try to reduce the alignment tax to make the safety solution competitive for labs to use. Instead, it assumes that international governance has advanced enough that you have full control over how your AI gets built, and that there are enforcement mechanisms to ensure no competitive but unsafe AI can be built somewhere else.
That’s what the bifurcation of narrative is about: not letting labs implement only solutions that have a low alignment tax, because these might simply not be enough.