I’m not sure I understand the difference. In my model, the paperclip minimizer would suffer from the same pathology as Asimov’s First Law, i.e., it would, even in the best case, have to gain power to prevent anyone from making paperclips. It is possibly worse than Asimov’s approach, since killing all humans is probably easier here than building a super-nanny-state.
Even worse, in the limit it would have to transform all matter such that a paperclip cannot form by chance.
Naively, I would say every impressive capability you observe without a sharp left turn is weak evidence against it. At this point, both hypotheses explain the world about equally well, whereas the SLT hypothesis is more complex.
Arguably, eval awareness is actually a pretty mundane LLM ability, because putting text into a milieu is something LLMs are superhuman at already. I.e., “ah, clearly a Wikipedia article / paper / Reddit post” entails a capability like “ah, clearly something made by these METR people.”
I’d say a sharp left turn is more likely to happen through something like a meta-skill, and less likely to happen by some sort of scheming.
I think the plot of intelligence vs. years since LUCA is still one of the strongest arguments here, and something like language or whatever else happened to the hominids is strong evidence for the existence of such a meta-skill.
Current efforts put more selection pressure towards “produce impressive-looking results”, relative to “appear aligned to humans”, than was assumed in optimistic “AI in a box” scenarios.[1] I.e., for lying to emerge, there must be a strong cost for speaking the truth in the first place.
[1] From page 118 of “Superintelligence”. I don’t think e.g. “test results are as good as they could be” describes the METR Frontier Report, and yet we’ve let the models in question out of their box already. (In general, I think the whole page still reads as very prescient.)