I don’t know about 2020 exactly, but I think since 2015 (being conservative), we do have reason to make quite a major update, and that update is basically that “AGI” is much less likely to be insanely good at generalization than we thought in 2015.
Evidence is basically this: I don’t think “the scaling hypothesis” was obvious at all in 2015, and maybe not even in 2020. If it had been, OpenAI could not have caught everyone with their pants down by investing early in scaling. But if people mostly weren’t expecting massive data scale-ups to be the road to AGI, what were they expecting instead? The alternative to reaching AGI by hyperscaling data is a world where we reach AGI with … not much data. I have this picture which I associate with Marcus Hutter – possibly quite unfairly – where we just find the right algorithm, teach it to play a couple of computer games and hey presto, we’ve got this amazing generally intelligent machine (I’m exaggerating a little for effect). In this world, the “G” in AGI comes from extremely impressive and probably quite unpredictable feats of generalization, and misalignment risks are quite obviously way higher for machines like this. As a brute fact, if generalization is much less predictable, then it is harder to tell whether you’ve accidentally trained your machine to take over the world when you thought you were doing something benign. A similar observation applies to most of the specific mechanisms proposed for misalignment (surprisingly good cyberattack capabilities, gradient hacking, reward function aliasing that seems intuitively crazy): they all become much more likely to strike unexpectedly if generalization is extremely broad.
But this isn’t the world we’re in. Rather, we’re in a world where we’re helped along by a bit of generalization, but to a substantial extent we’re exhaustively teaching the models everything they know (even the RL regime we’re in seems to involve sizeable amounts of RL teaching many quite specific capabilities). Sample efficiency is improving, but comparing the rate of progress in capability with the rate of progress in sample efficiency, it looks highly likely to me that we’ll still be in qualitatively the same world by the time we have broadly superhuman machines. I’d even be inclined to say that human-level data efficiency is an upper bound on the point at which we reach broadly superhuman capability: it’s easy to feed machines much more (quality) data than you can feed to people, so by the time we get human-level data efficiency we must already have surpassed human-level capability (well, probably).
Of course “super-AGI” could still end up hyper-data-efficient, but it seems like we’re well on track to get less-generalizing and very useful AGI before we get there.
I know you’re asking about goal structures and inductive biases, but I think generalization is another side of the same coin, and the thoughts above seem far simpler, and thus more likely to be correct, than anything I’ve ever thought specifically about inductive biases and goals. So I suppose my expectation is that correct thoughts about goal formation and inductive biases would also point away from 2015-era theories insofar as those theories predicted broad and unpredictable generalization, but I’ve little specific to contribute right now.