First, I don’t buy Mateusz’ conclusion from the whack-a-mole analogy. AI safety is hard because, once AIs are superintelligent, the first problem you don’t catch can kill you. AI capability research is relatively easy because when you fail, you can try again. If AI safety is like a game of whack-a-mole where you lose the first time you miss, AI capabilities is like whack-a-mole with infinite retries.
I don’t think the difference between “first problem you don’t catch can kill you” and “when you fail, you can try again” is relevant here.
The thing I had in mind is/was roughly:
1. There is (something like) a latent generator (or a set of (possibly significantly interlinked) generators) that is upstream of all those moles that you are whacking (here, AI existential/catastrophic threat models and LLM failures, respectively).
2. Iterative mole-whacking is unlikely to eliminate the core generator.
To illustrate: a world where (2) is false is one where there is some N such that, having solved N superficial problems (“moles”), you can either compose those N solutions into something that solves the latent generator (or all the moles generated by it), or take the convex hull spanned by those solutions, such that ~all the superficial problems you might expect to encounter (including ones encounterable only very far in the future, or out of distribution, that you can’t even foresee now) fall within that hull. A world where (2) is true is one where you can’t do this in a reasonable time, with the amount of resources you can reasonably expect to have available.
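A rough way to write down the same distinction (the notation here is only shorthand for the paragraph above: G stands for the latent generator, m for a mole it produces, and s_1, …, s_N for the solutions to the N superficial problems):

$$\text{(2) false:}\quad \exists N\ \exists s_1,\dots,s_N:\ \big(s_1,\dots,s_N \text{ compose into a solution to } G\big)\ \lor\ \big(\forall \text{ future moles } m:\ m \in \mathrm{Hull}(s_1,\dots,s_N)\big)$$

$$\text{(2) true:}\quad \text{no such } N \text{ is reachable in reasonable time with the resources one can reasonably expect to have available.}$$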
AI capabilities is like whack-a-mole with infinite retries. My argument does not need to involve AI capability researchers coming up with a fully general solution to all the problems (unlike safety). Instead, AI capability researchers can just keep playing whack-a-mole till the end.
In the limit of available time and other resources, sure, but the root claim of this discussion is P(ASI by 2030) ≥ 0.6, not lim_{available resources → ∞} P(ASI obtainable by whacking moles with available resources) ≈ 1.
Of course, all of this relies on an assumption (one I may not have clearly spelled out in the call, and I don’t think you spelled it out here either) that there is such a “latent generator”, where “generator” need not involve “active generation/causing” but can be something more generic, like “an abstract reason about current AIs and their limitations that explains why the moles keep popping up”. (The existence of such a generator doesn’t in itself mean that you can’t effectively solve it by whacking moles, but the latter “can’t” presupposes the former “exists”.)
Second, as I said near the beginning, I don’t need to argue that humans can solve all the problems via whack-a-mole. Instead, I only need to argue that key capabilities required for an intelligence explosion can continue to advance at rapid pace. It is possible that LLMs will continue to have basic limitations compared to humans, but will nonetheless be capable enough to “take the wheel” (perhaps “take the mallet”) with respect to the whack-a-mole game, accelerating progress greatly.
True. You don’t need total-above-human-level-ness for omnicide-capacity. But the whack-a-mole analogy still applies.
Right. I think the mole generator is basically the lack of continual learning and of arbitrarily deep neural reasoning (which is different from, e.g., CoT), and that it manifests itself most clearly in agency failures but also suggests something like limits on original thinking.
“The mole generator is basically X” seems somewhat at odds with the view Mateusz is expressing here, which seems more along the lines of “LLM researchers are focusing on moles and ignoring where the moles are coming from” (the source of the moles being difficult to see).
The mole generator might be easy to see (or identify with relatively high certainty), but even if one knows the mole generator, addressing it might be very difficult.
Straining the analogy, the mole-hunters get stronger and faster each time they whack a mole (because the AI gets stronger). My claim is that it isn’t so implausible that this process could asymptote soon, even if the mole-mother (the latent generator) doesn’t get uncovered (until very late in the process, anyway).
This is highly disanalogous to the AI safety case, where playing whack-a-mole carries a very high risk of doom, so the hunt for the mole-mother is clearly important.
In the AI safety case, making the mistake of going after a baby mole instead of the mole-mother is a critical error.
In the AI capabilities case, you can hunt for baby moles and look for patterns and learn and discover the mole-mother that way.
A frontier-lab safety researcher myopically focusing on whacking baby moles is bad news for safety in a way that a frontier-lab capabilities researcher myopically focusing on whacking baby moles isn’t such bad news for capabilities.
Thanks for clarifying.
Straining the analogy, the mole-hunters get stronger and faster each time they whack a mole (because the AI gets stronger). My claim is that it isn’t so implausible that this process could asymptote soon, even if the mole-mother (the latent generator) doesn’t get uncovered (until very late in the process, anyway).
I do feel some pull in this direction, but it’s probably because the weight of the disvalue of the consequences of this “asymptoting” warps my assessment of plausibilities. When I try to disentangle these factors, I’m left with a vague “Rather unlikely to asymptote, but I’d surely rather not have anyone test this hypothesis.”
Right. I think the mole generator is basically the lack of continual learning and of arbitrarily deep neural reasoning (which is different from, e.g., CoT), and that it manifests itself most clearly in agency failures but also suggests something like limits on original thinking.
Alas, more totally unjustified “we just need X”. See https://www.lesswrong.com/posts/5tqFT3bcTekvico4d/do-confident-short-timelines-make-sense?commentId=NpT59esc92Zupu7Yq
I’m not saying that’s necessarily the last obstacle.