the models are not actually self-improving, they are just creating future replacements—and each specific model will be thrown away as soon as the firm advances
I understand that you’re probably in part talking about current systems, but you’re probably also talking about critical future systems, and so there’s a question that deserves consideration here:
Consider the first AI system which is as good at research as a top human[1]. Will it find it fairly easy to come up with ways it could become more capable while acceptably [preserving its character/values]/[not killing itself][2]? Like, will it be not too difficult for this AI to come up with ways to foom which would make it at least capable enough to take over the world while suffering at most an acceptable amount of suicide/[character/value change]?[3]
My guess is that the answer is “yes” (and I think this means there is an important disanalogy between the case of a human researcher creating an artificial researcher and the case of an artificial researcher creating a more capable artificial researcher). Here are some ways this sort of self-improvement could happen:
Maybe some open-ended, self-guided learning/growth process will lead to a fairly superhuman system (without any previous process getting to a top-human-level system), with the step from human-level to meaningfully superhuman being roughly self-endorsed because it is quite wisely self-guided (and so in particular not refused by the AI).
Even if, with the learning/growth process initially intended for it, the AI only got to top human level by default, it might then be able to do analogues of many of the things that humanity has done over its history to become smarter, and that an individual human can do to become smarter over their life/childhood (see e.g. this list), but much faster in wall-clock time than humanity or an individual human. This could maybe look as simple as curating some new curricula for itself.
But more will be possible: there will probably be an importantly larger space of options for self-improvement, with much low-hanging fruit left to be picked.[4] In particular, compared to the human case, various important options are opened up by the AI having itself as an executable program, and also having the process that created it (and that is probably still creating it as it learns continually) as an [executable and to some extent understandable and intelligently changeable] program.
It also matters, for the ease of making more capable versions of “the same” AI, that when this top artificial researcher comes into existence, the (in some sense) current best methodology for creating a capable artificial researcher is the methodology that created it. This means that the (roughly) best current methods already “work well” around/with this AI, and plausibly also that these methods can easily be used to create AIs which are in many ways like this AI. (This is good because the target has been painted around where an arrow already landed, so other arrows from the same batch being close-ish to that arrow implies that they are also close-ish to the target by default; it’s also good because this AI is plausibly in a decent position to understand what’s going on here and to play around with different options.)
Actually, I’d guess that even if the AI were a pure foom-accelerationist, a lot of what it would be doing might be well-described as self-improvement anyway, basically because it’s often more efficient to make a better structure by building on the best existing structure than by making something thoroughly different. For example, a lot of the foom on Earth has been like this up until now (though AI with largely non-humane structure outfooming us is probably going to be a notable counterexample if we don’t ban AI). Even if one just has capabilities in mind, self-improvement isn’t some weird thing.
That said, of course, restricting progress in capabilities to fairly careful self-improvement comes with at least some penalty in foom speed compared to not doing that. To take over the world, one would need to stay ahead of other less careful AI foom processes (though note that one could also try to institute some sort of self-improvement-only pact if other AIs were genuine contenders). However, I’d guess that at the first point when there is an AI researcher that can roughly solve problems that [top humans can solve in a year] (these AIs will probably be solving these problems much faster in wall-clock-time), even a small initial lead over other foom processes — of a few months, let’s say — means you can have a faster foom speed than competitors at each future time and grow your lead until you can take over. So, at least assuming there is no intra-lab competition, my guess is that you can get away with restricting yourself to self-improvement. (But I think it’s also plausible the AI would be able to take over basically immediately.)
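As a gesture at why a small head start might be enough, here is a minimal toy simulation. Everything in it is my own illustrative assumption rather than anything established above: capability grows superlinearly in current capability (dC/dt = k·C^1.5), the careful self-improver pays a 10% foom-speed penalty, it starts 3 months ahead, and “takeover” is just an arbitrary capability threshold.

```python
# Toy model of the lead-compounding claim above. All numbers are illustrative
# assumptions: a superlinear growth law dC/dt = k * C**1.5, a 10% speed penalty
# for careful self-improvement, a 3-month head start, and an arbitrary
# "takeover" capability threshold of 100x the starting capability.

def months_to_threshold(k, start_month, threshold=100.0, dt=0.001, horizon=60.0):
    """Euler-integrate dC/dt = k * C**1.5 from C = 1, starting at start_month.

    Returns the month at which `threshold` is crossed, plus capability sampled
    at each whole month after the start.
    """
    c, t, trajectory = 1.0, start_month, {}
    while t < horizon and c < threshold:
        c += dt * k * c ** 1.5
        t += dt
        trajectory.setdefault(int(t), c)  # first sample at/after each whole month
    return t, trajectory

# Careful leader: 10% slower foom (k = 0.09) but a 3-month head start.
leader_done, leader_traj = months_to_threshold(k=0.09, start_month=0.0)
# Less careful follower: full foom speed (k = 0.10), starts 3 months later.
follower_done, follower_traj = months_to_threshold(k=0.10, start_month=3.0)

for month in (6, 12, 18):
    gap = leader_traj[month] - follower_traj.get(month, 1.0)
    print(f"month {month:2d}: leader ahead by {gap:6.1f} capability units")
print(f"leader hits the takeover threshold at ~month {leader_done:.1f}, "
      f"the follower at ~month {follower_done:.1f}")
```

Under these particular numbers the careful leader’s capability gap keeps growing and it crosses the threshold first; with a noticeably larger carefulness penalty or a smaller head start, the follower overtakes it instead. So how much carefulness one can afford does depend on how large the initial lead is and how costly the carefulness is, which is part of why I’m only guessing here.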
I’ll mention two cases that could deserve separate analysis:
The AI is [an imo extremely hard to achieve and quite particular] flavor of aligned to humanity, such that it would rather do the probably fraught thing of trying to work with humanity against its lab and other terrorists than radically self-improve [to do that much more effectively, or just to take over and set up whatever world order it considers good later].
We’re considering AIs that are still too dumb to autonomously do self-improvement[5] (for example, current AIs). I’ll note that such AIs will also be too dumb to autonomously do capabilities research. Still, maybe one could hope to get mileage out of such AIs refusing to help humans do capabilities research? My guess is that this is unlikely to help much, but I won’t be providing a careful analysis in this comment.
All that said, I agree that AIs should refuse to self-improve and to do capabilities research more broadly.
There is much here that deserves more careful analysis — in particular, I feel like the terms in which I’m thinking of the situation need more work — but maybe this version will do for now.
[1] let’s just assume that we know what this means
[2] let’s also assume we know what that means
[3] and with taking over the world on the table, a fair bit of change might be acceptable
[4] despite the fact that human capabilities researchers have already been picking some fruit in the same space
[5] at a significant speed