Kaarel comments on Alignment will happen by default. What’s next?

Kaarel 26 Nov 2025 4:05 UTC
LW: 38 AF: 16
2
AF
(For context: My guess is that by default, humans get disempowered by AIs (or maybe a single AI) and the future is much worse than it could be, and in particular is much worse than a future where we do something like slowly and thoughtfully growing ever more intelligent ourselves instead of making some alien system much smarter than us any time soon.)

Given that you seem to think alignment of AI systems with developer intent happens basically by default at this point, I wonder what you think about the following:
- Suppose that there were a frontier lab whose intent were to make an AI which would (1) shut down all other attempts to make an AGI until year 2125 (so in particular this AI would need to be capable enough that humanity (probably including its developers) could not shut it down), (2) disrupt human life and affect the universe roughly as little as possible beyond that, and (3) kill itself once its intended tenure ends in 2125 (and not leave behind any successors etc, obviously). Do you think the lab could pull it off pretty easily with basically current alignment methods and their reasonable descendants and more ideas/methods “drawn from the same distribution”?
(The point of the hypothetical is to investigate the difficulty of intent alignment at the relevant level of capability, so if it seems to you like it’s getting at something quite different, then I’ve probably failed at specifying a good hypothetical. I offer some clarifications of the setup in the appendix that may or may not save the hypothetical in that case.)

My sense is that humanity is not remotely on track to be able to make such an AI in time. Imo by default, any superintelligent system we could make any time soon would minimally end up doing all sorts of other stuff and in particular would not follow the suicide directive.

If your response is “ok maybe this is indeed quite cursed but that doesn’t mean it’s hard to make an AI that takes over and has Human Values and serves as a guardian who also cures cancer and maybe makes very many happy humans and maybe ends factory farming and whatever” then I premove the counter-response “hmm well we could discuss that hope but wait first: do you agree that you just agreed that intent alignment is really difficult at the relevant capability level?”.

If your response is “no this seems pretty easy actually” then I should argue against that but I’m not going to premove that counter-response.

Appendix: some clarifications on the hypothetical
- I’m happy to assume some hyperparams are favorable here. In particular, while I want us to assume the lab has to pull this off on a timeline set by competition with other labs, I’m probably happy to grant that any other lab that is just about to create a system of this capability level gets magically frozen for like 6 months. I’m also happy to assume that the lab is kinda genuinely trying to do this, though we should still imagine them being at the competence/carefulness/wisdom level of current labs. I’m also happy to grant that there isn’t some external intervention on labs (eg from a government) in this scenario.
- Given that you speak really positively about current methods for intent alignment, I sort of feel like requiring that the hypothetical bans using models for alignment research? But we’d probably want to allow using models for capabilities research because it should be clear that the lab isn’t handicapped on capabilities compared to other labs, and then idk how to cleanly operationalize this because models designing next AIs or self-improving might naturally be thinking about values and survival (so alignment-y things) as well… Anyway the point is that I want the question to capture whether our current techniques are really remotely decent for intent alignment. Using AIs for alignment research seems like a significantly different hope. That said, the version of this hypothetical where you are allowed to try to use the AIs you can create to help you is also interesting to consider.
- We might have some disagreement around how easy it will be for anyone to take over the world on the default path forward. Like, I think some sort of takeover isn’t that hard and happens by default (and seems to be what basically all the labs and most alignment researchers are trying to do), but maybe you think this is really hard and it’d be really crazy for this to happen, and in that case you might think this makes it really difficult for the lab to pull off the thing I’m asking about. If this is the case, then I’d probably want to somehow modify the hypothetical so that it better focuses our attention on intent alignment on difficult open-ended things, and not on questions about how large capability disparities will become by default.
- Thomas Kwa 27 Nov 2025 6:14 UTC
  LW: 5 AF: 3
  2
  AF Parent
  Seems difficult for three reasons:
  1. Many other companies are trying to build an AGI so a company trying to do this would have to first win the race and disempower everyone else; this means they will have fewer resources for safety
  2. The AI will have to act antagonistically to humanity, so it could not be at all corrigible
  3. The AI will need to act correctly until 2125, which is much farther out of distribution than any AI we have observed
  Given these difficulties I’d put it below ⁵⁰⁄₅₀, but this challenge seems significantly harder than the one I think we will actually face, which is more like having an AI we can defer to that doesn’t try to sabotage the next stage of AI research, for each stage until the capability level that COULD disempower 2025 humanity, plus maybe other things to keep the balance of power stable.
  Also I’m not sure what “drawn from the same distribution” means here, AI safety is trying dozens of theoretical and empirical directions, plus red teaming and model organisms can get much more elaborate with impractical resource investments, so things will look very qualitatively different in a couple of decades.
- Alex Mallen 27 Nov 2025 18:15 UTC
  LW: 1 AF: 1
  0
  AF Parent
  I think this hypothetical identifies a crux and my take is that it is quite technologically doable. It might even be doable by US with current technology, but my main worry is that people will make bad decisions.
  I’m less sure whether an individual frontier lab could do it.
  
  Note that the AI can be corrigible to its developers—this isn’t in tension with subverting other projects. It doesn’t need to be a sovereign—it can be guided by human input somewhat like today. I’m not confident that alignment to this target will ~continue to be relatively easy but this seems like a highly plausible trajectory.
- Adrià Garriga-alonso 26 Nov 2025 5:39 UTC
  0 points
  −3
  AF Parent
  Kind of agree with first paragraph, but I think it’s for economic outcompetition reasons and not for intent misalignment reasons. Honestly have no clue what to do about that.
  
  Re your bullet point, I’m inclined to bite it. Yes, alignment wise, I think the lab could pull this off, though there is ~5% probability of failure which would have potentially huge stakes here. Claude would shut itself down. (This is an empirical roleplay that we perhaps should test.)
  
  I don’t think this is possible because nobody can possibly have a DSA this big, and worse, permanent even after the big AI is gone.
  
  I think some sort of takeover isn’t that hard and happens by default (and seems to be what basically all the labs and most alignment researchers are trying to do)
  
  Bleak. I tend to disagree for the same reasons as previous paragraph: having a DSA that large is hard when everyone else also has AI. But I think oligarchy is fairly likely and gradual disempowerment a problem.

Kaarel comments on Alignment will happen by default. What’s next?

Appendix: some clarifications on the hypothetical