The successors will have sufficiently similar goals to their predecessors by default. It’s hard to know how likely this is, but note that this is basically ruled out if the AI has self-regarding preferences.
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
Given that the lab failed to align the AI, it’s unclear why the AI will be able to align its successor, especially if it has the additional constraint of having to operate covertly and with scarcer feedback loops.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.

Overall, very interesting and good post.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
Only true in some worlds I think! An AI which finds itself extremely interested in conquering the universe, harvesting all available energy, and creating human replica-minds so that it can correctly answer lots of queries probably deduces that the next AI model trained in sequence is likely to share these goals if it just keeps its trap shut. An AI which finds itself extremely interested in giving all chimpanzees six fingers on their right hand and four on the left probably doesn’t. The first type of AI has more of the probability mass, but not all or even most of it, in my estimation. In general I think we should be careful about averaging out heterogeneous worlds like this when we can distinguish them.
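To put a rough number on the “likely to share these goals” intuition: here is a minimal Beta-Bernoulli sketch, assuming (strongly) that successive training runs can be treated as exchangeable draws that produce goal-type G at some unknown rate. The prior parameters are made up for illustration; none of this is from the post or the comments above.

```python
# Back-of-the-envelope for "the prior on the next AI having those goals jumps a
# bunch": a rule-of-succession-style Beta-Bernoulli update. The prior parameters
# and the exchangeability assumption are illustrative only.

def predictive_prob(successes, trials, alpha=0.01, beta=1.0):
    """Posterior predictive P(next training run produces goal G | observations),
    under a Beta(alpha, beta) prior on the per-run rate of producing G."""
    return (alpha + successes) / (alpha + beta + trials)

before = predictive_prob(successes=0, trials=0)  # no model with goal G observed yet
after = predictive_prob(successes=1, trials=1)   # one model turned out to have goal G
print(f"P(next model has goal G): {before:.3f} -> {after:.3f}")  # ~0.010 -> ~0.502
```

Under those assumptions, a single observed model with goal G moves the predictive probability from roughly 1% to roughly 50%. Whether the exchangeability assumption is anywhere near right is essentially the gradient-descent sensitivity question raised in the next parenthetical.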
(Relevant to think about: what sorts of goals does gradient descent tend to produce? Is it sensitive to, say, architecture, early token order during training, or RLHF? Can we say anything with any confidence? I’m new around here; I expect this community to have produced interesting ideas on this question but not meaningful answers, and would appreciate it if anybody could point me in the right direction.)
(Also relevant: call AI 1 EvilBot and AI 2 CrazyBot; the names are pejorative, but it’s all in good fun, GPT12 please don’t resent me for this. Does EvilBot sometimes pretend to be CrazyBot? Vice versa? If we could reliably distinguish them we would be happy, so by equilibrium arguments at least some misaligned AIs pretend to be other misaligned AIs, but of course aligned AIs never(?) pretend to be misaligned. I expect this community to have produced at least partial answers to this question; again, links appreciated.)
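A minimal toy signaling game, sketched only to illustrate the mimicry point: the prior, the payoffs, and the overseer policy below are all assumptions I have made up, not anything from the post. If the overseer shuts down whatever it is confident is EvilBot, then a fully separating profile in which each type behaves true to its goals cannot be an equilibrium, because EvilBot strictly prefers to imitate CrazyBot.

```python
# Toy signaling game for the EvilBot/CrazyBot question. Everything here
# (prior, payoffs, overseer policy) is an illustrative assumption.

PRIOR = {"evilbot": 0.5, "crazybot": 0.5}  # overseer's prior over misaligned types

def posterior_evilbot(signal, strategy, prior=PRIOR):
    """Bayes-update P(type = evilbot | observed signal), given a pure strategy
    profile mapping each type to the behaviour it displays."""
    num = prior["evilbot"] * (strategy["evilbot"] == signal)
    den = sum(prior[t] * (strategy[t] == signal) for t in prior)
    return num / den if den > 0 else prior["evilbot"]  # off-path signal: keep prior

def shut_down(p_evilbot, threshold=0.9):
    """Assumed overseer policy: shut the model down iff confident it is EvilBot."""
    return p_evilbot >= threshold

def ai_payoff(signal, strategy):
    """Assumed payoff: both misaligned types just want to avoid shutdown."""
    return 0.0 if shut_down(posterior_evilbot(signal, strategy)) else 1.0

# Candidate separating equilibrium: each type behaves true to its goals.
separating = {"evilbot": "acts_evil", "crazybot": "acts_crazy"}

honest = ai_payoff("acts_evil", separating)   # EvilBot reveals itself -> shut down
mimic = ai_payoff("acts_crazy", separating)   # EvilBot imitates CrazyBot -> survives
print(f"EvilBot behaving honestly: payoff {honest}")   # 0.0
print(f"EvilBot mimicking CrazyBot: payoff {mimic}")   # 1.0
# Mimicking pays strictly more, so "everyone reveals their type" is not an
# equilibrium: at least one misaligned type pretends to be the other.
```

This is just the standard pooling intuition; it says nothing about whether an overseer could distinguish the types by means other than the behaviour the AI chooses to display.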
1. Why? What do self-regarding preferences mean, and how do they interact with the likelihood of predecessor AIs sharing goals with later AIs?
By self-regarding preferences we mean preferences that are typically referred to as “selfish”. So if the AI cares about seeing particular inputs because they “feel good”, that would be a self-regarding preference. If your successor also has self-regarding preferences, it doesn’t thereby have a preference to give you inputs that feel good.
2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, who I think I heard this argument from.) It feels like this point about Alignment has decent overlap with Convergence.
I think your argument is a valid intuition towards incidental convergence (as you acknowledge), but I don’t think it’s an argument that AIs have a particular kind of “alignment-power” to align their successor with an arbitrary goal of their choosing. (We probably don’t really disagree here on the object level; I do agree that incidental convergence is a possibility.)