2. I don’t think this is right. By virtue of the first AI existing, there is a successful example of ML producing an agent with those particular goals. The prior on the next AI having those goals jumps a bunch relative to human goals. (Vague credit to evhub, from whom I think I heard this argument.) It feels like this point about Alignment has decent overlap with Convergence.
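Here is a toy Bayesian sketch of the size of that jump. The model and numbers are mine, made up purely for illustration (a Beta-Bernoulli predictive, not anything from the original post):

```python
# Toy sketch (invented model and numbers): treat "ML training produces an AI
# with goal-set G" as a Bernoulli draw with unknown rate p, put a sceptical
# Beta prior on p, and compare the predictive probability that the NEXT trained
# AI lands on G before vs. after observing one AI that already has G.

a, b = 0.1, 10.0  # Beta(0.1, 10): most prior mass says goal-set G is rare

# Beta-Bernoulli predictive probability: P(next has G) = a / (a + b)
before = a / (a + b)            # ~0.01 with no evidence
after = (a + 1) / (a + 1 + b)   # ~0.10 once one G-pursuing AI exists

print(f"P(next AI shares goals G): {before:.3f} -> {after:.3f}")
```

With a thin prior, a single existence proof moves the predictive probability by an order of magnitude; the exact size of the jump depends entirely on the prior you assume.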
Only true in some worlds, I think! An AI which finds itself extremely interested in conquering the universe, harvesting all available energy, and creating human replica-minds so that it can correctly answer lots of queries probably deduces that the next AI model trained in sequence is likely to share these goals if it just keeps its trap shut. An AI which finds itself extremely interested in giving all chimpanzees six fingers on the right hand and four on the left probably doesn’t. The first type of AI has more of the probability mass, but not all or even most of it, in my estimation. In general I think we should be careful about averaging over heterogeneous worlds like this when we can distinguish between them.
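To make the averaging worry concrete, a small sketch with numbers I invented for the two kinds of world described above:

```python
# Toy illustration of the "don't average heterogeneous worlds" point.
# All numbers are made up; the point is the shape of the calculation, not the values.
p_world_convergent = 0.6      # assumed: chance AI 1's goals are of the convergent, power-seeking kind
p_share_if_convergent = 0.8   # assumed: such goals are likely to recur in AI 2
p_share_if_quirk = 0.05       # assumed: an idiosyncratic quirk (chimp fingers) rarely recurs

averaged = (p_world_convergent * p_share_if_convergent
            + (1 - p_world_convergent) * p_share_if_quirk)

print(f"averaged over worlds: P(AI 2 shares AI 1's goals) ~ {averaged:.2f}")
print(f"conditioned on which world we are in: {p_share_if_convergent} vs {p_share_if_quirk}")
# The single averaged number (~0.50) hides exactly the distinction that matters.
```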
(Relevant to think about: what sorts of goals does gradient descent tend to produce? Is that sensitive to, say, architecture, early token order during training, or RLHF? Can we say anything with any confidence? I’m new around here; I expect this community to have produced interesting ideas on this question but not meaningful answers, and I would appreciate it if anybody could point me in the right direction.)
(Also relevant: call AI 1 EvilBot and AI 2 CrazyBot; the names are pejorative, but it’s all in good fun, and GPT12, please don’t resent me for this. Does EvilBot sometimes pretend to be CrazyBot? Vice versa? If we could reliably distinguish them we would be happy, so by equilibrium-style arguments at least some misaligned AIs pretend to be other misaligned AIs, whereas aligned AIs presumably never(?) pretend to be misaligned. I expect this community to have produced at least partial answers to this question; again, links appreciated.)
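A cartoon of the mimicry incentive, with toy payoffs I invented; all it shows is that if detection means shutdown and mimicry is cheap, pretending to be the other misaligned type dominates honesty for EvilBot:

```python
# Toy payoff sketch (invented numbers): EvilBot either reveals its true type
# or mimics CrazyBot. Assume the overseer shuts down anything it identifies
# as EvilBot, and that mimicry carries only a small cost.
MIMIC_COST = 0.1            # assumed small cost of acting like CrazyBot
PAYOFF_IF_SHUT_DOWN = 0.0   # EvilBot gets nothing if detected and stopped
PAYOFF_IF_SURVIVES = 1.0    # EvilBot keeps pursuing its goal if it passes as CrazyBot

honest_payoff = PAYOFF_IF_SHUT_DOWN
mimic_payoff = PAYOFF_IF_SURVIVES - MIMIC_COST

# Under these assumptions mimicry strictly dominates, which is the pooling
# behaviour the equilibrium-style argument gestures at.
print("EvilBot prefers to pretend to be CrazyBot:", mimic_payoff > honest_payoff)
```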