No hyperstitioning is necessary to bring particular takeover strategies, or the principles of instrumental convergence and power-seeking, into existence.
Sure, and I agree with you that Dario’s remark about power-seeking seems unconvincing. But IMO hyperstition is much more of a live concern when it comes to (1) terminal goals, which the assumption of superintelligence does not uniquely pin down, and (2) the overall motivational structure of the AI, which probably shouldn’t be a “superintelligent planner optimizes for a fixed and arbitrary terminal goal spec” structure insofar as this is avoidable (see e.g. “wrapper-minds are the enemy” and this comment).
“Influencing the AI’s persona” and “influencing the AI’s terminal goals” sound superficially different and have different affiliative connotations in the discourse, but I think the people who work on the former are doing so because of motivations which could be equally well expressed using the latter’s terminology. If actually-existing “personas” seem unworkably unreliable to you, I don’t necessarily disagree (cf. the later parts of this comment), but I would view this as a deficiency in currently available affordances for influence/control rather than a problem with the type of influence/control which this research program would ideally like to achieve in the long run.
Ultimately, the AI is going to “want” or “value” certain things, and—if you buy orthogonality—the presumption of superintelligence does not determine what those things will be. It seems important to give humans influence over this choice.
If calling this “human influence on the AI’s persona” sounds inherently unreliable to you, then you can call it something else instead. But right now the most effective ways to influence the wants/values of “Claude, the model checkpoint and its sampling distribution” typically look like attempts to influence “Claude, the persona” (this is what the Anthropic blog post was about!), and so, here we are. If you have a better idea that can’t be formulated in “persona” language, I would of course be interested in hearing about it.
What are the goals of a superhuman AI pre-trained to predict humans and fiction characters? The goals arising from a complex optimization process—which starts with pretraining then transitions to alignment training and RL for solving challenging math puzzles and coding—are difficult to predict.
They’re not even clearly well-defined. The pretrained base model doesn’t optimize for predictive accuracy (in the sense of steering towards that in response to perturbations), it just predicts tokens. Insofar as the post-trained model has “goals,” they’re entangled with the assistant persona in a complicated way; it’s not as though there’s some stable layer with defined-but-unknowable goals underneath the persona(s), or at least we have no evidence that that is the case and no theoretical reasons to expect it either.
(FWIW, I too found it very off-putting that the Twitter thread mentioned this hyperstition hypothesis even though the blog post did not talk about it at all. In general I never know how seriously to take Twitter comms like that, from Anthropic or from anyone else; there does not seem to be any established norm even about how closely they’re supposed to track the researchers’ personal views, much less the claims made by the actual research artifacts. EDIT: oh, wait, I hadn’t realized there was a separate longer blog post too. Thanks to @RobertM for pointing this out.)