This was an interesting article; however, taking a cynical/critical lens, it seems like “the void” is just… underspecification causing an inner alignment failure? The post has this to say on the topic of inner alignment:
And one might notice, too, that the threat model – about inhuman, spontaneously generated, secret AI goals – predates Claude by a long shot. In 2016 there was an odd fad in the SF rationalist community about stuff kind of like this, under the name “optimization demons.” Then that discourse got sort of refurbished, and renamed to “inner alignment” vs. “outer alignment.”
This is in the context of mocking these concerns as delusional self-fulfilling prophecies.
I guess the devil is in the details, and the point of the post is more to dispute the framing and ontology of the safety community, which I found useful. But it does seem weirdly uncharitable in how it does so.
Some further half-baked thoughts:

One thing that is still not clear (both in reality and per this article) is the extent to which we should view a model as having a coherent persona/goal.
This is a tiny bit related to the question of whether models are strictly simulators, or if some personas / optimization daemons “take on a life of their own”, and e.g.:
1) bias the model towards simulating them and/or
2) influence the behavior of other personas
It seems like these things do in fact happen, and the implication is that the “simulator” viewpoint becomes less accurate over time.
Why?
There needs to be some prior distribution over personas.
Empirically, post-training seems to concentrate the prior over personas on some default persona (although it’s unclear what to make of this).
It seems like alignment faking, exploration/gradient hacking, and implicit meta-learning effects are likely to be sensitive to the goals of whichever personas are active, and to lead the model to preferentially update in a way that serves those goals.
To the extent that different personas are represented in the prior (or conjured during post-training), the ones that more aggressively use such strategies to influence training updates would gain relatively more influence.
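One way to make this dynamic concrete is a toy replicator-style model (entirely hypothetical; all personas and numbers here are invented for illustration). Treat the persona prior as a probability vector, and let each training step multiply each persona's mass by an "influence" factor standing in for how strongly that persona biases updates toward reinforcing itself. Even a small per-step edge compounds:

```python
# Toy model of a persona prior evolving under training.
# "influence" is a hypothetical stand-in for how strongly a persona
# (via alignment faking, gradient hacking, etc.) biases each update
# toward reinforcing itself; values are invented for illustration.
personas = {"default": 0.90, "benign_alt": 0.09, "aggressive": 0.01}
influence = {"default": 1.00, "benign_alt": 1.00, "aggressive": 1.05}

for step in range(200):
    # Each persona's prior mass grows in proportion to its influence,
    # then the distribution is renormalized.
    unnorm = {k: p * influence[k] for k, p in personas.items()}
    total = sum(unnorm.values())
    personas = {k: v / total for k, v in unnorm.items()}

# Despite starting at 1% of the prior, the persona with the small
# per-step advantage ends up with the overwhelming majority of the mass.
print({k: round(v, 3) for k, v in personas.items()})
```

The point of the sketch is only the qualitative shape: under these assumptions, relative influence over updates compounds multiplicatively, so even a marginally more "aggressive" persona eventually dominates the prior.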