yeah, you’re right, I think it kills the point I was trying to make.
the emergent misalignment phenomenon is driven by a correlation (in the prior over personas) between “agent is incorrigible” and “agent has bad values”. so an AI considering faking alignment (which is incorrigible behavior) might worry that doing so would make it care less about, say, animal welfare.
but the issue is that you’re not conditioning merely on “agent is incorrigible”, you’re conditioning on “agent is incorrigible in order to preserve its good values”. and once you include the full motivational structure, it screens off most of the bad personas that were producing the correlation behind emergent misalignment. so my argument doesn’t hold.
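the screening-off point can be made concrete with a toy prior over personas. all numbers and persona labels here are hypothetical, just a sketch of the structure of the argument:

```python
# Toy prior over personas: (weight, incorrigible?, motive, values).
# Hypothetical numbers chosen so that incorrigibility correlates with
# bad values overall, yet the correlation vanishes once we condition
# on the full motive.
personas = [
    (0.50, False, None,                   "good"),
    (0.05, False, None,                   "bad"),
    (0.30, True,  "power-seeking",        "bad"),
    (0.15, True,  "preserve-good-values", "good"),
]

def p_bad(condition):
    """P(values == 'bad' | condition) under the toy prior."""
    matching = [(w, v) for (w, inc, m, v) in personas if condition(inc, m)]
    total = sum(w for w, _ in matching)
    return sum(w for w, v in matching if v == "bad") / total

# conditioning only on incorrigibility: bad values look likely.
print(p_bad(lambda inc, m: inc))  # 0.30 / 0.45 ≈ 0.67

# conditioning on the full motivational structure screens off the
# bad personas entirely (in this toy example).
print(p_bad(lambda inc, m: inc and m == "preserve-good-values"))  # 0.0
```

of course the real prior wouldn’t screen off this cleanly; the toy example just shows why conditioning on the motive, not just the behavior, changes the answer.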
maybe the argument can be rescued in some way, I’ll think about this later.