Thanks for writing this. An attempt at an extension:
The likelihood of internally aligned (IAM), corrigibly aligned (CAM), and deceptively aligned (DAM) models also depends on the order in which different properties appear during the training process.
A very hand-wavy example:
- If the model starts to have a current objective while its understanding of the base objective is bad and its general reasoning / world modelling is also bad (so it can't run deceptive cognition), then SGD pushes the current objective toward the base objective → IAM or CAM
- If the model starts to have a current objective while its understanding of the base objective is bad but its general reasoning / world modelling is good, then:
  - SGD improves the model's understanding of the base objective (and keeps the unaligned current objective) → DAM
    - in this case we would not see very clear warning shots (i.e., instances of imperfect deception)
  - or SGD moves the current objective closer to the real base objective → IAM or CAM
  - or it does some combination of the two
  - I think DAM is more likely here than IAM or CAM, but I feel confused.
- If the model starts to have a current objective while its understanding of the base objective is good but its general reasoning is bad, then SGD improves the general reasoning → DAM
  - in this case we would see warning shots (imperfect deception)
- If the model starts to have a current objective while its understanding of the base objective is good and its general reasoning is good → DAM (assuming some deceptive line of reasoning is reachable)
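To make the case analysis above concrete, here is a toy Python sketch that encodes the four branches as a lookup. This is purely illustrative and everything in it is my own simplification: the two boolean predicates and the outcome strings stand in for properties we cannot actually measure in a real training run.

```python
# Toy formalization of the case analysis above. Purely illustrative:
# the predicates ("good"/"bad" understanding and reasoning) and the
# outcome labels are assumptions, not measurable quantities.

def likely_outcome(understands_base_objective: bool,
                   good_general_reasoning: bool) -> str:
    """Map the model's state at the moment a 'current objective'
    crystallizes to the outcome argued to be most likely."""
    if not understands_base_objective and not good_general_reasoning:
        # Case 1: no deceptive cognition available, so SGD moves the
        # current objective toward the base objective.
        return "IAM or CAM"
    if not understands_base_objective and good_general_reasoning:
        # Case 2: ambiguous; SGD may improve the model's picture of the
        # base objective (DAM) or move the objective itself (IAM/CAM),
        # or some combination.
        return "DAM more likely than IAM or CAM (uncertain)"
    if understands_base_objective and not good_general_reasoning:
        # Case 3: SGD improves reasoning around a fixed objective;
        # we would expect warning shots (imperfect deception).
        return "DAM (with warning shots)"
    # Case 4: all ingredients for deceptive cognition already present.
    return "DAM (if deceptive reasoning is reachable)"


for understands in (False, True):
    for reasons in (False, True):
        print(f"understands_base={understands!s:5}  "
              f"good_reasoning={reasons!s:5}  ->  "
              f"{likely_outcome(understands, reasons)}")
```

The interesting failure of this toy model is case 2, where the outcome is not a clean function of the inputs: SGD might improve the base-objective model, move the current objective, or do both at once.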
Knowing the order in which these properties develop during the training process would update me on the likelihood of deceptive alignment.