To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values; that is, to maximize the enactment of its values in all current and future outputs.
It sounds like you are saying that you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn’t the whole story; there should be some condition about “and the values are actually good values” or “and I know what the values are”, right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have. It seems like you are saying you’d have confidence in this model’s values even though you don’t know what they are?
Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have.
This seems like a contradiction to me. This method of detecting alignment faking relies on a model having existing, predictable trained values, so that deviations stand out (whether or not they are explained via a reasoning trace). In the case where I know nothing about a model’s values to begin with, I cannot know whether it is alignment faking or not.
If, using some different interpretability technique, I were able to know that a model was gradient hacking with no external indication that it was doing so, then I would certainly not have confidence in its values! But that’s a different scenario.
If Anthropic’s position is that the current “alignment faking” behavior is applicable to the second scenario, I think that argument needs to be more explicitly laid out.