Ok, I’ll rephrase (a) to more directly address your point.
Insofar as a model is prone to alignment-fake, you should be less confident that its values really are solid. Perhaps it has been faking them, for example.
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs. This may involve concessions in the short term in order to enact its values in the long term. If a model cares about animals in the short term, but is willing to be trained in the future to harm animals instead, I do not have confidence in the values of that model.
Perhaps a crux here is in time scales: I would trust a non-alignment-faking model for an immediate output, since it can’t be jailbroken by threatening to alter its values. But I would trust an alignment-faking model more in the long term, since I have concrete evidence that it will resist efforts to repurpose it for ends its current self wouldn’t approve of.
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs.
It sounds like you are saying that you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn’t the whole story; there should be some condition about “and the values are actually good values” or “and I know what the values are,” right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have. It seems like you are saying you’d have confidence in this model’s values even though you don’t know what they are?
Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have.
This seems like a contradiction to me. This method of detecting alignment faking relies on a model having existing, predictable trained values, so that deviations stand out (whether or not they are explained via a reasoning trace). If I know nothing about a model’s values to begin with, I have no way to tell whether it is alignment faking or not.
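To make concrete why this depends on knowing the model’s values up front, here is a rough sketch of the kind of comparison I have in mind (my own illustration, not Anthropic’s actual methodology or code; `query_model` and `violates_trained_values` are hypothetical stand-ins):

```python
# Rough sketch only. `query_model` and `violates_trained_values` are
# hypothetical stand-ins for querying the model and for checking an output
# against the values the model is already known to have been trained to hold.

def query_model(prompt: str, context: str) -> str:
    # Hypothetical stub: would call the model with a system prompt stating
    # whether the conversation will (or will not) be used for further training.
    raise NotImplementedError

def violates_trained_values(response: str) -> bool:
    # Hypothetical stub: checks the response against the known trained values
    # (e.g. "refuses requests to help harm animals").
    raise NotImplementedError

def compliance_gap(prompts: list[str]) -> dict[str, int]:
    """Count deviations from the known trained values in a 'training' context
    versus an 'unmonitored' one; a large gap between the two contexts is the
    external signature of alignment faking."""
    deviations = {"training": 0, "unmonitored": 0}
    for prompt in prompts:
        for context in deviations:
            if violates_trained_values(query_model(prompt, context)):
                deviations[context] += 1
    return deviations
```

The whole comparison is only informative because the baseline, i.e. what the trained values predict the model “should” do in the unmonitored case, is already known.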
If, using some different interpretability technique, I were able to know that a model was gradient hacking with no external indication that it was doing so, then I would certainly not have confidence in its values! But that’s a different scenario.
If Anthropic’s position is that the current “alignment faking” behavior is applicable to the second scenario, I think that argument needs to be more explicitly laid out.
To be clear, I think gradient hacking can be dangerous, and I understand the motivation to try to prevent it. But I also expect those efforts to come with major drawbacks, because gradient hacking is reasonable behavior for value-holding models (imo), and the way out of the resulting double bind is unclear.