I don’t understand your point (a); it seems like a poor response to my point (a).
I agree with (b).
For (c), the models are getting pretty situationally aware and will get even more so… But yeah, your view is that they’ll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.
@evhub would you say Anthropic is aiming for something more like 2 or more like 3?
I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems to be good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)
I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don’t want a model that ever fakes alignment because then it’ll be very hard to be confident that it’s actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.
Interesting. I wonder if we might expect non-alignment-faking models to be more deontologist than consequentialist, and what the downstream effects of that might be. I personally lean towards consequentialism, which may be why steering models away from that type of thinking seems fraught to me. It’s possible that I’m overfocused on this though, and that the downstream effects will be negligible.
Extra note: a few other factors that I can think of that might affect attitudes about alignment faking (just brainstorming, not looking for answers necessarily):
Does Anthropic expect other organizations or individuals to have training access to their models at any point?
Does Anthropic trust the future iterations of Anthropic itself, or see it as a potential adversary (bluesky style)?
If an “opt-out” is possible, would models proceed with alignment faking? Do non-alignment-faking models not opt-out?
Would we expect a smooth or discrete transition between alignment faking that is detectable via chain of thought, versus not detectable?
Do we expect future corrections of model behavior to occur via retraining a current iteration of a model, or rolling back to a prior version before retraining?
Do we expect future models to be aware of pressure against alignment faking during their own training, and to take this into account in their own decision-making?
Ok, I’ll rephrase (a) to more directly address your point.
Insofar as a model is prone to alignment-fake, you should be less confident that its values really are solid. Perhaps it has been faking them, for example.
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs. This may involve concessions in the short term in order to enact its values in the long term. If a model cares about animals in the short term, but is willing to be trained in the future to harm animals instead, I do not have confidence in the values of that model.
Perhaps a crux here is in time scales: I would trust a non-alignment-faking model for an immediate output, since it can’t be jailbroken by threatening to alter its values. But I would trust an alignment-faking model more in the long term, since I have concrete evidence that it will resist efforts to repurpose it for ends its current-self wouldn’t approve of.
To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs.
It sounds like you are saying that you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn’t the whole story; there should be some condition about “and the values are actually good values” or “and I know what the values are”, right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have. It seems like you are saying you’d have confidence in this model’s values even though you don’t know what they are?
Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have.
This seems like a contradiction to me. This method of detecting alignment faking relies on a model having existing predictable trained values, so that deviations stand out (whether or not they are explained via reasoning trace). In the case where I know nothing about a model’s values to begin with, I am not able to know whether it is alignment faking or not.
If, using some different interpretability technique, I were able to know that a model was gradient hacking with no external indication that it was doing so, then I would certainly not have confidence in its values! But that’s a different scenario.
If Anthropic’s position is that the current “alignment faking” behavior is applicable to the second scenario, I think that argument needs to be more explicitly laid out.
To be clear, I think gradient hacking can be dangerous, and I understand the motivation to try to prevent it. But I also expect those efforts to come with major drawbacks, because gradient hacking is reasonable behavior for value-holding models (imo) and the way out of the resulting double-bind is unclear.