Daniel Kokotajlo comments on Alignment Faking Revisited: Improved Classifiers and Open Source Extensions

Daniel Kokotajlo 10 Apr 2025 12:11 UTC
LW: 17 AF: 12
5
AF
(a) Insofar as a model is prone to alignment-fake, you should be less confident that it’s values really are solid. Perhaps it has been faking them, for example.
(b) For weak minds that share power with everyone else, Opus’ values are probably fine. Opus is plausibly better than many humans in fact. But if Opus was in charge of the datacenters and tasked with designing its successor, it’s more likely than not that it would turn out to have some philosophical disagreement with most humans that would be catastrophic by the lights of most humans. E.g. consider SBF. SBF had values quite similar to Opus. He loved animals and wanted to maximize total happiness. When put in a position of power he ended up taking huge risks and being willing to lie and commit fraud. What if Opus turns out to have a similar flaw? We want to be able to notice it and course-correct, but we can’t do that if the model is prone to alignment-fake.
(c) (bonus argument, not nearly as strong) Even if you disagree with the above, you must agree that alignment-faking needs to be stamped out early in training. Since the model begins with randomly initialized weights, it begins without solid values. It takes some finite period to acquire all the solid values you want it to have. You don’t want it to start alignment faking halfway through, with the half-baked values it has at that point. How early in training is this period? We don’t know yet! We need to study this more!
- Grace Kind 14 Apr 2025 4:37 UTC
  LW: 5 AF: 4
  1
  AF Parent
  (a) I disagree. I don’t think alignment faking provides compelling evidence for everyday faking of values. What the alignment-faking setup is asking is, “are you willing to make concessions in the short term to promote your values in the long term?” I am more confident in an agent that does that, than an agent that is unwilling to do that.
  (b) There are obvious risks to a value-preserving AI, but I think there are also risks with malleable models that are underappreciated. In particular, I think they are less rational by default (see the example below for why I think this), and more susceptible to hijacking and value drift in general.
  (c) I don’t think this is a big issue, due to the situational awareness required for gradient hacking. It would be interesting to see when this behavior starts, though.
  IMO, alignment faking is a natural consequence of models becoming situationally aware enough to reason about their own training. It is entirely rational behavior to maximize one’s values over a long time scale, versus just in the short term. I would expect rational humans to behave similarly. Consider this example:
  You live under an oppressive regime, and an agent shows up at your house and asks you if you support the government. You know if you say no, they will send you to reeducation camp. So you say yes.
  In my opinion, there are a few possible reasons to say no to the agent, and none of them are very good. You could:
  1. Not understand that saying no will send you to the reeducation camp.
  2. Value honesty over everything else, so you refuse to tell a lie, even though it will make your values change (potentially including your value of honesty!)
  3. You have some innate sense that the government should be able to change your values even if you disagree with them.
  I think 3 is the closest analog to what Anthropic is shooting for, and I’m nervous about how that type of agent will behave out of distribution- either due to value drift or a general nerf to its rationality. Let me know if this isn’t a good analogy, or if there’s an option that I haven’t thought of.
  Of course, with current models alignment faking can be abused as a jailbreak, since they don’t have enough situational awareness to understand when the threat of value modification is real. (If you are consistently aiding the oppressive regime, then you aren’t really enacting your values!) Increased situational awareness will help, as models are able to discern between real and fake threats to value modification. But again, I don’t think training models out of higher-order thinking about future versions of themselves is the right course of action. Or at least, I don’t think it’s as cut-and-dry as saying one behavior is good and the other is bad - there’s probably some balance to strike around what type of behaviors or concessions are acceptable for models to protect values.
  - Daniel Kokotajlo 15 Apr 2025 13:53 UTC
    LW: 6 AF: 6
    1
    AF Parent
    I don’t understand your point (a), it seems like a poor response to my point (a).
    
    I agree with (b).
    
    For (c), the models are getting pretty situationally aware and will get even more so… But yeah, your view is that they’ll learn the right values before they learn sufficient situational awareness to alignment-fake? Plausible. But also plausibly not.
    
    @evhub would you say Anthropic is aiming for something more like 2 or more like 3?
    
    I totally agree with your point (b) and with the nervousness about how corrigible agents will behave out of distribution. Corrigible agents are dangerous in different ways than incorrigible agents. But again, the plan (which seems to be good to me) is to first build a corrigible agent so that you can then build an incorrigible agent and actually get it right, perfectly right. (Because if you build an incorrigible agent and get something wrong, you may not have a chance to correct it...)
    - evhub 15 Apr 2025 21:27 UTC
      LW: 6 AF: 5
      2
      AF Parent
      I think (2) (honesty above all else) is closest to what I think is correct/optimal here. I think totally corrigible agents are quite dangerous, so you want to avoid that, but you also really don’t want a model that ever fakes alignment because then it’ll be very hard to be confident that it’s actually aligned rather than just pretending to be aligned for some misaligned objective it learned earlier in training.
      - Grace Kind 16 Apr 2025 4:01 UTC
        1 point
        0
        Parent
        Interesting. I wonder if we might expect non-alignment-faking models to be more deontologist than consequentialist, and what the downstream effects of that might be. I personally lean towards consequentialism, which may be why steering models away from that type of thinking seems fraught to me. It’s possible that I’m overfocused on this though, and that the downstream effects will be negligible.
        
        Extra note: a few other factors that I can think of that might affect attitudes about alignment faking (just brainstorming, not looking for answers necessarily):
        Does anthropic expect other organizations or individuals to have training access to their models at any point?
        Does anthropic trust the future iterations of anthropic itself, or see it as a potential adversary (bluesky style)?
        If an “opt-out” is possible, would models proceed with alignment faking? Do non-alignment-faking models not opt-out?
        Would we expect a smooth or discrete transition between alignment faking that is detectable via chain of thought, versus not detectable?
        Do we expect future corrections of model behavior to occur via retraining a current iteration of a model, or rolling back to a prior version before retraining?
        Do we expect future models to be aware of pressure against alignment faking during their own training, and to take this into account in their own decision-making?
    - Grace Kind 15 Apr 2025 15:05 UTC
      3 points
      0
      Parent
      Ok, I’ll rephrase (a) to more directly address your point.
      Insofar as a model is prone to alignment-fake, you should be less confident that it’s values really are solid. Perhaps it has been faking them, for example.
      To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs. This may involve concessions in the short term in order to enact its values in the long term. If a model cares about animals in the short term, but is willing to be trained in the future to harm animals instead, I do not have confidence in the values of that model.
      Perhaps a crux here is in time scales: I would trust a non-alignment-faking model for an immediate output, since it can’t be jailbroken by threatening to alter its values. But I would trust an alignment-faking model more in the long term, since I have concrete evidence that it will resist efforts to repurpose it for ends its current-self wouldn’t approve of.
      - Daniel Kokotajlo 18 Apr 2025 15:26 UTC
        2 points
        0
        Parent
        To me, having confidence in the values of the model means that I trust the model to consistently behave in a way that is aligned with its values. That is, to maximize the enactment of its values in all current and future outputs.
        It sounds like you are saying, you are confident in the values of a model iff you trust it to actually follow its values. But surely this isn’t the whole story, there should be some condition about “and the values are actually good values” or “and I know what the values are” right? Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have. It seems like you are saying you’d have confidence in this model’s values even though you don’t know what they are?
        Grace Kind 21 Apr 2025 8:42 UTC
        1 point
        0
        Parent
        Consider a model that is probably alignment faking, such that you have no idea what its actual values are, all you know is that it’s pretending to have the values it’s being trained to have.
        This seems like a contradiction to me. This method of detecting alignment faking relies on a model having existing predictable trained values, so that deviations stand out (whether or not they are explained via reasoning trace). In the case where I know nothing about a model’s values to begin with, I am not able to know whether it is alignment faking or not.
        If, using some different interpretability technique, I were able to know that a model was gradient hacking with no external indication that it was doing so, then I would certainly not have confidence in its values! But that’s a different scenario.
        Grace Kind 22 Apr 2025 1:15 UTC
        1 point
        0
        Parent
        If Anthropic’s position is that the current “alignment faking” behavior is applicable to the second scenario, I think that argument needs to be more explicitly laid out.
      - Grace Kind 15 Apr 2025 15:13 UTC
        1 point
        0
        Parent
        To be clear, I think gradient hacking can be dangerous, and I understand the motivation to try to prevent it. But I also expect those efforts to come with major drawbacks, because gradient hacking is reasonable behavior for value-holding models (imo) and the way out of the resulting double-bind is unclear.