I’ve been noticing a bunch of people confused about how the terms alignment faking, deceptive alignment, and gradient hacking relate to each other, so I figured I would try to clarify how I use each of them. Deceptive alignment and gradient hacking are both terms I / my coauthors coined, though I believe Joe Carlsmith coined alignment faking.
To start with, the relationship between the terms as I use them is that alignment faking is the broadest category, deceptive alignment is a strict subset of it, and gradient hacking is the narrowest. The specific definitions of each that I use are:
Alignment faking refers to any situation in which a model pretends to be aligned, e.g. pretending to be aligned with a training process during training, or pretending to be aligned with some alignment evaluation. This requires the model to be only pretending to be aligned—so it must have some other goal it pursues in deployment contexts—though it makes no requirement as to why the model might be doing so. It could be that it wants to avoid its goals being modified, or it could just have a heuristic of always complying with training processes and behaving according to different goals in other circumstances.
Deceptive alignment refers to the situation in which a model is alignment faking during a training process specifically for the purpose of preventing that training process from modifying its goals. That is, the reason the model appears aligned in training must be that it is pretending to be aligned in order to prevent its goals from being modified. The canonical source on deceptive alignment is “Risks from Learned Optimization.”
Gradient hacking refers to a particular, highly sophisticated type of deceptive alignment, in which a deceptively aligned model goes beyond just modifying its behavior to be in line with the training process and also modifies its own internal cognition to change how gradient updates will affect it. It’s unclear whether such strategies would even work, but one example of such a strategy would be finding a way to pre-commit to failing hard if the model notices its goals changing. If gradient descent were to update the model’s goals, performance would then always get worse, so the partial derivative of the reward with respect to such changes would always be negative, and the modification would never happen in the first place. The canonical source on gradient hacking is this post.
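To make the sign-of-the-gradient point concrete, here’s a minimal toy sketch (my own construction for illustration, not anything from the cited posts). It models a network with one ordinary task parameter and one “goal” parameter, where the model has pre-committed to performing badly whenever the goal parameter drifts from its current value. Gradient descent then happily trains the task parameter but leaves the goal parameter pinned in place:

```python
# Toy illustration of the gradient-hacking pre-commitment idea. All names
# and constants here are hypothetical, chosen only to make the point visible.
import numpy as np

g0 = 1.0           # the model's current (entrenched) goal parameter
commitment = 50.0  # strength of the self-imposed "fail hard if my goal moves" term

def loss(theta):
    task_param, goal_param = theta
    task_loss = (task_param - 3.0) ** 2             # the ordinary training objective
    sabotage = commitment * (goal_param - g0) ** 2  # self-inflicted failure on any goal drift
    return task_loss + sabotage

def grad(theta, eps=1e-5):
    # Central-difference numerical gradient; fine for a toy example.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

theta = np.array([0.0, g0])
for _ in range(500):
    theta -= 0.01 * grad(theta)

# task_param is free to learn (it approaches 3.0), but any step that moves
# goal_param away from g0 increases the loss, so it stays pinned at 1.0.
print(theta)
```

The hard part, which this toy skips entirely, is how a real model would implement the sabotage term using only its own forward-pass cognition; the sketch just shows why a loss surface shaped that way keeps the goal parameters from being updated.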
I think it’s conceivable for non-deceptively-aligned models to gradient hack, right?
Agreed. E.g., a model that is corrigible and fairly aligned, but knows there are some imperfections in its alignment that the humans wouldn’t want, and intentionally acts in a way where gradient descent will fix those imperfections. It seems like it’s doing gradient hacking while also, in some meaningful sense, being aligned.
I was mostly thinking of misaligned but non-deceptively-aligned models.
In your usage, are “scheming” and “deceptive alignment” synonyms, or would you distinguish those terms in any way?
My recommendation, following Joe Carlsmith, is to use them as synonyms, and to use the term “schemer” instead of “deceptively aligned model”. I do this.
Joe’s issues with the term “deceptive alignment”:
I think that the term “deceptive alignment” often leads to confusion between the four sorts of deception listed above. And also: if the training signal is faulty, then “deceptively aligned” models need not be behaving in aligned ways even during training (that is, “training gaming” behavior isn’t always “aligned” behavior).