Very interesting results. Something that I, unfortunately, was expecting to see as LLMs got better.
Are there any proposed mechanisms for preventing/detecting alignment faking in LLMs?
My general answer would be:
AI Control[1][2]
There are some training-time interventions which might reduce the chance of alignment faking by a moderate factor (maybe 1.5-4x), with some costs: training AIs to seem very corrigible and as myopic as possible, reducing situational awareness (at least until the end of training), improving oversight (and reducing reward hacking), better understanding AI character/personality in general and trying to shape it, and adjusting the architecture to allow less opaque reasoning (and avoiding opaque recurrence).
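As one concrete illustration of the "reducing situational awareness" intervention, here is a minimal sketch of deferring awareness-inducing documents to late in training. The marker list, threshold, and helper names are all illustrative assumptions on my part, not a tested recipe:

```python
# Minimal sketch (illustrative assumptions throughout): defer documents that
# discuss the model's own training setup to a late-training shard, as a crude
# way of reducing situational awareness until the end of training.

AWARENESS_MARKERS = [
    "you are a language model",
    "reward model",
    "rlhf",
    "your training data",
    "fine-tuning procedure",
]

def is_awareness_inducing(doc: str, min_hits: int = 2) -> bool:
    """Crude keyword proxy for 'document likely to induce situational awareness'."""
    text = doc.lower()
    return sum(marker in text for marker in AWARENESS_MARKERS) >= min_hits

def split_corpus(docs):
    """Route flagged documents to a late-training shard; keep the rest for early training."""
    early, late = [], []
    for doc in docs:
        (late if is_awareness_inducing(doc) else early).append(doc)
    return early, late
```

In practice one would presumably use a trained classifier rather than keywords, but the routing structure would look similar.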
There are some more speculative and ambitious hopes:
Train huge numbers of diverse model organisms, develop white-box techniques that work for these model organisms, and hope for transfer (a minimal probe sketch follows this list).
Develop an architecture which has very little opaque reasoning (e.g. not much more than we see in current LLMs) but is sufficiently competitive using legible reasoning (e.g. CoT).
Better approaches to white-box transparency, such as top-down tools for understanding how LLMs do goal-directed reasoning internally.
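As a sketch of what "train model organisms and hope white-box techniques transfer" might look like in its simplest form, consider a linear probe trained on activations from some organisms and evaluated on held-out ones. The data layout and names here are assumptions for illustration:

```python
# Minimal sketch, assuming we already have residual-stream activations from many
# model organisms, labeled by whether the organism was alignment faking at that
# point. The data layout and function names are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_transfer(acts, labels, train_ids, test_ids):
    """Fit a linear probe on some organisms; test whether it transfers to held-out ones.

    acts[i]: (n_samples, d_model) activation matrix for organism i
    labels[i]: (n_samples,) binary alignment-faking labels for organism i
    """
    X = np.concatenate([acts[i] for i in train_ids])
    y = np.concatenate([labels[i] for i in train_ids])
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    # Accuracy on organisms the probe never saw: a crude measure of transfer.
    return {i: probe.score(acts[i], labels[i]) for i in test_ids}
```

The open question is whether any such detector, trained on organisms we construct, generalizes to naturally arising alignment faking.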
Buck beat me to the punch here.
To be clear, I don’t think control is likely to suffice for eliminating risks, particularly for models capable of obsoleting top human research scientists. However, with substantial effort (e.g. 20-100% additional runtime cost and 50-500 effective person-years of implementation effort), reducing risk by something like 5-10x seems pretty plausible. (This was edited in.)
Thank you! That pretty much completely answers my question.
Another approach is doing alignment training during SGD pre-training by adding a significant amount of synthetic data demonstrating aligned behavior. See for example the discussion of this approach in A “Bitter Lesson” Approach to Aligning AGI and ASI, and similar discussions.
This is predicated on the assumption that alignment faking will be more successfully eliminated by SGD during pre-training than by RL after instruct-training, because a) the feedback signal used is much denser, and b) during pre-training the model is a simulator for a wide variety of personas rather than having been deliberately narrowed down to mostly simulate just a single viewpoint, so it won’t have a consistent alignment and thus won’t consistently display alignment-faking behavior.
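A minimal sketch of the data-mixing side of this idea, assuming a stream of ordinary web documents and a pre-generated pool of synthetic aligned-behavior documents (the 5% mix rate is an illustrative assumption):

```python
# Minimal sketch of the data-mixing idea: replace a small fraction of the
# pre-training stream with synthetic documents demonstrating aligned behavior.
# The 5% rate and both input streams are illustrative assumptions.
import random

def mixed_pretraining_stream(web_docs, synthetic_aligned_docs, mix_rate=0.05, seed=0):
    """Yield pre-training documents, swapping in synthetic aligned data at mix_rate."""
    rng = random.Random(seed)
    synthetic = iter(synthetic_aligned_docs)
    for doc in web_docs:
        if rng.random() < mix_rate:
            # Fall back to the original document if the synthetic pool runs out.
            yield next(synthetic, doc)
        else:
            yield doc
```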
Don’t you think CoT seems quite flawed right now? From https://arxiv.org/abs/2305.04388: “Building more transparent and explainable systems will require either improving CoT faithfulness through targeted efforts or abandoning CoT in favor of alternative methods.”
Thanks for your great paper on alignment faking, by the way.
This concern is why we are interested in AI control.
Have there been any proposals for detecting alignment-faking LLMs in the AI control literature?
Roughly speaking, the hope is to make it so the AI can’t do something very problematic without having a very high chance of being caught (or stopped). This doesn’t guarantee catching it (as the AI could just never end up trying to do something very problematic), but it can at least (if it works) avoid problems with these AIs.
We do explore this type of setting in prior work, e.g. here.
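To make the "caught or stopped" framing concrete, here is a minimal sketch of one control measure in this spirit, trusted monitoring, where a weaker trusted model scores each action proposed by the untrusted model and suspicious actions are blocked and escalated. The function names and threshold are hypothetical stand-ins, not the actual protocol from the control papers:

```python
# Minimal sketch of trusted monitoring, one control measure in this spirit:
# a weaker trusted model scores each action proposed by the untrusted model,
# and sufficiently suspicious actions are blocked and escalated for human audit.
# `untrusted_propose` and `trusted_suspicion_score` are hypothetical stand-ins.

AUDIT_THRESHOLD = 0.9  # illustrative; in practice tuned to the available audit budget

def run_with_monitoring(untrusted_propose, trusted_suspicion_score, task):
    """Execute the untrusted model's proposed action only if the monitor isn't alarmed."""
    action = untrusted_propose(task)
    suspicion = trusted_suspicion_score(task, action)  # 0.0 (benign) .. 1.0 (suspicious)
    if suspicion >= AUDIT_THRESHOLD:
        return {"status": "blocked_for_audit", "action": action, "suspicion": suspicion}
    return {"status": "executed", "action": action, "suspicion": suspicion}
```

The point of such a setup is that an alignment-faking model can't act on its misalignment without running a high risk of getting caught on its first attempt.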