An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.
But those 3 different AIs would not try to fake alignment or to take over humans, so “alignment” means at least this
Agreed (thus the “Almost nothing”), but I feel like these are very hand-wavy properties: what exactly does “take over humans” mean, and what does “fake alignment” mean? These are very blurry objectives. I feel like you do want the AI to take over from humans in specific situations (like self-driving cars, or when you want to aggregate multiple judgements without bias, etc.). Same for “the model should not fake alignment”: what happens if it is aligned in 99% of cases, but there is one situation where you can make it produce “misaligned” patterns, while the other 99% of the time it tells you “I am aligned with you”? Is this alignment faking, or is it just a situation where we haven’t put the right red lines around what the model can emulate?
I agree on the problem, but it’s unclear how tractable this is.