Quentin FEUILLADE--MONTIXI comments on Three Properties for Alignment (and Why We’re Not Training Them)

Quentin FEUILLADE--MONTIXI 17 Mar 2026 22:11 UTC
1 point
0
Agreed (thus the “Almost nothing”), but I feel like these are very hand-wavy properties: what does “take over humans” mean exactly, and what does “fake alignment” mean? These are very blurry objectives. I feel like you do want the AI to take over from humans in specific situations (like self-driving cars, or when you want to aggregate multiple judgments without bias, etc.). Same for “the model should not fake alignment”: what happens if it is aligned in 99% of cases, but there is one situation where you can make it produce “misaligned” patterns, and the rest of the time it tells you “I am aligned with you”? Is this alignment faking, or is it just a situation where we haven’t put the right red lines around what the model can emulate?
- Charbel-Raphaël 18 Mar 2026 12:11 UTC
  2 points
  0
  I agree on the problem, but it is unclear how tractable this is.