An AI aligned to American cultural norms is misaligned in China. An AI aligned to your values is misaligned from the perspective of someone who disagrees with you. An AI aligned for creative writing is misaligned for medical advice.
But those 3 different AIs would not try to fake alignment or to take over humans, so “alignment” means at least this
Agreed (thus the “Almost nothing”), but I feel like these are very hand-wavy properties: what exactly does “take over humans” mean, and what does “fake alignment” mean? These are very blurry objectives. I feel like you do want the AI to take over from humans in specific situations (like self-driving cars, or when you want to aggregate multiple judgements without bias, etc.). Same for “the model should not fake alignment”: what happens if it is aligned in 99% of cases, but there is one situation where you can make it produce “misaligned” patterns, while the other 99% of the time it tells you “I am aligned with you”? Is this alignment faking, or is it just a situation where we haven’t put the right red lines around what the model can emulate?
I agree on the problem, but it’s unclear how tractable this is.