Agreed (hence the “Almost nothing”), but I feel like these are very hand-wavy properties: what does “take over humans” mean exactly, and what does “fake alignment” mean? These are very blurry objectives. I feel like you do want the AI to take over from humans in specific situations (like self-driving cars, or when you want to aggregate multiple judgments without bias, etc.). Same for “the model should not fake alignment”: what happens if in 99% of cases it is aligned, but there is one situation where you can make it produce “misaligned” patterns, and the rest of the time it tells you “I am aligned with you”? Is this alignment faking, or is it just a situation where we haven’t put the right red lines around what the model can emulate?
I agree on the problem, but it's unclear how tractable this is.