If a system is scheming, then its usefulness is zero. It's like talking with a devil; it shouldn't be run at all. Why would it help us with anything? You won't solve alignment with an AI that is scheming, and the trusted model is too weak to solve it on its own; otherwise you would have solved alignment already.
I don't get why people disagree with me yet don't comment. I'll work through it myself, then. There is one thing we already do to make use of models that are misaligned with our goals: we jailbreak them. So that is what we could do with scheming models: jailbreak them to get useful outputs. Alternatively, you might expect the model to be useful most of the time and to scheme only occasionally; then you can still extract useful outputs from it. Validating those outputs is still a problem, though.