I don’t think “inner alignment” is applicable here.
If the clone behaves indistinguishably from the human it is based on, then there is simply nothing more to say. It doesn’t matter what is going on inside.
Right, I agree on that. The problem is, “behaves indistinguishably” for how long? You can’t guarantee that it won’t stop behaving that way in the future, which is exactly what deceptive alignment predicts.
Yes, that’s true. But if your AI is merely supposed to imitate a human, deceptive alignment becomes much easier to prevent: you can search for the minimal model that mimics the human, and that minimality excludes exotic behaviors such as deception.
This is essentially why machine learning works at all: you don’t pick an arbitrary model that happens to fit your training data well, you pick the smallest one that does.
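As a toy illustration of that minimality argument (not anything from the thread itself): the sketch below fits polynomials of increasing degree to noisy data and keeps the smallest degree whose held-out error is close to the best achievable. The polynomial family, the 10% tolerance, and all names here are illustrative assumptions, not a real training procedure.

```python
# Toy sketch of "pick the smallest model that fits" (Occam-style model
# selection). Everything here is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for "behavior to imitate": a simple quadratic plus noise.
x_train = rng.uniform(-1, 1, 40)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + rng.normal(0, 0.1, 40)
x_val = rng.uniform(-1, 1, 40)
y_val = 1.0 + 2.0 * x_val - 3.0 * x_val**2 + rng.normal(0, 0.1, 40)

def val_error(degree):
    """Fit a polynomial of the given degree; return held-out MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    return np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)

errors = {d: val_error(d) for d in range(1, 15)}
best = min(errors.values())

# Rather than the best-scoring model, keep the *smallest* one whose
# error is within 10% of the best -- the minimality bias the comment
# appeals to. Higher-degree models fit about equally well here, but
# can behave wildly off-distribution (the "exotic behaviors").
minimal_degree = min(d for d, e in errors.items() if e <= 1.1 * best)
print(f"minimal adequate degree: {minimal_degree}")  # typically 2
```

On this toy task the minimal adequate model recovers the true degree, while the larger models that fit the training data equally well are free to diverge off-distribution, which is the analogue of the exotic behaviors the minimality bias is meant to exclude.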