This is MUCH more clearly written, thanks.
We still have the problems that we:
can’t extract the exact concept (e.g., the concept of human values) from an AI, even if it has that concept somewhere. Yes, we can look at which activations correlate with some behaviour (a toy sketch of what I mean is below), and things like that, but that’s far from enough.
can’t train an AI to optimize a concept taken from the world model of its earlier version. We have no way to formalize such a training objective.
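To make the first point concrete, here is a minimal sketch of the kind of probing I mean, fitting a linear probe that predicts a behaviour label from hidden activations. This is my own illustration with made-up random "activations" and a planted toy signal, not anything from a real model or from the post I'm replying to.

```python
# Toy sketch (illustration only): a linear probe on hypothetical activations.
# In real interpretability work the activations and behaviour labels would
# come from a model; here they are random placeholders with a planted signal.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical data: 1000 examples of 512-dimensional hidden activations,
# each labelled with whether the model exhibited some behaviour (0/1).
activations = rng.normal(size=(1000, 512))
behaviour = (activations[:, :3].sum(axis=1) > 0).astype(int)  # toy signal

# Fit a linear probe: a classifier that predicts the behaviour label
# from the activations alone.
probe = LogisticRegression(max_iter=1000).fit(activations, behaviour)
print("probe accuracy:", probe.score(activations, behaviour))
```

High probe accuracy shows that some directions in activation space correlate with the behaviour, but it doesn't hand you the exact concept, and it certainly doesn't give you a training objective that makes a successor AI optimize that concept.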
Maybe Bostrom thought that weak AIs won’t have a good enough world model, as you interpret him. Or maybe he already thought that we won’t be able to use the world model of one AI to direct another. Either way, the conclusion stands.
I also think that current AIs probably don’t have a concept of human values that would actually be safe to optimize hard. And I’m not sure AIs will have one before they have the ability to stop us from changing their goals. But if that were the only problem, I would agree that the risk is more manageable.