The chatbot is “generally intelligent”, so buying furniture is just one of many tasks it may be asked to execute; another task it could be asked to do is “order me some food”.
The hard part is indeed in spontaneously recognizing distinctions, but we already reward RL agents for curiosity, i.e. for taking an action whose consequences their world model fails to predict. And predicting which new distinctions are salient to humans is something you can optimize directly, because you can cleanly label it.
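To make that concrete, the curiosity signal can be as simple as the prediction error of a learned dynamics model. Here is a minimal sketch in Python, assuming a toy linear world model; the names `WorldModel` and `curiosity_bonus` are illustrative, not from any existing library.

```python
# Minimal sketch of a curiosity-style intrinsic reward: the agent gets extra
# reward whenever its learned dynamics model mispredicts the next observation.
import numpy as np

class WorldModel:
    """Toy linear dynamics model: predicts next_obs from (obs, action)."""
    def __init__(self, obs_dim, act_dim, lr=0.01):
        self.W = np.zeros((obs_dim, obs_dim + act_dim))
        self.lr = lr

    def predict(self, obs, action):
        return self.W @ np.concatenate([obs, action])

    def update(self, obs, action, next_obs):
        x = np.concatenate([obs, action])
        error = next_obs - self.W @ x
        self.W += self.lr * np.outer(error, x)  # gradient step on squared error
        return error

def curiosity_bonus(model, obs, action, next_obs, scale=1.0):
    """Intrinsic reward = magnitude of the world model's prediction error."""
    error = model.update(obs, action, next_obs)
    return scale * float(np.linalg.norm(error))

# Usage: the reward the RL agent actually optimizes is the environment reward
# plus the curiosity bonus, so poorly-predicted transitions get sought out.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    model = WorldModel(obs_dim=4, act_dim=2)
    obs, action, next_obs = rng.normal(size=4), rng.normal(size=2), rng.normal(size=4)
    env_reward = 0.0
    total_reward = env_reward + curiosity_bonus(model, obs, action, next_obs)
    print(total_reward)
```

The same labeling trick applies to the "salient to humans" point: since humans can mark which newly-discovered distinctions they care about, that signal can be trained on directly rather than left implicit.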
Also to clarify, we’re only arguing here about whether this capability will be naturally invested-in, so I don’t think it matters if highly capable bots have other strategies.
I think the capabilities of the AI matter a lot for alignment strategies, which is why I'm asking you about them and why I need you to answer that question.
A subhuman intelligence would rely on humans to make most of the decisions. It would order human-designed furniture types through human-created interfaces and receive human-fabricated furniture. At each of those steps it delegates an enormous number of decisions to humans, which means those decisions automatically end up reasonably aligned, but it also prevents the AI from optimizing over them. In the particular case of human-designed interfaces, they tend to automatically expose information about the things humans care about, so eliciting human preferences can be shortcut by focusing on those dimensions.
But a superhuman intelligence would solve tasks by taking actions independently of humans, since that allows it to optimize the outcomes more aggressively. And a solution to alignment that relies on humans making most of the decisions would presumably not generalize to this case, where the AI makes most of the decisions.
I think there are intermediate cases (delegating some but not all decisions) that require this sort of tooling. See, e.g., this paper from today, which focuses on how to learn intent: http://ai.googleblog.com/2022/04/simple-and-effective-zero-shot-task.html