That’s one difference! And probably the most dangerous one, if a clever enough AI notices it.
Some good things to read would be methods based on not straying too far from a “human distribution”: quantilization (Jessica Taylor’s paper), the original RLHF paper (Christiano et al.), and Sam Marks’ post about decision transformers.
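To make the “don’t stray too far from the human distribution” idea concrete, here’s a rough sketch of the quantilizer concept from Taylor’s paper: instead of taking the argmax over actions, you sample from a base (“human-like”) distribution restricted to its top-q slice by expected utility. This is a toy discrete version I’m writing for illustration; the function and variable names are mine, not from any of the cited papers.

```python
import random

def quantilize(actions, base_weights, utility, q=0.1, rng=random):
    """Toy q-quantilizer: sample from the top-q fraction (by utility)
    of a base distribution over actions, rather than taking the argmax."""
    # Rank actions by utility, best first.
    ranked = sorted(zip(actions, base_weights),
                    key=lambda ab: utility(ab[0]), reverse=True)
    total = sum(base_weights)
    # Keep the highest-utility actions until they cover q of the base probability mass.
    kept, mass = [], 0.0
    for action, weight in ranked:
        kept.append((action, weight))
        mass += weight / total
        if mass >= q:
            break
    # Sample from the base distribution renormalized to that top-q slice.
    acts, weights = zip(*kept)
    return rng.choices(acts, weights=weights, k=1)[0]
```

The point of the restriction is that any single sampled action can’t be too much more “optimized” than what the base distribution would have produced anyway, which bounds how far the agent strays from human-typical behavior.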
They’re important reads, but ultimately I’m not satisfied with these, for the same reason I mentioned about self-other overlap in the other comment a second ago: we want the AI to treat the human how the human wants to be treated, but that doesn’t mean we want the AI to act how the human wants to act. If we can’t build AI that reflects this distinction, we’re missing some big insights.