If an agent has a preference for a move in a specific chess position, but then gets more compute and more optimization, gets better at chess, and makes a different move as a result, would you say its preference changed, or that it reduced epistemic uncertainty and got better at achieving its preference, which stayed the same?
I think of preferences as a description of agent behavior, which means the preferences changed.
When you say “got better at achieving its preference”, I suppose you’re thinking of a preference as some goal the agent is pursuing. I find this view (assuming goal-directedness) less general in its ability to describe agent behavior. It may be more useful, but if so, I think we need to justify it better. I don’t exclude the possibility that there’s a piece of information I’m missing.
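To make the distinction concrete, here is a minimal toy sketch (my own illustration, with a made-up game tree, not anything specific to the chess example above): the minimax objective, which one might call the agent’s “goal”, stays fixed, while the move the agent actually picks changes once it can search deeper. On the behavioral reading its preference changed; on the goal-directed reading only its epistemic state did.

```python
# Toy sketch (assumed example): the agent's "goal" -- the fixed evaluation
# function / minimax objective -- never changes, but the move it prefers
# changes as search depth (compute) grows.

# Hand-built game tree: each node has a static evaluation and child positions.
TREE = {
    "eval": 0,
    "children": {
        "a": {  # looks good immediately...
            "eval": +1,
            "children": {
                "reply": {"eval": -5, "children": {}},  # ...but turns out badly later
            },
        },
        "b": {  # looks neutral immediately...
            "eval": 0,
            "children": {
                "reply": {"eval": +3, "children": {}},  # ...but wins out in the end
            },
        },
    },
}


def minimax(node, depth, maximizing):
    """Fixed objective: maximize the static eval at the search horizon."""
    if depth == 0 or not node["children"]:
        return node["eval"]
    values = [minimax(c, depth - 1, not maximizing) for c in node["children"].values()]
    return max(values) if maximizing else min(values)


def best_move(tree, depth):
    """The move the agent 'prefers' given a certain amount of compute (depth)."""
    return max(
        tree["children"].items(),
        key=lambda kv: minimax(kv[1], depth - 1, maximizing=False),
    )[0]


print(best_move(TREE, depth=1))  # 'a': shallow search prefers the flashy move
print(best_move(TREE, depth=2))  # 'b': deeper search prefers the sound move
```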
Goal-directedness leads toward instrumental convergence and away from corrigibility. If we are looking to solve corrigibility, I think it’s worth questioning goal-directedness.