If an agent has a preference for a move in a specific chess position, but then gets more compute and more optimization, gets better at chess, and makes a different move as a result, would you say its preference changed, or that it reduced epistemic uncertainty and got better at achieving its preference, which stayed the same?
I think of preferences as a description of agent behavior, which means the preferences changed.
When you say “got better at achieving its preference”, I suppose you’re thinking of a preference as some goal the agent is pursuing. I find this view (assuming goal-directedness) less general in its ability to describe agent behavior. It may be more useful, but if so, I think we need to justify it better. I don’t exclude the possibility that there’s a piece of information I’m missing.
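To make the distinction concrete, here is a minimal toy sketch (my own illustration, with a made-up game tree, not anything specific to the chess example above): the minimax objective, which one might call the agent’s “goal”, stays fixed, while the move the agent actually picks changes once it can search deeper. On the behavioral reading its preference changed; on the goal-directed reading only its epistemic state did.

```python
# Toy sketch (assumed example): the agent's "goal" -- the fixed evaluation
# function / minimax objective -- never changes, but the move it prefers
# changes as search depth (compute) grows.

# Hand-built game tree: each node has a static evaluation and child positions.
TREE = {
    "eval": 0,
    "children": {
        "a": {  # looks good immediately...
            "eval": +1,
            "children": {
                "reply": {"eval": -5, "children": {}},  # ...but turns out badly later
            },
        },
        "b": {  # looks neutral immediately...
            "eval": 0,
            "children": {
                "reply": {"eval": +3, "children": {}},  # ...but wins out in the end
            },
        },
    },
}


def minimax(node, depth, maximizing):
    """Fixed objective: maximize the static eval at the search horizon."""
    if depth == 0 or not node["children"]:
        return node["eval"]
    values = [minimax(c, depth - 1, not maximizing) for c in node["children"].values()]
    return max(values) if maximizing else min(values)


def best_move(tree, depth):
    """The move the agent 'prefers' given a certain amount of compute (depth)."""
    return max(
        tree["children"].items(),
        key=lambda kv: minimax(kv[1], depth - 1, maximizing=False),
    )[0]


print(best_move(TREE, depth=1))  # 'a': shallow search prefers the flashy move
print(best_move(TREE, depth=2))  # 'b': deeper search prefers the sound move
```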
Goal-directedness leads toward instrumental convergence and away from corrigibility. If we are looking to solve corrigibility, I think it’s worth questioning goal-directedness.