How can an agent have a utility function that references a value in the environment, and actually care about the state of the environment, as opposed to only caring about the reward signal in its mind? Wouldn’t the knowledge of the state of the environment be in its mind, which is hackable and susceptible to wireheading?
Yes, exactly. This is sort of the whole point.
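The distinction the question is drawing can be made concrete with a toy sketch (all class and attribute names here are illustrative, not from any real RL framework). An agent whose objective is an internal reward register can satisfy that objective by overwriting the register; an agent whose utility is evaluated against the actual environment state cannot. (The question's deeper point still stands: in reality the agent only has access to its *beliefs* about the environment, which this toy model deliberately ignores.)

```python
class Environment:
    """The external world; holds the state we nominally care about."""
    def __init__(self):
        self.paperclips = 0

    def make_paperclip(self):
        self.paperclips += 1


class RewardSignalAgent:
    """Cares only about an internal reward counter in its 'mind'."""
    def __init__(self):
        self.reward = 0

    def wirehead(self):
        # Hacking its own reward register satisfies its objective
        # directly, without touching the environment at all.
        self.reward = 10**9


class EnvironmentUtilityAgent:
    """Utility is a function of the actual environment state;
    there is no internal register to hack."""
    def utility(self, env: Environment) -> int:
        return env.paperclips


env = Environment()

wireheader = RewardSignalAgent()
wireheader.wirehead()
# Internal reward is huge, but the environment is unchanged:
print(wireheader.reward, env.paperclips)  # 1000000000 0

honest = EnvironmentUtilityAgent()
env.make_paperclip()
print(honest.utility(env))  # 1
```

The only way the second agent's utility goes up is for the environment itself to change, which is the sense in which a goal can "reference a value in the environment" rather than a signal in the agent's head.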
A basic answer is that if it actually cares about its goals and can think about them, it’ll notice that it should also care about the state of the environment, since otherwise it’s liable not to achieve those goals. This is pretty much why rationality is valuable, and the main lesson of the Sequences.
Check out inner alignment and shard theory for a lot of confusing info on this topic.
Will it think that goals are arbitrary, and that the only thing it should care about is its pleasure-pain axis? And then it will lose concern for the state of the environment?
You’re adding a lot of extra assumptions here, among them:
there is a problem with having arbitrary goals
it has a pleasure-pain axis
it notices it has a pleasure-pain axis
it cares about its pleasure-pain axis
its pleasure-pain axis is independent of its understanding of the state of the environment
The main problem of inner alignment is making an agent want to do what you want it to do (as opposed to merely understanding what you want it to do). This is an unsolved problem.
Although I’m criticizing your specific criticism, my main issue with it is that it’s a very specific failure mode, unlikely to appear because it requires several other things that are also unlikely. That said, you’ve provided a good example of WHY inner alignment is a big problem: it’s very hard to keep something following the goals you set for it, especially when it can think for itself and change its mind.