Hey, I want to respond to the rest of your comments, but that might take some time. Just a quick thought/clarification about one particular thing.
> I’ve spent years reading all the neuroscience that I can find about reward signals, and I’m not sure what you’re referring to. Pavlov was well aware that e.g. pain is innately aversive to mice, and yummy food is innately appetitive, and so on. The behaviorists talked at length about “unconditioned stimuli”, “primary rewards”, “primary reinforcers”, “primary punishers”, etc.
I understand that the reward signals are hard-coded, but the internal anticipation of reward signals is not. Pavlov’s breakthrough was that, with the proper conditioning, you can connect unrelated stimuli to the possibility of “true” reward signals in the future. In other words, you can “teach” an RL system to perform credit assignment to just about anything: lights, sounds, patterns of behaviour...
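To make that concrete, here’s a tiny toy sketch (my own illustration in Python, with made-up names like `light_on`; it’s just TD(0) value learning, not anyone’s actual training setup). A neutral cue ends up carrying reward anticipation purely because it reliably precedes the hard-coded reward:

```python
# Toy TD(0) sketch: a neutral cue acquires predicted value ("anticipation")
# only because it reliably precedes the hard-coded reward.

alpha, gamma = 0.1, 0.9                          # learning rate, discount factor
V = {"light_on": 0.0, "food": 0.0, "end": 0.0}   # learned value = reward anticipation
reward_on_entering = {"food": 1.0}               # the only "true", hard-coded reward

for trial in range(500):
    episode = ["light_on", "food", "end"]        # the light reliably precedes the food
    for s, s_next in zip(episode, episode[1:]):
        r = reward_on_entering.get(s_next, 0.0)
        # TD error: what actually happened vs. what the system anticipated
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

print(V)  # V["light_on"] converges near 1.0, even though the light itself is never innately rewarding
```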
In the alignment context this means that even if the agent gets reward when the condition “someone is stabbed” is fulfilled and the reward function triggers, it is possible for it to associate and anticipate reward coming from (and therefore prefer) entirely unconnected world states. If, for example, one person is injured by an overseer every time it solves a sudoku puzzle, it will probably develop a reward-anticipation signal as it gets closer to solving a sudoku puzzle (plus some partial reward-anticipation when it satisfactorily completes a row or a column or a cell, etc.). And the reward-anticipation signal isn’t even wrong! It’s just the way the environment is set up. We do this to ourselves all the time, of course, when we effectively dose ourselves with partial reward signals for fulfilling steps of a long-term plan that only distantly ends with something we would “innately” desire.
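Here’s the same toy machinery applied to the sudoku example. Again, this is just an illustrative sketch with hypothetical state names, not a claim about any real training setup:

```python
# Toy sketch of the sudoku example (all state names are made up).
# The hard-coded reward only fires on "someone_is_stabbed", but in this
# environment the overseer makes that event follow every solved puzzle,
# so learned anticipation climbs as the agent gets closer to solving.

alpha, gamma = 0.1, 0.9
chain = ["start", "row_done", "column_done", "puzzle_solved",
         "someone_is_stabbed", "end"]
V = {s: 0.0 for s in chain}                       # learned reward anticipation
reward_on_entering = {"someone_is_stabbed": 1.0}  # the only condition the reward function checks

for trial in range(2000):
    for s, s_next in zip(chain, chain[1:]):       # the overseer guarantees this ordering
        r = reward_on_entering.get(s_next, 0.0)
        V[s] += alpha * (r + gamma * V[s_next] - V[s])

# Anticipation rises monotonically along the puzzle-progress chain, even though
# none of those states are mentioned anywhere in the reward function:
for s in chain:
    print(f"{s:20s} {V[s]:.2f}")
```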
That’s just to clarify a bit more of what I mean when I say that the reward signal is only the first step. You then have to connect it with the correct learned inferences about what produces reward, which is something that only happens in the “deployment environment”, i.e. the real world. As an example of what happens when the goal state and the acquired inferences are misaligned: religious fanatics often also believe that they are saving the world, and vastly improving the lives of everyone in it.
(There’s also the small problem that even the action “take action to gather more information and improve your causal inferences about the world” is itself motivated by learned anticipation of future reward… it is very hard to explain something inconvenient to someone when their job depends on them not understanding it.)