Given a hard-coded agent that explicitly computed the consequences of its actions, and then took the action which maximized expected value according to its utility function, we would observe precisely the opposite behavior. Mutate a single line of code, and the agent's functionality would almost certainly break. However, the agent would still be robust in the alignment sense, as its values would never drift.
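To make the picture concrete, here is a minimal sketch of such a hard-coded expected-utility maximizer. All names here (`utility`, `transition_model`, the toy actions and outcomes) are illustrative assumptions, not anything from the original discussion; the point is only that the agent's values live in an explicit, inspectable table, while the surrounding machinery is ordinary, fragile code.

```python
# Toy expected-utility maximizer. Everything here is a hypothetical
# illustration: the actions, outcomes, and probabilities are made up.

ACTIONS = ["left", "right", "wait"]

def utility(outcome: str) -> float:
    # The agent's "values": an explicit mapping from outcomes to scores.
    return {"goal": 1.0, "trap": -1.0, "start": 0.0}[outcome]

def transition_model(action: str) -> list[tuple[float, str]]:
    # Explicitly computed consequences: (probability, outcome) pairs.
    return {
        "left":  [(0.9, "goal"), (0.1, "trap")],
        "right": [(0.5, "goal"), (0.5, "trap")],
        "wait":  [(1.0, "start")],
    }[action]

def expected_value(action: str) -> float:
    return sum(p * utility(o) for p, o in transition_model(action))

def choose_action() -> str:
    # Take the action maximizing expected value under the utility function.
    return max(ACTIONS, key=expected_value)
```

Mutating almost any line (a flipped sign in `expected_value`, a swapped key in `transition_model`) breaks the agent's competence, yet the values themselves sit in one explicit table rather than being diffusely encoded across the whole program.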
Why are its values immune? If its utility function is not made of code, what is it made of?