Philosophical progress I wish would happen:
Start from the Callard version of Aspiration (how should we reason/act about things that change our values).
Extend it to cover all kinds of value shifts (not just the ones the agent desires).
Deal with the case of adversaries (other agents in your environment who want to change your values).
Figure out the game theory (what does it mean to act optimally in an environment where both I and others are changing my values, and how can I act optimally there?).
Figure out what this means for corrigibility (e.g. is corrigibility just one region of the phase diagram of strategies in a world where your values are changing, and is there a separate phase that keeps the good parts of corrigibility without the horrifying parts?).
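
A minimal toy sketch of the ambiguity the game-theory item points at (my own construction, not anything from Callard or the note above): with some hypothetical numbers, which action counts as "optimal" flips depending on whether later outcomes are scored by the agent's current values or by the values the action itself would induce.

```python
# Two-period toy model of acting under self-induced value change.
# All names and numbers are hypothetical, chosen only so the ranking flips.
# Values are weights over two outcome kinds, "A" and "B".
current = {"A": 1.0, "B": 0.0}

# action: (period-1 payoff, values after acting, period-2 payoff)
actions = {
    # keep your values; steady small payoffs in A
    "stay":   ({"A": 1, "B": 0}, {"A": 1.0, "B": 0.0}, {"A": 1, "B": 0}),
    # aspirational action: no payoff now, shifts values toward B,
    # and opens up a large B payoff later
    "aspire": ({"A": 0, "B": 0}, {"A": 0.2, "B": 0.8}, {"A": 0, "B": 3}),
}

def score(payoff, values):
    """Value-weighted sum of a payoff bundle."""
    return sum(values[k] * amount for k, amount in payoff.items())

def total(action, judge_period2_by):
    """Two-period total. Period 1 is always scored by today's values;
    period 2 is scored either by today's values ("current") or by the
    values the action induces ("future")."""
    p1, new_values, p2 = actions[action]
    v2 = new_values if judge_period2_by == "future" else current
    return score(p1, current) + score(p2, v2)

for a in actions:
    print(a,
          "judged by current values:", total(a, "current"),
          "judged by post-shift values:", total(a, "future"))
```

Judged purely by current values, "aspire" looks worthless (its payoff lands entirely in B, which the current values ignore); judged by the post-shift values, it dominates. Any "game theory of value change" has to say which of these evaluations, or what mixture of them, defines optimal play.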