I’ve been having discussions with a friend about his idea of getting a near-corrigible agent to land in a ‘corrigibility basin’. The idea is that you could make an agent close enough to corrigible that, upon receiving critical feedback from a supervisor or the environment about imperfections in its corrigibility, it would be willing to self-edit to bring itself in line with a more corrigible version of itself.
I would like to see some toy-problem research focused on the corrigibility sub-problem of correctional self-editing rather than on the sub-problem of ‘the shutdown problem’.
I’d be interested to see this as well!