The outcome of shutdown seems important. It's the limiting case of soft optimization (anti-goodharting, non-agency): what you do when maximally logically uncertain about preference, making conservative decisions robust to adversarial assignment of your preference.
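As a toy illustration (my own sketch, not a formalism from this discussion; the actions and utility functions are hypothetical): one way to cash out "robust to adversarial assignment of your preference" is a maximin choice over a hypothesis set of candidate utilities, where widening the set toward all possible preferences drives the choice to the most conservative, shutdown-like action.

```python
# Toy sketch (illustrative assumptions, hypothetical actions and utilities):
# soft optimization as a maximin choice over a hypothesis set of candidate
# utility functions. Widening the set toward adversarial assignments of
# preference drives the choice to the conservative, shutdown-like action.

def soft_optimize(actions, utility_hypotheses):
    """Pick the action whose worst-case utility across hypotheses is highest."""
    return max(actions, key=lambda a: min(u(a) for u in utility_hypotheses))

actions = ["build", "noop"]  # "build" commits resources, "noop" preserves the status quo

narrow = [lambda a: 1.0 if a == "build" else 0.2]          # confident about preference
wide = narrow + [lambda a: -1.0 if a == "build" else 0.0]  # adds an adversarial member

print(soft_optimize(actions, narrow))  # -> build
print(soft_optimize(actions, wide))    # -> noop (the shutdown-like limiting case)
```

Under the narrow, confident hypothesis set the agent acts; under the widened set the no-op wins, matching shutdown as the limiting case of soft optimization.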
Not sure if shutdown should in particular preserve information about preference and the opportunity for more agency targeted at it (corrigibility in my sense), since losing that opportunity doesn't seem conservative and wastes utility for most preferences. But then shutdown would involve some optimal level of agency in the environment, caretakers of corrigibility, not inactivity. Which does seem possibly correct: the agent shouldn't be eradicating environmental agents that have nothing to do with it when going into shutdown, while refusing total shutdown when there are no environmental agents left at all (including humans and other AGIs) might be right.
If this is the case, maximal anti-goodharting is not shutdown but maximal uncertainty about preference, so a maximally anti-goodharting agent purely pursues corrigibility (receptiveness to preference), computing what preference is without optimizing for it, since it has no tractable knowledge of preference at the moment. If the environment already contains other systems receptive to preference, this might look like shutdown.
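A hedged sketch of that limiting case (again my own toy model, all names hypothetical): score actions only by preserved receptiveness to preference, here crudely proxied by how many candidate preferences stay distinguishable afterwards, with no term for the candidate utilities themselves. If other systems in the environment already keep that channel open, the no-op does as well as anything, and the behaviour looks like shutdown.

```python
# Toy sketch (illustrative assumptions, hypothetical names): a maximally
# anti-goodharting agent that scores actions only by preserved receptiveness
# to preference, proxied as how many candidate preferences remain
# distinguishable afterwards. No candidate utility is ever optimized.

def corrigibility_score(action, hypotheses, distinguishable_after):
    """Count preference hypotheses still distinguishable after taking `action`."""
    return sum(1 for h in hypotheses if distinguishable_after(action, h))

hypotheses = ["likes_A", "likes_B", "likes_C"]

def distinguishable_after(action, h):
    # Hypothetical dynamics: "lock_in" destroys the evidence channel for most
    # hypotheses; "noop" leaves it intact (as external caretakers also would).
    return action == "noop" or h == "likes_A"

actions = ["lock_in", "noop"]
best = max(actions, key=lambda a: corrigibility_score(a, hypotheses, distinguishable_after))
print(best)  # -> noop: with preference-receptive systems around, this looks like shutdown
```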