This post helped me clarify my thoughts on interference with supervisors.
Before this, I was unclear on how to draw the boundary between interference (like a cleaning robot disabling a human to stop punishments for broken furniture) and positive environmental changes (like turning on a light fixture to see better) in a concrete way. The difference I thought of is that the supervisor exerts direct pressure to keep the agent from altering the supervisor. So a rule to prevent treacherous turns might look like “if an aspect of the environment is optimizing against change by the agent, act as though the defenses against change had no loophole.”
Of course, we’d eventually want something finer-grained than that- we’d want a sufficiently aligned agent to be able to dismantle a dangerous object, or eventually carry out a complicated brain surgery that was too tricky for a human doctor.