When thinking about how a smarter-than-human AI would treat human input to close the control loop, it pays to consider the cases where humans are the smarter intelligence. How do we close the loop when dealing with young children? Primates, dolphins, magpies? Dogs and cats? Fish? Insects? Bacteria? In all these cases the apparent values/preferences of the “environment” are basically adversarial: something that must be taken into account, but definitely not obeyed. In the original setup, a super-intelligent aligned AI’s actions would be incomprehensible to us, no matter how hard it tried to explain them (go explain to a baby that eating all the chocolate it wants is not a good idea, or to a cat that its favorite window must remain closed). Again, in the original setup this can be as drastic as an AI culling the human population to save us from a worse fate. Sadly, this is not far from the “God works in mysterious ways” excuse one hears as a universal answer to the questions of theodicy.
The problem with this line of reasoning is that it assumes the goal-directedness comes from the smarter half of the duo of decision-maker and bearer of consequences. With children and animals, we treat their preferences as an input into our decision-making, which mainly seeks to satisfy our own preferences. We do not raise children solely for the purpose of satisfying their preferences.
This is why Rohin particularly stresses the idea that the danger lies in the source of goal-directedness, and that if it comes from humans, then we are safer.
We raise children to satisfy their expected well-being, not their naive preferences (for chocolate and toys), and that seems similar to what a smarter-than-human AI would do to/for us. Which was my point.
I think we raise children to satisfy our common expected well-being (ours + theirs + the overall societal one). Thus, the goal-directedness comes from society as a whole. I think there is a key difference between this system and one where a smarter-than-human AI focuses solely on the well-being of its users, even if it does Coherent Extrapolated Volition, which I think is what you are referring to when you talk about expected well-being (and I agree that, if you look only at their CEV-like property, the two systems are equivalent).