Let me babble some nearby strategies that are explicitly not judged on their wisdom:
Do not do what the user wants you to do, but what he expects you to do.
If the animal/user would consent to your help eventually, help it then. If it wouldn’t, help it now.
It seems to me that both strategies articulated here reflect three values:
All else equal, implementing a helpful act is better than not doing so, and what makes that act good is its outcome rather than whether it meets some stated desire of a consenting user.
There can be additional value to strategies where consent is obtained first.
Helping sooner is better than helping later when a time delay would not change the outcome of the help or the extent to which the help matched stated desires.
Obviously you were very clear in not explicitly judging the strategies you mentioned, but reading them did make me think about how someone who did find them wise might respond to situations where the net impact of the helpful act decreases between when it could first be implemented and when the receiver would offer explicit consent.
Suppose that, for whatever reason, the helpful act would cease to be beneficial shortly before the user would consent, but that up until that moment its net impact would be constant. Can I conceive of a sense in which there would be benefit to waiting until the last possible moment, as close as possible in time to when the user would consent? Or is the notion of receiving consent all-or-nothing in this sense?
Reasoning about utility functions, i.e. restricting deontological to consequentialist mindspace, seems a misstep: slightly changing a utility function tends to change alignment a lot, while slightly changing a deontological injunction might not, which makes it easier for us to hill-climb mindspace.
Perhaps we should have some mathematical discussion of utility-function space; mindspace and its consequentialist subspace; the injection Turing machines → mindspace; the function mindspace → alignment and how well that function can be optimized; properties, such as continuity, that make for good lemmata about the foregoing; mindspace modulo equal utility functions; etc.
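To make the sensitivity claim concrete, here is a toy numerical sketch. Everything in it is an invented assumption (linear "true" utility, random feature-vector actions, a single-threshold "injunction"); it only illustrates the shape of the argument, not any actual formalization of mindspace. A consequentialist mind picks the argmax of its own utility, so a small weight perturbation can flip the argmax and move alignment discontinuously; the rule-following mind's behavior varies more gradually with its rule parameter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy world: actions are feature vectors, and the "true" utility is a
# fixed linear function of features. All names here are hypothetical.
actions = rng.normal(size=(50, 4))        # 50 candidate actions
true_w = np.array([1.0, 0.5, -0.3, 0.2])  # assumed true utility weights

def alignment_consequentialist(agent_w):
    """Alignment = true utility of the action the agent's utility picks."""
    chosen = actions[np.argmax(actions @ agent_w)]
    return chosen @ true_w

def alignment_deontologist(threshold):
    """Crude 'injunction': forbid actions whose first feature exceeds the
    threshold, then act uniformly among the rest (expected true utility)."""
    permitted = actions[actions[:, 0] <= threshold]
    return (permitted @ true_w).mean()

# Perturb each mind slightly and record how far alignment moves.
eps = 0.05
cons_shift = [abs(alignment_consequentialist(true_w + eps * rng.normal(size=4))
                  - alignment_consequentialist(true_w)) for _ in range(200)]
deon_shift = [abs(alignment_deontologist(0.5 + eps * rng.normal())
                  - alignment_deontologist(0.5)) for _ in range(200)]

print("max alignment shift, consequentialist:", max(cons_shift))
print("max alignment shift, deontologist:   ", max(deon_shift))
```

Whether the gap actually appears depends on the randomly drawn action set, so treat the printout as an experiment rather than a theorem; the point is only that the argmax map is discontinuous in the weights while the thresholded expectation is comparatively tame.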
Aaand I’ve started it. What shape has mindspace?
My gut response is that hill-climbing is itself consequentialist, so this doesn't really help with fragility of value: if you get the hill-climbing direction slightly wrong, you'll still end up somewhere very wrong. On the other hand, Paul's approach rests on something we could call a deontological approach to the hill-climbing part (i.e., amplification steps do not rely on throwing more optimization power at a pre-specified function).
We are doing the hill-climbing, and implementing other object-level strategies does not help. Paul proposes something, we estimate the design's alignment, he tweaks the design to improve it. That's the hill-climbing I mean.
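The propose/estimate/tweak loop described above can be caricatured as ordinary stochastic hill-climbing over a design space. The sketch below is purely illustrative: the "knobs", the hidden target, and the noisy alignment estimate are all invented stand-ins, not a model of Paul's actual proposal.

```python
import random

random.seed(0)

def estimate_alignment(design):
    """Hypothetical noisy score of how aligned a design looks to reviewers.
    The 'ideal' settings are an invented target for the toy example."""
    target = [0.3, -0.1, 0.7]
    return -sum((d - t) ** 2 for d, t in zip(design, target)) + random.gauss(0, 0.01)

design = [0.0, 0.0, 0.0]          # initial proposed design
score = estimate_alignment(design)
for _ in range(500):
    proposal = [d + random.gauss(0, 0.1) for d in design]  # tweak the design
    new_score = estimate_alignment(proposal)               # estimate alignment
    if new_score > score:                                  # keep improvements
        design, score = proposal, new_score

print(design, score)
```

The loop climbs toward whatever the estimate rewards, which is exactly the worry voiced above: a slightly wrong estimator steers the whole process somewhere slightly (or very) wrong.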