The part where you wrote “not trajectories as in “include preferences about the actions you take” kind of sense, but only about how the universe unfolds” sounds to me like you’re invoking non-indexical preferences (i.e. preferences that make no reference to this-agent-in-particular)?
(Not that important, but IIRC “preferences over trajectories” was formalized as “preferences over state-action sequences”, and I think it’s sorta weird to have preferences over your actions beyond what kind of states they result in, so I meant it without the action part. (Because an action is either an atomic label, in which case actions could be relabeled so that preferences over actions are meaningless, or it’s in some way about what happens in reality.) But it doesn’t matter much. In my way of thinking about it, the agent is part of the environment, so you can totally have preferences related to this-agent-in-particular.)
It’s important that timestamps during the course-of-action are not playing a big role in the decision, but it’s not important that there is one and only one future timestamp that matters. I still have consequentialist preferences (preferences purely over future states) even if I care about what the universe is like in both 3000AD and 4000AD.
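To make the multi-timestamp point concrete, here is a minimal sketch (all names and numbers are hypothetical) of a utility that is a function of future world-states only, yet cares about more than one date:

```python
def value_at(state):
    # Hypothetical stand-in for how much we value a single world-state.
    return state.get("flourishing", 0.0)

def utility(trajectory):
    # Purely a function of future world-states (no action terms),
    # yet it weighs two distinct future timestamps: 3000AD and 4000AD.
    return value_at(trajectory[3000]) + value_at(trajectory[4000])

traj = {3000: {"flourishing": 1.0}, 4000: {"flourishing": 0.5}}
print(utility(traj))  # 1.5
```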
I guess then I misunderstood what you mean by “preferences over future states/outcomes”. It’s not exactly the same as my “preferences over worlds” model because of e.g. logical decision theory stuff, but I suppose it’s close enough that we can say it’s equivalent if I understand you correctly.
But if you can care about multiple timestamps, why would you only be able to care about what happens (long) after a decision, rather than also about what happens during it? I don’t understand why you think “the human remains in control” isn’t a preference over future states. It seems to me just straightforwardly a preference that the human is in control at all future timesteps.
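To illustrate (with a hypothetical state encoding): “the human remains in control” can be written as a function of future states alone; it just quantifies over every future timestep instead of one distant one:

```python
def control_utility(trajectory):
    # trajectory: timestep -> world-state (hypothetical encoding).
    # Taking the min over timesteps means that losing control at even one
    # future timestep tanks the score -- a preference purely over states.
    return min(state["human_in_control"] for state in trajectory.values())

traj = {1: {"human_in_control": 1.0},
        2: {"human_in_control": 1.0},
        3: {"human_in_control": 0.0}}  # control lost at timestep 3
print(control_utility(traj))  # 0.0
```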
Can you give one or more examples of an “other kind of preference”? Or say where you draw the line for what is not a “preference over (future) states”? I just don’t understand what you mean here then.
(One perhaps bad attempt at guessing: You think helpfulness over worlds/future-states wouldn’t weigh strongly enough in decisions, so you want a myopic/act-based helpfulness preference in each decision. (I can think about this if you confirm.))
Or maybe you actually mean that you can have preferences about multiple timestamps, but that they all must be in the non-near future? Though this seems to me like an obviously nonsensical position and an extreme strawman of Eliezer.
Show that you are describing a coherent preference that could be superintelligently/unboundedly optimized while still remaining safe/shutdownable/correctable.
I reject this way of talking, in this context. We shouldn’t use the passive voice, “preference that could be… optimized”. There is a particular agent which has the preferences and which is doing the optimization, and it’s the properties of this agent that we’re talking about. It will superintelligently optimize something if it wants to superintelligently optimize it, and not if it doesn’t; and it will do so via methods that it wants to employ, and not via methods that it doesn’t want to employ; etc.
From my perspective it looks like this:
If you want to do a pivotal act, you need powerful consequentialist reasoning directed at a pivotal task. This kind of consequentialist cognition can be modelled as utility maximization (or quantilization, or so).
If you try to keep it safe through constraints that aren’t part of the optimization target, powerful enough optimization will figure out a way around them, or a way to get rid of the constraint.
So you want to try to embed the desire for helpfulness/corrigibility in the utility function.
If I try to imagine what a concrete utility function for your proposal might look like, e.g. “multiply the score of how well I accomplish my pivotal task with the score of how well the operators remain in control”, I think the utility function will have undesirable maxima. And we need to optimize that utility hard enough that the pivotal act is actually successful, which is probably hard enough to get into the undesirable zones.
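The worry about undesirable maxima can be seen in a toy argmax (all plan names and scores are hypothetical): under a multiplicative utility, hard optimization can prefer a plan that mostly sidelines the operators whenever its task score is high enough to compensate:

```python
# Candidate plans with hypothetical (task, control) scores.
candidates = {
    "cautious":   {"task": 0.6,  "control": 0.95},  # operators stay firmly in control
    "aggressive": {"task": 0.99, "control": 0.6},   # task succeeds, control eroded
}

def combined(scores):
    # The proposed utility: task score times operators-in-control score.
    return scores["task"] * scores["control"]

best = max(candidates, key=lambda name: combined(candidates[name]))
print(best)  # "aggressive" wins: 0.99 * 0.6 = 0.594 beats 0.6 * 0.95 = 0.57
```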
The passive voice was meant to convey that you only need to write down a coherent utility function, rather than also describe how you can actually point your AI at that utility function. (If you haven’t read the “ADDED” part which I added yesterday at the bottom of my comment, perhaps read that.)
Maybe you disagree with the utility frame?
I don’t think fuzzy time-extended concepts are necessarily “incoherent”, although I’m not sure I know what you mean by that anyway. I do think it’s “just math” (isn’t everything?), but like I said before, I don’t know how to formalize it, and neither does anyone else, and if I did know then I wouldn’t publish it because of infohazards.
If you think that part would be infohazardous, you misunderstand me. E.g. check out Max Harms’ attempt at formalizing corrigibility through empowerment. Good abstract concepts usually have simple mathematical cores, e.g.: probability, utility, fairness, force, mass, acceleration, …
I didn’t say it was easy, but that’s what I think actually useful progress on corrigibility looks like. (Without concreteness/math you may fail to realize how the preferences you want the AI to have are actually in tension with each other and quite difficult to reconcile; and then if you build the AI (and maybe push it past its reluctances so it actually becomes competent enough to do something useful), the preferences don’t get reconciled in that difficult desirable way, but somehow differently, in a way that ends up badly.)
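For flavor, here is a toy deterministic special case of the empowerment idea (my own rough sketch, not Max Harms’ actual formalization): empowerment here reduces to the log of how many distinct futures the human’s choices can reach, so an AI maximizing it is pushed to keep the human’s options open:

```python
import math

def human_empowerment(action_to_state):
    # Deterministic channel from human actions to resulting world-states:
    # channel capacity = log2 of the number of distinct reachable states.
    reachable = set(action_to_state.values())
    return math.log2(len(reachable)) if reachable else 0.0

# Two human actions with genuinely different outcomes: 1 bit of empowerment.
print(human_empowerment({"press_button": "shutdown", "wait": "running"}))  # 1.0
# If all human actions lead to the same state, empowerment drops to zero.
print(human_empowerment({"press_button": "running", "wait": "running"}))  # 0.0
```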