MIRI’s old results argue that “corrigibility via uncertainty regarding the utility function” doesn’t work, because if the agent maximizes expected utility anyway, it doesn’t matter one whit whether we’re taking expectation over actions or over utility functions. However, the corrigibility-via-instrumental-goals does have the feel of “make the agent uncertain regarding what goals it will want to pursue next”. Is there, therefore, some way to implement something-like-this while avoiding MIRI’s counterexample?
Loophole: the counterexample works in the arithmetically-expected utility regime. What if we instead do it in the geometric one? I.e., have an agent take actions that maximize the geometrically-expected product of candidate utility functions? This is a more conservative/egalitarian regime: any one utility function flipping to negative or going to zero wipes out all value, unlike with sums (which are more tolerant of ignoring/pessimizing some terms, and can have “utility monsters”). So it might make the agent genuinely hesitant to introduce potentially destructive changes to its environment...
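A toy sketch of the difference (this is my own illustrative setup, not anything from MIRI's results — the utility values and action names are made up): under arithmetic aggregation, an action that pessimizes one candidate utility can still win if it does well enough on another, whereas under geometric aggregation a near-zero score on any one candidate drags the whole product toward zero.

```python
import math

# Hypothetical example: two candidate utility functions scored over two actions.
# "risky" nearly zeroes out one candidate utility to do great on the other;
# "safe" keeps both moderately satisfied.
candidate_utilities = {
    "safe":  [0.6, 0.6],
    "risky": [0.1, 1.2],
}

def arithmetic_score(utils):
    # Standard expected-utility-style aggregation: an average over candidates.
    return sum(utils) / len(utils)

def geometric_score(utils):
    # Geometric aggregation: the n-th root of the product of candidates.
    # Any single utility near zero pulls the whole score toward zero.
    return math.prod(utils) ** (1 / len(utils))

best_arithmetic = max(candidate_utilities,
                      key=lambda a: arithmetic_score(candidate_utilities[a]))
best_geometric = max(candidate_utilities,
                     key=lambda a: geometric_score(candidate_utilities[a]))

print(best_arithmetic)  # "risky": 0.65 average beats "safe"'s 0.6
print(best_geometric)   # "safe": sqrt(0.12) ≈ 0.35 makes "risky" lose
```

The arithmetic maximizer happily sacrifices one candidate utility for a bigger total, while the geometric maximizer picks the action that keeps every candidate alive — which is the "hesitant to destroy things some utility function might care about" flavor gestured at above.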
(This is a very quick take and it potentially completely misunderstands the concepts involved. But I figure it’s better to post than not, in case the connection turns out obvious to anyone else.)
However, the corrigibility-via-instrumental-goals does have the feel of “make the agent uncertain regarding what goals it will want to pursue next”.
That’s an element, but not the central piece. The central piece (in the subagents frame) is about acting-as-though there are other subagents in the environment which are also working toward your terminal goal, so you want to avoid messing them up.
The “uncertainty regarding the utility function” enters here mainly when we invoke instrumental convergence, in hopes that the subagent can “act as though other subagents are also working toward its terminal goal” in a way agnostic to its terminal goal. Which is a very different role from the one it plays in the old “corrigibility via uncertainty” proposals.
Hmm. This does have the feel of gesturing at something important, but I don’t see it clearly yet...
Free association: geometric rationality.