OK I’m more confused by your model than I thought.
There should be some part of your framework that’s hooked up to actual decision-making—some ingredient for which “I do things iff this ingredient has a high score” is tautological (cf. my section 1.5.3, “‘We do things exactly when they’re positive-valence’ should feel almost tautological”). IIUC that’s “value function” in your framework. (Right?)
If your proposal is that some latent variable in the world-model gets a flag meaning “this latent variable is the Value Function”, thus hooking that latent variable up to decision-making in a mechanical, tautological way, then how does that flag wind up attached to that latent variable, rather than to some other latent variable? What if the world-model lacks any latent variable that looks like what the value function is supposed to look like?
~~
My proposal (and I think LeCun’s and certainly AlphaZero’s) is instead: the true “value function” is not part of the world model. Not mathematically, not neuroanatomically—IMO the world model is in the cortex, the value function is in the striatum. (The reward function is not part of the world model either, but I guess you already agree about that.)
…However, the world-model might wind up incorporating a model of the value function and reward function, just like the world-model might wind up incorporating a model of any other salient aspect of my world and myself. It won’t necessarily form such a model—the world-model inside simple fish brains probably doesn’t have a model of the value function, and ditto for sufficiently young human children. But for human adults, sure. If so, the representation of my value function in my world model is not my actual value function, just as the representation of my arm in my world model is not my actual arm.
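To make the separation concrete, here is a minimal sketch (every class, function, and number below is a hypothetical stand-in, not something taken from either framework): the world model predicts consequences, a separate value function scores them, the decision rule tautologically picks whichever candidate scores highest, and the world model can additionally carry a model of the value function that plays no direct role in the decision.

```python
# Hypothetical sketch of "the value function is not part of the world model".
# All names and dynamics here are illustrative stand-ins, not a proposed brain model.

class WorldModel:
    """Cortex-like module: predicts consequences of actions.
    It may also contain a learned *model of* the value function, but that model
    is not what decisions run on (just as the modeled arm is not the arm)."""

    def predict_next_state(self, state, action):
        # Stand-in dynamics; a real system would use a learned predictor.
        return tuple(s + a for s, a in zip(state, action))

    def modeled_value(self, state):
        # The agent's *beliefs about* its own values: possibly absent, crude, or wrong.
        return 0.0


class ValueFunction:
    """Striatum-like module: the thing actually hooked up to decision-making."""

    def __call__(self, state):
        return -sum(x * x for x in state)  # stand-in: prefers states near the origin


def choose_action(state, world_model, value_function, candidate_actions):
    # Tautological decision rule: the action taken is exactly the one whose
    # predicted outcome scores highest under the true value function.
    return max(
        candidate_actions,
        key=lambda a: value_function(world_model.predict_next_state(state, a)),
    )


if __name__ == "__main__":
    wm, vf = WorldModel(), ValueFunction()
    state = (3.0, -2.0)
    actions = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
    print(choose_action(state, wm, vf, actions))  # (-1.0, 0.0): moves toward the origin
```

The point of the sketch is that `WorldModel.modeled_value` can be missing, crude, or wrong without changing what the agent actually does; only `ValueFunction` is wired into `choose_action`.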
If you think that my proposal here is inconsistent with the fact that I don’t want to do heroin right now, then I disagree, and I’m happy to explain why.
This part’s useful; it does update me somewhat. In my own words: sometimes people do things which are obviously, even in the moment, not in line with their values. Therefore the thing hooked up directly to moment-to-moment decision-making must not be values themselves, and can’t be estimated values either (since if it were estimated values, the divergence of behavior from values would not be obvious in the moment).
Previously my model was that it’s the estimated values which are hooked up directly to decision-making (though I incorrectly stated above that it was values), but I think I’ve now tentatively updated away from that. Thanks!
If your proposal is that some latent variable in the world-model gets a flag meaning “this latent variable is the Value Function”… then how does that flag wind up attached to that latent variable, rather than to some other latent variable? What if the world-model lacks any latent variable that looks like what the value function is supposed to look like?
I have lots of question marks around that part still. My current best guess is that the value-modeling part of the world model has a bunch of hardcoded structure to it (the details of which I don’t yet know), so it’s a lot less ontologically flexible than most of the human world model. That said, it would generally operate by estimating value-assignments to stuff in the underlying epistemic world-model, so it should at least be flexible enough to handle epistemic ontology shifts (even if the value-ontology on top is more rigid).
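As a toy illustration of that guess (the concepts, overlap weights, and inheritance rule below are all made up for illustration, not part of the model being described): value assignments attach to whatever concepts the epistemic model currently uses, so when the epistemic ontology gets re-carved, the assignments can be re-estimated over the new concepts, for example by inheriting from the old concepts they overlap.

```python
# Toy sketch of a value-model surviving an epistemic ontology shift.
# Concepts, weights, and the inheritance rule are invented for illustration only.

def reestimate_values(old_values, overlap):
    """old_values: {old_concept: estimated value}
    overlap: {new_concept: {old_concept: weight}}, weights summing to 1 per new
    concept, describing how each new concept relates to the old carving.
    Returns provisional value estimates over the new ontology."""
    return {
        new: sum(w * old_values[old] for old, w in olds.items())
        for new, olds in overlap.items()
    }

old_values = {"berry": 1.0, "toadstool": -1.0}
# Epistemic shift: "berry" gets split into edible and poisonous kinds.
overlap = {
    "edible_berry": {"berry": 1.0},
    "poison_berry": {"berry": 0.5, "toadstool": 0.5},
    "toadstool": {"toadstool": 1.0},
}
print(reestimate_values(old_values, overlap))
# {'edible_berry': 1.0, 'poison_berry': 0.0, 'toadstool': -1.0}
```

The rigid, hardcoded part here would be the rule itself (values attach to concepts and get carried across re-carvings); the flexible part is which concepts the epistemic model happens to contain.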
IMO the world model is in the cortex, the value function is in the striatum.
Minor flag which might turn out to matter: I don’t think the value function is in the mind at all; the mind contains an estimate of the value function, but the “true” value function (insofar as it exists at all) is projected. (And to be clear, I don’t mean anything unusually mysterious by this; it’s just the ordinary way latent variables work. If a variable is latent, then the mind doesn’t know the variable’s “true” value at all, and the variable itself is projected in some sense, i.e. there may not be any “true” value of the variable out in the environment at all.)
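A toy numerical picture of that latent-variable point (the prior, noise assumptions, and numbers are all illustrative): the mind maintains an estimate of “the value of X” from noisy valence readings, and the quantity being estimated is a parameter of the mind’s own model rather than something guaranteed to exist out in the environment.

```python
# Toy sketch: the estimate lives in the mind; the "true" latent value it targets
# is a modeling construct. Prior, data, and noise assumptions are illustrative only.

def posterior_mean_value(observed_valences, prior_mean=0.0, prior_strength=1.0):
    """Shrink the sample mean of noisy valence readings toward a prior
    (a normal-prior, known-noise shortcut). The return value is the agent's
    estimate; the latent 'true value' it estimates exists only inside the model."""
    n = len(observed_valences)
    sample_mean = sum(observed_valences) / n if n else prior_mean
    return (prior_strength * prior_mean + n * sample_mean) / (prior_strength + n)

# A few noisy valence readings from interacting with some object:
print(posterior_mean_value([0.9, 1.1, 0.7]))  # 0.675: an estimate, not a fact found in the world
```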