This part’s useful; it does update me somewhat. In my own words: sometimes people do things which are obviously-even-in-the-moment not in line with their values. Therefore the thing hooked up directly to moment-to-moment decision making must not be the values themselves, and can’t be the estimated values either (since if it were the estimated values, the divergence of behavior from values would not be obvious in the moment).
Previously my model was that it’s the estimated values which are hooked up directly to decision making (though I incorrectly stated above that it was values), but I think I’ve now tentatively updated away from that. Thanks!
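To make the three-way distinction concrete, here’s a toy sketch (entirely my own illustration, with made-up names and numbers, not a claim about the actual mechanism): moment-to-moment action selection runs off something like cached habit strengths, while a separate “estimated values” module scores the same options, so the agent can see the divergence in the moment.

```python
# Toy sketch, purely illustrative: the action selector is hooked up to habit
# strengths, not to estimated values, so behavior can visibly diverge from
# estimated values in the moment.

estimated_values = {"work on the report": 10.0, "scroll the feed": -3.0}
habit_strengths  = {"work on the report": 0.2,  "scroll the feed": 5.0}

def choose(options):
    # Decision-making reads habit strengths, not estimated values.
    return max(options, key=lambda o: habit_strengths[o])

options = list(estimated_values)
action = choose(options)
best_by_values = max(options, key=lambda o: estimated_values[o])

print(f"chose: {action!r}")
if action != best_by_values:
    # The divergence is obvious "in the moment": the agent can see that the
    # action it's taking isn't the one its estimated values favor.
    print(f"...while estimating that {best_by_values!r} would be better")
```

If the estimated values were what drove the choice in this sketch, the two lines could never disagree, which is the observation that updated me.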
If your proposal is that some latent variable in the world-model gets a flag meaning “this latent variable is the Value Function”… then how does that flag wind up attached to that latent variable, rather than to some other latent variable? What if the world-model lacks any latent variable that looks like what the value function is supposed to look like?
I have lots of question marks around that part still. My current best guess is that the value-modeling part of the world model has a bunch of hardcoded structure to it (the details of which I don’t yet know), so it’s a lot less ontologically flexible than most of the human world model. That said, it would generally operate by estimating value-assignments to stuff in the underlying epistemic world-model, so it should at least be flexible enough to handle epistemic ontology shifts (even if the value-ontology on top is more rigid).
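Here’s a rough sketch of the guess above, again just my own illustration with invented names: the value-model machinery stays fixed (it always assigns scores to latents exposed by the epistemic world model), but the assignments get re-estimated when the epistemic ontology shifts, e.g. when one latent concept splits into two.

```python
# Toy sketch: a rigid value-ontology (score the epistemic model's latents)
# sitting on top of a flexible epistemic ontology that can be refactored.

# Epistemic world model, version 1: one latent concept the value model scores.
value_assignments = {"water": 1.0}

# Epistemic ontology shift: "water" splits into two finer-grained latents.
refinement = {"water": ["H2O", "XYZ"]}

def reestimate(values, mapping):
    # The hardcoded value-model machinery doesn't change; it just re-estimates
    # value assignments over the new latents via the old-to-new mapping.
    new_values = {}
    for old, score in values.items():
        for new in mapping.get(old, [old]):
            new_values[new] = score  # naive carry-over; real re-estimation would be richer
    return new_values

value_assignments = reestimate(value_assignments, refinement)
print(value_assignments)  # {'H2O': 1.0, 'XYZ': 1.0}
```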
IMO the world model is in the cortex, the value function is in the striatum.
Minor flag which might turn out to matter: I don’t think the value function is in the mind at all; the mind contains an estimate of the value function, but the “true” value function (insofar as it exists at all) is projected. (And to be clear, I don’t mean anything unusually mysterious by this; it’s just the ordinary way latent variables work. If a variable is latent, then the mind doesn’t know the variable’s “true” value at all, and the variable itself is projected in some sense, i.e. there may not be any “true” value of the variable out in the environment at all.)
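A minimal sketch of that ordinary latent-variable point (nothing here is specific to value functions; all the numbers and names are invented): the mind only ever contains an estimate of the latent, and the “true” value is a posit of the model rather than something stored anywhere.

```python
# Toy sketch: the only thing "in" the mind is an estimate over the latent;
# no variable anywhere holds the latent's "true" value, and whether such a
# value exists out in the environment at all is a further question.

from statistics import mean

observations = [0.9, 1.1, 1.0, 0.8]  # noisy evidence about some latent quantity

posterior_mean = mean(observations)               # the mind's estimate
posterior_spread = max(observations) - min(observations)

print(f"estimate of latent: {posterior_mean:.2f} (+/- ~{posterior_spread / 2:.2f})")
# The model posits a "true" value to organize the observations, the same way
# one might posit a "true" value function behind the mind's estimate of it.
```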