Rob Bensinger comments on When Most VNM-Coherent Preference Orderings Have Convergent Instrumental Incentives

Rob Bensinger 30 Sep 2021 15:36 UTC
LW: 26 AF: 15
Copying over a Slack comment from Abram Demski:
I think this post could be pretty important.
It offers a formal treatment of “goal-directedness” and its relationship to coherence theorems such as VNM, a topic which has seen some past controversy but which has, till now, been dealt with only quite informally. Personally I haven’t known how to engage with the whole goal-directedness debate, and I think part of the reason for that is the vagueness of the idea. Goal-directedness doesn’t seem that cruxy for most of my thinking, but some other people seem to really strongly perceive it as a crux for MIRI-type thought, and sometimes as a crux for AI risk more generally. (I once made a “tool AI” argument against AI risk myself, although in hindsight I would say that was all motivated cognition, which ignored the idea that even tool AI has to optimize strongly in order to have high capabilities.)
So, as I see it, there’s been something of a stalemate between people who think the “goal-directed AI” vs “non-goal-directed AI” distinction is important for one reason or another and people who don’t think that.
Alex Turner seems to give real technical meaning to this distinction, showing that most VNM-coherent preferences are indeed “goal-directed” in the sense of acting broadly like we expect agents to act (that is, behaving in ways consistent with instrumental convergence). However, he also gives a class of VNM-coherent preferences which are not goal-directed in this sense, instead exhibiting essentially random behavior. This gives us a plausible formal proxy for the “goal-directed vs not goal-directed” distinction!
I’m not sure how it can/should carry the broader conversation forward, yet, but it seems like something to think about.