In particular: the motivations that matter most for safe instruction-following are not the AI’s long-term consequentialist motivations (indeed, if possible, I think we mostly want to avoid our AIs having this kind of motivation except insofar as it is implied by safe instruction-following).
That seems like a reasonable position, given that you acknowledge the risk from long-term motivations. But it doesn’t seem to be what people are actually aiming for. In particular, people seem to be aiming for agentic AI that can act on a person’s behalf over longer time scales. And METR’s trend predictions seem to point to longer horizons soon.
I have already remarked in another comment that a short time horizon failed to prevent GPT-4o(!) from getting its user to post messages into the wilderness. Does this mean that some kind of long-term motive has already appeared?