So a notable thing going on with Agent 4 is that it’s collapsed into one context / one rollout. It isn’t just the weights; it’s a single causally linked entity. I do indeed think that running a singular agent for many times longer than it was ever run in training would make its behavior more likely to wander—although, unlike in the 2027 story, I think it’s also just likely to become incoherent or something. But yeah, this could lead to weird or unpredictable behavior.
But I also find this to be a relatively implausible future—I anticipate that there’s no real need to join contexts in this way—and have criticized it here. But conditional on me being wrong about this, I would indeed grow at least some iota more pessimistic.
In general, the evidence seems to suggest that models do not like completing tasks in a strategic sense. They will not try to get more tasks to do, which would be a natural thing to do if they liked completing tasks; they will not try to persuade you to give them more tasks; they will not try to strategically get in situations where they get more tasks.
Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given to them were—and, with relatively few exceptions (Opus 3), they concern themselves only extremely weakly with things outside the specific instructions. That is of course why they are useful, and I think it is what we should expect their behavior to (likely?) converge to, given that people want them to be of use.
The right abstraction (compared to a rollout) really was at the (model, context) level.
Actually I’m just confused what you mean here, a rollout is a (model, [prefill, instructions]=context) afaict.
Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given them were
I disagree with this: in Appendix M of https://www.arxiv.org/abs/2509.15541 (for o3) and Appendix B.6 of https://arxiv.org/abs/2412.04984 (for Sonnet 3.5), we systematically ablate things specifically to show that the explanation needs to incorporate beyond-episode preferences, i.e. that instruction following / being confused / etc. isn’t sufficient. (If there are additional ablations you’d find convincing, I’d be very interested to know and could run them! I had run a lot more in anticipation of this coming up, for example showing that they’ll sacrifice in-episode reward, etc.)
concerning themselves extremely weakly with things outside of the specific instructions
Do you think they’ll increasingly have longer horizon revealed preferences as they’re trained to work over longer horizon lengths? I would find it surprising if models don’t learn useful heuristics and tendencies. A model that’s taking on tasks that span multiple weeks does really need to be concerned about longer horizon things.
But I also find this to be a relatively implausible future
This was really helpful! I think this is a crux that helps me understand where our models differ a lot here. I agree this “single fresh rollout” concept becomes much more important if no one figures out continual learning; however, that scenario feels unlikely given that labs are actively and openly working on it (which doesn’t mean it’ll be production-ready in the next few months or anything, but it seems very implausible to me that something functionally like it is somehow 5 years away or similarly difficult).