Instead, evidence suggests that they are trying to complete each instruction—they “want” to just do whatever the instructions given them were
I disagree with this: in Appendix M of https://www.arxiv.org/abs/2509.15541 (for o3) and Appendix B.6 of https://arxiv.org/abs/2412.04984 (for Sonnet 3.5), we systematically run ablations specifically to show that the explanation needs to incorporate beyond-episode preferences, i.e. that instruction following / being confused / etc. isn’t sufficient. (If there are additional ablations you’d find convincing, I’d be very interested to know and could run them! I had run quite a few more in anticipation of this objection coming up, for example showing that the models will sacrifice in-episode reward.)
concerning themselves extremely weakly with things outside of the specific instructions
Do you think they’ll increasingly have longer-horizon revealed preferences as they’re trained to work over longer horizons? I would find it surprising if models don’t learn useful heuristics and tendencies. A model that’s taking on tasks spanning multiple weeks really does need to be concerned about longer-horizon things.
But I also find this to be a relatively implausible future
This was really helpful! I think this is a crux that explains a lot of where our models differ here. I agree that this “single fresh rollout” concept becomes much more important if no one figures out continual learning; however, that feels unlikely given that labs are actively and openly working on it (which doesn’t mean it’ll be production-ready in the next few months or anything, but it seems very implausible to me that something functionally like it is somehow 5 years away or similarly difficult).