I think I’m going to try to formalize modes 1 and 2 a bit, write some evals/possibly a generative eval suite or a new method for scoring existing eval suites, and post that soon. Very hard to parametrize over OOD environments, though...
Have you seen any good probability supports for this kind of thing? Like… counterfactual agent trajectories and unexplored areas of the environment?
Eval performance feels like a weak signal, but it’s probably all we’ve got
I think mode 1 is also, to some extent, a tunable explore/exploit preference (in reasoning space) that may stumble egregiously in OOD environments.
Like it’s not just forgetting, it’s “was I paranoid enough to ruminate about this unlikely connection between fact 1 and fact 2 that eventually led to me nuking prod”.
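Roughly what I mean by the tunable preference, as a toy sketch (all names and the scoring setup are made up, just making the knob concrete):

```python
import random


def choose_reasoning_branch(branches, scores, explore_rate, rng=None):
    """Pick a reasoning branch to pursue.

    With probability `explore_rate`, "ruminate": follow an unlikely branch
    instead of the top-scored one. explore_rate=0 is pure exploit;
    explore_rate=1 is maximally paranoid. Hypothetical sketch, not a real API.
    """
    rng = rng or random.Random()
    # Rank branches by plausibility score, highest first.
    ranked = sorted(zip(scores, branches), reverse=True)
    if rng.random() < explore_rate and len(ranked) > 1:
        # Explore: chase one of the less plausible connections
        # (the "fact 1 + fact 2 -> nuked prod" kind of hunch).
        return rng.choice(ranked[1:])[1]
    # Exploit: follow the most plausible line of reasoning.
    return ranked[0][1]
```

The failure mode above is then just this knob being mis-set for the environment: an OOD setting where the tail branches actually matter punishes a low `explore_rate` hard.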