My prediction: Mode 1 dominates the “incoherence” the paper measures, especially at longer reasoning traces. The scaling relationship they found (more reasoning steps, more incoherent errors) is exactly what you’d expect if attention decay is the primary driver. Modes 2 and 3 should be roughly constant with context length, since they’re about behavioral gaps and value calibration, not information routing.
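One rough way to check this prediction, assuming you already have episodes labeled by failure mode (by hand or by a grader): bucket episodes by reasoning-trace length, compute each mode's failure rate per bucket, and fit a slope. If the prediction holds, the Mode 1 slope should be clearly positive and the Mode 2 and 3 slopes near zero. Everything here, including the function names, the `(trace_length, mode)` record format, and the toy data, is hypothetical and not taken from the paper.

```python
from collections import defaultdict
from statistics import mean

MODES = (1, 2, 3)

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    x_bar, y_bar = mean(xs), mean(ys)
    den = sum((x - x_bar) ** 2 for x in xs)
    if den == 0:
        raise ValueError("need at least two distinct trace lengths")
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / den

def failure_rate_slopes(episodes):
    """Slope of per-mode failure rate against trace length.

    episodes: iterable of (trace_length, mode), where mode is 1, 2, 3,
    or None for an episode with no labeled failure (assumed format).
    """
    totals = defaultdict(int)   # episodes per trace length
    fails = defaultdict(int)    # failures per (trace length, mode)
    for length, mode in episodes:
        totals[length] += 1
        if mode is not None:
            fails[(length, mode)] += 1
    lengths = sorted(totals)
    return {
        m: ols_slope(lengths, [fails[(n, m)] / totals[n] for n in lengths])
        for m in MODES
    }

# Toy data (made up): 10 episodes at each length; Mode 1 failures grow
# with length, Modes 2 and 3 stay at one failure per bucket.
toy = []
for i, n in enumerate((10, 20, 30, 40), start=1):
    labels = [1] * i + [2] + [3]
    labels += [None] * (10 - len(labels))
    toy.extend((n, label) for label in labels)

slopes = failure_rate_slopes(toy)
```

A linear fit is only a first pass; attention decay could just as easily produce a threshold or superlinear curve, and mode labels from a grader will be noisy.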
I think most of the relatively few people working on AI Safety are confident that the many people working on AI capabilities will solve your mode 1 (and are actually hoping it takes them a while, to give us more time to work on more safety-specific issues). So their concentration on modes 2 and 3 is a reasoned one. But your critique that people are spending too much time on mode 3 and not enough on mode 2 is a valid one. Mode 2 could be a simple capabilities issue, but it’s also a “due care and attention” issue: when humans fail this way egregiously, we call it recklessness, consider it a character flaw, and in sufficiently serious cases treat it as grounds for suing over the outcome.
Technically, this is a case of “Rogue AI takes out Western Civilization’s Financial Infrastructure”, but it’s also “Idiots Are Excitedly Using Idiot AI to Ship Bugs to Prod, more at 11”[3], and the latter situation is serious, worsening, and deserving of attention!
I think it’s going to get attention. I am hopeful that market forces will, sooner or later, cause all the people working on capabilities to start paying attention to safety, at least at this fairly simple level. When a bot deletes all the email off a Meta safety lead’s laptop, it tends to concentrate the mind. In short, I expect LOTS of near misses, of escalating size and irony, until people eventually get a bit more cautious and take the problem seriously.
I think I’m going to try to formalize modes 1 and 2 a bit, write some evals (possibly a generative eval suite, or a new method for scoring existing eval suites), and post that soon. Very hard to parametrize over OOD environments, though...
Have you seen any good probability supports for this kind of thing? Like… counterfactual agent trajectories and unexplored areas of the environment?
Eval performance feels weak, but it’s probably what we’ve got.
I think mode 1 is actually to some extent also a tunable explore/exploit preference (in reasoning space) that may stumble egregiously in OOD environments.
Like it’s not just forgetting, it’s “was I paranoid enough to ruminate about this unlikely connection between fact 1 and fact 2 that eventually led to me nuking prod”.