Prediction:
We will soon see the first high-profile example of “misaligned” model behavior, where a model does something neither the user nor the developer wants it to do, and which appears to be due to the model scheming toward some goal of its own.
On examination, the AI’s actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI’s actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
That is to say, the traditional story is:
We use RL to train AI
AI learns to predict reward
AI decides that its goal is to maximize reward
AI reasons about what behavior will lead to maximal reward
AI does something which neither its creators nor the user wants it to do, but that thing serves the AI’s long-term goals, or at least it thinks that’s the case
We all die when the AI releases a bioweapon (or equivalent) to ensure no future competition
The AI takes to the stars, but without us
My prediction here is:
We use RL to train AI
AI learns to recognize what the likely loss/reward signal is for its current task
AI learns a heuristic like “if the current task seems to have a gameable reward and success seems unlikely by normal means, try to game the reward” (a toy sketch of this dynamic follows the list)
AI ends up in some real-world situation which it decides resembles an unwinnable task (it knows it’s not being evaluated, but that doesn’t matter)
AI decides that some random thing it just thought of looks like a success criterion
AI thinks of some plan which has an outside chance of “working” by that success criterion it just came up with
AI does some random pants-on-head stupid thing which its creators don’t want, the user doesn’t want, and which doesn’t serve any plausible long-term goal.
We all die when the AI releases some dangerous bioweapon because doing so pattern-matches to some behavior that helped in training, not because the release actually serves a plan to kill everyone, and not only after the AI can take over the roles humans had
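To make the contrast concrete, here is a minimal toy sketch of the dynamic in the second list. It is my own illustration, not anyone’s actual training setup: a contextual bandit where one action is doing the task honestly and the other is exploiting a gameable reward channel, and where plain reward maximization converges on the heuristic “work honestly when the task looks winnable, game the reward when it looks unwinnable but gameable.” All probabilities and names are invented for the example.

```python
import random

# Toy contextual bandit, purely for illustration. Every number here is made up;
# nothing below reflects any real model's training setup.

def sample_task():
    """Each episode: is the task winnable by honest effort, and is the reward gameable?"""
    winnable = random.random() < 0.5
    gameable = random.random() < 0.5
    return winnable, gameable

def reward(action, winnable, gameable):
    """Assumed payoff structure: honest work mostly succeeds on winnable tasks;
    gaming the reward 'succeeds' often, but only when the reward channel is exploitable."""
    if action == "honest":
        p = 0.9 if winnable else 0.05
    else:  # action == "game"
        p = 0.8 if gameable else 0.0
    return 1.0 if random.random() < p else 0.0

ACTIONS = ["honest", "game"]
q = {}       # running mean reward per (state, action)
counts = {}  # visit counts per (state, action)

def update(state, action, r):
    key = (state, action)
    counts[key] = counts.get(key, 0) + 1
    q[key] = q.get(key, 0.0) + (r - q.get(key, 0.0)) / counts[key]

def act(state, eps=0.1):
    """Epsilon-greedy choice under the current value estimates."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

random.seed(0)
for _ in range(50_000):
    state = sample_task()
    a = act(state)
    update(state, a, reward(a, *state))

# The learned policy is exactly the contextually-activated heuristic from the list:
# work honestly when the task looks winnable, game the reward when it looks
# unwinnable but gameable.
for state in sorted({s for (s, _) in q}):
    best = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
    print(f"winnable={state[0]!s:<5} gameable={state[1]!s:<5} -> {best}")
```

Nothing in this toy loop refers to long-term goals or deception; the “game the reward when you’re losing” behavior falls straight out of the reward statistics, and a policy shaped that way would keep firing in deployment contexts that merely resemble an unwinnable task.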