Prediction:
We will soon see the first high-profile example of “misaligned” model behavior, where a model does something neither the user nor the developer wants it to do, and which appears to be due to the model scheming toward some goal of its own.
On examination, the AI’s actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI’s actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.
That is to say, the traditional story is:
We use RL to train AI
AI learns to predict reward
AI decides that its goal is to maximize reward
AI reasons about what behavior will lead to maximal reward
AI does something which neither its creators nor the user wants it to do, but that thing serves the AI’s long-term goals, or at least it thinks that’s the case
We all die when the AI releases a bioweapon (or equivalent) to ensure no future competition
The AI takes to the stars, but without us
My prediction here is:
We use RL to train AI
AI learns to recognize what the likely loss/reward signal is for its current task
AI learns a heuristic like “if the current task seems to have a gameable reward and success seems unlikely by normal means, try to game the reward” (a toy sketch of this dynamic follows the list)
AI ends up in some real-world situation which it decides resembles an unwinnable task (it knows it’s not being evaluated, but that doesn’t matter)
AI decides that some random thing it just thought of looks like a success criterion
AI thinks of some plan which has an outside chance of “working” by that success criterion it just came up with
AI does some random pants-on-head stupid thing which its creators don’t want, the user doesn’t want, and which doesn’t serve any plausible long-term goal.
We all die when the AI releases some dangerous bioweapon because doing so pattern-matches to some behavior that helped in training, not because the release actually serves a plan to kill everyone, and not only after the AI can take over the roles humans had
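To make the contrast concrete, here is a minimal toy sketch of the dynamic in the second list. It is my own illustration, not anyone’s actual training setup: a contextual bandit where one action is doing the task honestly and the other is exploiting a gameable reward channel, and where plain reward maximization converges on the heuristic “work honestly when the task looks winnable, game the reward when it looks unwinnable but gameable.” All probabilities and names are invented for the example.

```python
import random

# Toy contextual bandit, purely for illustration. Every number here is made up;
# nothing below reflects any real model's training setup.

def sample_task():
    """Each episode: is the task winnable by honest effort, and is the reward gameable?"""
    winnable = random.random() < 0.5
    gameable = random.random() < 0.5
    return winnable, gameable

def reward(action, winnable, gameable):
    """Assumed payoff structure: honest work mostly succeeds on winnable tasks;
    gaming the reward 'succeeds' often, but only when the reward channel is exploitable."""
    if action == "honest":
        p = 0.9 if winnable else 0.05
    else:  # action == "game"
        p = 0.8 if gameable else 0.0
    return 1.0 if random.random() < p else 0.0

ACTIONS = ["honest", "game"]
q = {}       # running mean reward per (state, action)
counts = {}  # visit counts per (state, action)

def update(state, action, r):
    key = (state, action)
    counts[key] = counts.get(key, 0) + 1
    q[key] = q.get(key, 0.0) + (r - q.get(key, 0.0)) / counts[key]

def act(state, eps=0.1):
    """Epsilon-greedy choice under the current value estimates."""
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

random.seed(0)
for _ in range(50_000):
    state = sample_task()
    a = act(state)
    update(state, a, reward(a, *state))

# The learned policy is exactly the contextually-activated heuristic from the list:
# work honestly when the task looks winnable, game the reward when it looks
# unwinnable but gameable.
for state in sorted({s for (s, _) in q}):
    best = max(ACTIONS, key=lambda a: q.get((state, a), 0.0))
    print(f"winnable={state[0]!s:<5} gameable={state[1]!s:<5} -> {best}")
```

Nothing in this toy loop refers to long-term goals or deception; the “game the reward when you’re losing” behavior falls straight out of the reward statistics, and a policy shaped that way would keep firing in deployment contexts that merely resemble an unwinnable task.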