One possible way things could go is that models behave like human drug addicts: they don’t crave reward until they gain the ability to manipulate it easily and directly, but as soon as they do, they lose all their other motivations and values and essentially become misaligned. In this world we might see:
Companies put lots of effort into keeping models from nascent reward hacking, because early reward hacking behavior quickly escalates into more efficient reward hacking and, eventually, a reward-addicted model
If a model becomes reward-addicted and the company can’t control it, they might have to revert to an earlier checkpoint, at a cost of hundreds of millions of dollars
If companies can control reward-addicted models, we might get some labs that try to keep models aligned, both for safety and to avoid the need for secure sandboxes, and others that develop misaligned models and invest heavily in secure sandboxes
In this world it’s more likely that we can tell aligned models apart from reward-addicted ones, because the two show behavioral differences out of distribution