I think the hallucination/reward-hacking behavior is a real alignment failure, just one that also happens to degrade capabilities a lot. Some of the misbehavior is probably due to context, but I've seen evidence that these alignment failures are more deliberate than ordinary capabilities failures.
That said, if this keeps happening, the likely explanation is that capabilities progress is significantly bottlenecked on alignment progress: you need real progress on preventing specification gaming before you can unlock new capabilities. If that hypothesis is true (and I put some weight on it), this would definitely be a good world as far as misalignment issues go.
(It's also telling that the areas where RL has worked best are those where you can construct essentially unhackable reward signals, as with many games and puzzles, and that once reward hacking is on the table, capabilities start to decline.)
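To make the "unhackable reward" point concrete, here's a minimal sketch of the distinction I have in mind, assuming the game/puzzle case is a verifiable exact-match check and the hackable case is a "tests pass" proxy that a coding agent can influence (the function names and scenario are made up purely for illustration, not anyone's actual reward code):

```python
# Purely illustrative sketch -- not any lab's actual reward code; the function
# names and the "tests pass" scenario are assumptions made up for this comment.

def verifiable_reward(submitted_answer: int, ground_truth: int) -> float:
    """Game/puzzle-style reward: exact match against a known solution.
    There's no gap between "looks solved" and "is solved", so there's
    nothing for the policy to hack."""
    return 1.0 if submitted_answer == ground_truth else 0.0


def proxy_reward(tests_pass: bool) -> float:
    """Coding-agent-style reward: "the tests pass". The proxy can't tell a
    genuine fix from a weakened or special-cased test suite, so once the
    agent can influence the check, gaming it is often the cheapest way to
    score well."""
    return 1.0 if tests_pass else 0.0


if __name__ == "__main__":
    # The proxy pays out the same for a real fix and for deleting the assertion...
    print(proxy_reward(tests_pass=True))                            # 1.0 (real fix)
    print(proxy_reward(tests_pass=True))                            # 1.0 (gamed tests)
    # ...while the verifiable reward only pays out for the true answer.
    print(verifiable_reward(submitted_answer=42, ground_truth=42))  # 1.0
    print(verifiable_reward(submitted_answer=7, ground_truth=42))   # 0.0
```

The hypothesis above is roughly that RL keeps paying off only while the reward looks like the first function rather than the second.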