Like, it seems quite likely to me that if you take a predictive agent and then train it using an RL-like setup, where you roll out a whole chain of thought in order to predict the next token, this will pretty obviously result in a bunch of agentic capabilities developing. Indeed, after many roll-outs, a lot of the remaining loss will end up being located in tasks pretty similar to current RL training environments (like predicting solutions to programming problems, or doing complicated work to figure out reverse-hashing algorithms of various kinds).
I believe this!
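For concreteness, here's a minimal toy sketch of the kind of loop being described: roll out a chain of thought, score it by how well it predicts the actual next token, and reinforce the rollout accordingly. Everything in it (the tiny numpy "policy", the vocabulary size, the plain REINFORCE update with no baseline) is an illustrative assumption, not anyone's actual training setup.

```python
# Toy sketch (illustrative assumptions throughout): an RL-style loop where the
# model rolls out a chain of thought and is rewarded for how well the rollout
# predicts the observed next token.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8       # toy vocabulary size (assumption)
COT_LEN = 4     # length of the rolled-out chain of thought (assumption)
LR = 0.1

# "Policy" parameters: logits for each chain-of-thought step, plus a readout
# that maps the final chain-of-thought token to a next-token prediction.
cot_logits = np.zeros((COT_LEN, VOCAB))
readout = np.zeros((VOCAB, VOCAB))

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def rollout():
    """Sample a chain of thought, then predict the next token from its last step."""
    tokens = []
    for t in range(COT_LEN):
        p = softmax(cot_logits[t])
        tokens.append(rng.choice(VOCAB, p=p))
    pred_probs = softmax(readout[tokens[-1]])
    return tokens, pred_probs

def train_step(true_next_token):
    tokens, pred_probs = rollout()
    # Reward: log-likelihood the rollout assigns to the actual next token.
    reward = np.log(pred_probs[true_next_token] + 1e-9)
    # REINFORCE (no baseline, for brevity): reinforce chain-of-thought tokens
    # in proportion to how good the resulting prediction was.
    for t, tok in enumerate(tokens):
        grad = -softmax(cot_logits[t])
        grad[tok] += 1.0
        cot_logits[t] += LR * reward * grad
    # Supervised-style update on the readout, given the sampled chain of thought.
    grad_out = -pred_probs
    grad_out[true_next_token] += 1.0
    readout[tokens[-1]] += LR * grad_out
    return reward

for step in range(500):
    train_step(true_next_token=3)   # pretend the corpus always continues with token 3
```

The point is just the shape of the objective: the reward is prediction accuracy, but what gets reinforced is the intermediate chain of thought, which is where the agentic-looking behavior would live.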