cubefox comments on yrimon’s Shortform

cubefox 26 Jan 2025 10:59 UTC
4 points
2
Yes. Current reasoning models like DeepSeek-R1 rely on verified math and coding datasets to compute the reward signal for RL. Any improvement on other reasoning tasks, outside math and programming puzzles, is only a side effect. But in theory we don’t actually need strict verifiability for a reward signal, only your much weaker probabilistic condition. In the future, a model could check the quality of its own answers, at which point we would have a self-improving learning process that doesn’t need any external training data for RL.
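To make the point about weak verification concrete, here is a toy simulation. It assumes one particular reading of a "probabilistic condition", namely that the checker is merely right more often than chance; that reading, the function names, and all the numbers are my own illustration, not anything taken from DeepSeek-R1 or from your comment. It only shows that filtering by a noisy checker raises the share of good answers, which is the kind of signal RL could in principle exploit.

```python
import random


def noisy_check(is_good: bool, accuracy: float, rng: random.Random) -> bool:
    """Hypothetical checker: returns the true verdict with probability `accuracy`."""
    return is_good if rng.random() < accuracy else not is_good


def good_rate_after_filtering(base_rate: float, accuracy: float,
                              samples: int = 100_000, seed: int = 0) -> float:
    """Fraction of good answers among those the noisy checker accepts.

    base_rate: assumed probability that an unfiltered answer is good.
    accuracy:  assumed probability that the checker's verdict is correct.
    """
    rng = random.Random(seed)
    accepted = 0
    good_accepted = 0
    for _ in range(samples):
        is_good = rng.random() < base_rate
        if noisy_check(is_good, accuracy, rng):
            accepted += 1
            good_accepted += int(is_good)
    # Guard against the (very unlikely) case that nothing was accepted.
    return good_accepted / accepted if accepted else 0.0


if __name__ == "__main__":
    # A checker that is right 70% of the time enriches the accepted set.
    print(good_rate_after_filtering(base_rate=0.2, accuracy=0.7))
    # A coin-flip checker (50%) leaves the good rate where it started.
    print(good_rate_after_filtering(base_rate=0.2, accuracy=0.5))
```

By Bayes' rule the first case should land near 0.14 / (0.14 + 0.24) ≈ 0.37, well above the 0.2 base rate, while the coin-flip checker stays near 0.2. A self-checking model would of course have correlated errors between generator and checker, which this sketch ignores.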

And it is likely that such a probabilistic condition holds for many informal tasks. We know that checking a result is usually easier than coming up with it, even outside exact domains. E.g. it’s much easier to recognize a good piece of art than to create one. This seems to be a fundamental fact about computation. It is perhaps a generalization of the widely believed conjecture that P ≠ NP: that some problems whose solutions can be verified quickly (NP) cannot be solved quickly (P).
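For the exact-domain version of this asymmetry, subset sum is a standard example: checking a proposed subset takes one pass over it, while the naive way to find one tries up to 2^n - 1 subsets. The sketch below uses my own function names and a made-up instance. Brute force being slow does not prove that no fast solver exists (that is exactly the open P vs NP question), so this only illustrates the gap.

```python
from itertools import combinations


def verify(numbers, target, candidate):
    """Check a proposed solution: distinct, valid indices whose values sum to target."""
    return (
        len(set(candidate)) == len(candidate)
        and all(0 <= i < len(numbers) for i in candidate)
        and sum(numbers[i] for i in candidate) == target
    )


def solve(numbers, target):
    """Brute-force search over all non-empty index subsets, smallest first.

    Returns (solution_or_None, number_of_candidates_tried).
    """
    tried = 0
    for size in range(1, len(numbers) + 1):
        for candidate in combinations(range(len(numbers)), size):
            tried += 1
            if verify(numbers, target, candidate):
                return candidate, tried
    return None, tried


if __name__ == "__main__":
    numbers = [3, 34, 4, 12, 5, 2]  # illustrative instance
    solution, tried = solve(numbers, 9)
    print(solution, tried)
    # Verifying the found solution is a single cheap call.
    print(verify(numbers, 9, solution))
```

The solver calls `verify` once per candidate, so its cost is the verification cost multiplied by however many candidates it has to try; the verifier alone never pays that multiplier.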