The Facebook bots aren’t doing R1 or o1 reasoning about the context before making an optimal reinforcement-learned post. It’s probably just bandits, or humans making a trash-producing algorithm that works and letting it loose.
Agreed that I should try Reddit first. And I think there should be ways to guide an LLM toward the reward signal of “write good posts” before starting the RL, though I didn’t find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (Concretely: I searched DPO’s citations for “vote”. Lots of results, though none of them have many citations.)
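One straightforward (if naive) way to bridge vote counts and a reward-model-free loss is to treat the higher-voted of two posts as the “chosen” response and feed the pair to the standard DPO objective. This is my own sketch, not an established technique from those citations; the function names and the tie-handling are assumptions, and real DPO was derived for pairwise human preferences rather than raw vote counts:

```python
import math

def dpo_loss_from_votes(votes_a, votes_b,
                        logp_a, logp_b,          # policy log-probs of each post
                        ref_logp_a, ref_logp_b,  # frozen reference-model log-probs
                        beta=0.1):
    """Sketch: induce a DPO preference pair from vote counts.

    The higher-voted post is treated as the 'chosen' response w,
    the other as 'rejected' l, then we apply the usual DPO loss:
        -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
    """
    if votes_a == votes_b:
        return 0.0  # a tie carries no preference signal (assumption: skip it)
    if votes_a > votes_b:
        logp_w, ref_w, logp_l, ref_l = logp_a, ref_logp_a, logp_b, ref_logp_b
    else:
        logp_w, ref_w, logp_l, ref_l = logp_b, ref_logp_b, logp_a, ref_logp_a
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Example: post A got 10 votes vs. 2, and the policy already prefers A
loss = dpo_loss_from_votes(10, 2, -5.0, -6.0, -5.5, -5.5)
```

An obvious weakness is that this throws away the vote margin (10 vs. 9 counts the same as 10 vs. 2); weighting pairs by that margin would be one way to keep more of the signal.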