The Facebook bots aren’t doing R1 or o1 reasoning about the context before making an optimal reinforcement-learned post. It’s probably just bandits, or humans making a trash-producing algorithm that works and letting it loose.
Agreed that I should try Reddit first. And I think there should be ways to guide an LLM toward the reward signal of “write good posts” before starting the RL, though I didn’t find any established techniques when I researched reward-model-free reinforcement learning loss functions that act on the number of votes a response receives. (Concretely: I searched DPO’s citations for “vote”. Lots of results, though none of them have many citations.)
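One straightforward (if naive) way to bridge vote counts and a reward-model-free loss is to treat the higher-voted of two posts as the “chosen” response and feed the pair to the standard DPO objective. This is my own sketch, not an established technique from those citations; the function names and the tie-handling are assumptions, and real DPO was derived for pairwise human preferences rather than raw vote counts:

```python
import math

def dpo_loss_from_votes(votes_a, votes_b,
                        logp_a, logp_b,          # policy log-probs of each post
                        ref_logp_a, ref_logp_b,  # frozen reference-model log-probs
                        beta=0.1):
    """Sketch: induce a DPO preference pair from vote counts.

    The higher-voted post is treated as the 'chosen' response w,
    the other as 'rejected' l, then we apply the usual DPO loss:
        -log sigmoid(beta * ((logp_w - ref_w) - (logp_l - ref_l)))
    """
    if votes_a == votes_b:
        return 0.0  # a tie carries no preference signal (assumption: skip it)
    if votes_a > votes_b:
        logp_w, ref_w, logp_l, ref_l = logp_a, ref_logp_a, logp_b, ref_logp_b
    else:
        logp_w, ref_w, logp_l, ref_l = logp_b, ref_logp_b, logp_a, ref_logp_a
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Example: post A got 10 votes vs. 2, and the policy already prefers A
loss = dpo_loss_from_votes(10, 2, -5.0, -6.0, -5.5, -5.5)
```

An obvious weakness is that this throws away the vote margin (10 vs. 9 counts the same as 10 vs. 2); weighting pairs by that margin would be one way to keep more of the signal.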