Several months ago I had dinner with a GDM employee whose team was working on RL to make LLMs play games. I would be very surprised if this hasn’t been going on for well over a year already.
In terms of public releases, reasoning models are less than a year old. The way these things work, I suspect, is that there are a lot of smaller, less expensive experiments going on at any given time, which generally take time to make it into the next big training run. These projects take some time to propose and develop, and the number of such experiments going on at a frontier lab at a given time is (very roughly) the number of research engineers (ie talent-constrained; you can’t try every idea). Big training runs take several months, with roughly one happening at a time.
“Agentic” wasn’t a big buzzword until very recently. Google Trends shows an obvious exponential-ish trend which starts very small in the middle of last year, doesn’t get significant until the beginning of this year, and explodes out from there.
Thinking about all this, things seem just about on the fence to me. I suspect the first few reasoning models didn’t have game-playing in their RL at all, because the emphasis was on getting “reasoning” to work. A proactive lab could have put game-playing into the RL for the next iteration. A reactive lab could have only gotten serious about it this year.
The scale also matters a lot. Data-hunger means that they’ll throw anything they have into the next training run so long as it showed some success in smaller-scale experiments, and maybe even if it didn’t. However, the first round of game-playing training environments could have a negligible effect on the final product, simply because there wouldn’t be a ton of training cases yet. By the second round, if not the first, they should have scraped together a big collection of cases to train on.
There’s also the question of how good the RL algorithms are. I haven’t looked into it very much, and most of the top labs keep details quite private anyway, but my impression is that the RL algorithms used so far have been quite bad (not ‘real RL’—just assigning equal credit to all tokens in a chain-of-thought). This will presumably get better (e.g. they’ll figure out how to use some MCTS variant if they haven’t already). This is extremely significant for long-horizon tasks, because the RL algorithms right now (I’m guessing) have to be able to elicit at least one successful sample in order to get a good training gradient in that direction; long tasks will be stuck in failed runs if there’s not any planning-like component.
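To make the “equal credit to all tokens” guess concrete, here’s a minimal sketch of the kind of objective I have in mind (my own illustration of outcome-only, REINFORCE-style training, not any lab’s actual code; the function name and shapes are invented for the example):

```python
import torch

def uniform_credit_loss(token_logprobs: torch.Tensor, reward: float) -> torch.Tensor:
    """REINFORCE-style loss where a single scalar reward for the whole rollout
    is spread equally over every generated token (no per-token credit assignment)."""
    # Every token gets the same advantage, so the gradient can't distinguish
    # useful reasoning steps from useless ones within the chain-of-thought.
    return -(reward * token_logprobs).sum()

# Toy usage: log-probs of 5 sampled tokens from one rollout.
logprobs = torch.randn(5, requires_grad=True)

loss = uniform_credit_loss(logprobs, reward=1.0)  # rollout succeeded
loss.backward()  # pushes up the probability of *all* tokens equally

# If every sampled rollout fails (reward = 0.0), the loss is identically zero,
# so there is no gradient at all -- the long-horizon failure mode described above.
```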
In any case, yeah, I think if we haven’t seen proper game-playing training in frontier models yet, we should see it very soon. If LLMs are still “below child level at this task” end-of-year then this will be a significant update towards longer timelines for me. (Pokémon doesn’t count anymore, though, because there’s now been significant scaffolding-tuning for that case, and because a lab could specifically train on Pokémon due to the attention it has gotten.)
Also: I suspect there’s already been a lot of explicit agency training in the context of programming. (Maybe not very long time-horizon stuff, though.)