Strong guess: they’re letting it generate images in the chain-of-thought. This would obviously be useful for image generation (make ten tries, pick the best parts of each for the final answer) and is probably useful for other kinds of planning as well, but I’d guess it’s hard to RL a model into thinking in pictures usefully (there’s no pretraining data of that form).
I guess you could RL a model into generating images during chain-of-thought that don’t actually accomplish anything, just by instructing it to do so and then rewarding compliance. Depending on how competent Meta’s team is, they might have done either thing.