I’d say long reasoning wasn’t really elicited by CoT prompting
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it. On the other hand, there’s nothing like “be madly obsessed with your goal” that’s known to boost LLM performance in agent settings.
There were clear “signs of life” on extended inference-time reasoning; there are (to my knowledge) none on agent-like reasoning.
you might need a GPT-8 for it to spontaneously emerge in a base model, demonstrating that it was in GPT-5.5 all along
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
That’s basically the core of my argument. If LLMs learned agency skills, they would’ve been elicitable in some GPT-N, with no particular reason to think that this N needs to be very big. On the contrary, extrapolating a qualitative jump from GPT-3 to GPT-4 similar to the one that happened from GPT-2 to GPT-3, I’d expected these skills to spontaneously show up in GPT-4 – if they were ever going to show up.
They didn’t show up. GPT-4 ended up as a sharper-looking GPT-3.5, and all progress since then amounted to GPT-3.5’s shape being more sharply defined, without that shape changing.
IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it.
It’s not central to the phenomenon I’m using as an example of a nontrivially elicited capability. There, the central thing is efficient CDCL-like in-context search that enumerates possibilities while generalizing blind alleys to explore similar blind alleys less within the same reasoning trace, which can get about as long as the whole effective context (on the order of 100K tokens). Prompted (as opposed to elicited-by-tuning) CoT won’t scale to arbitrarily long reasoning traces by adding “Wait” at the end of a reasoning trace either (Figure 3 of the s1 paper). Quantitatively, this manifests as scaling of benchmark outcomes with test-time compute that’s dramatically more efficient per token (Figure 4b of the s1 paper) than the parallel scaling methods such as consensus/majority and best-of-N, or even PRM-based methods (Figure 3 of this Aug 2024 paper).
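For concreteness, here is a minimal sketch of the two parallel scaling baselines mentioned above, consensus/majority and best-of-N. It is only an illustration: the sampled answers are made up, and `toy_score` is a hypothetical stand-in for whatever verifier or reward model would rank candidates in practice; none of this is taken from the s1 paper.

```python
from collections import Counter
from typing import Callable, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Consensus/majority: return the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Best-of-N: return the answer that the scorer rates highest."""
    return max(answers, key=score)


# Made-up final answers from five independent samples of the same prompt.
samples = ["42", "41", "42", "17", "42"]


def toy_score(answer: str) -> float:
    """Hypothetical scorer (assumption): prefers answers close to 40."""
    return -abs(int(answer) - 40)


print(majority_vote(samples))
print(best_of_n(samples, toy_score))
```

Both methods spend extra compute on independent samples and only combine them at the end, so nothing learned in one sample (such as a blind alley) carries over to another. That is the contrast with a single long reasoning trace.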
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
I was just anchoring to your example that I was replying to where you sketch some stand-in capability (“paperclipping”) that doesn’t spontaneously emerge in “GPT-5/6” (i.e. with mere prompting). I took that framing as it was given in your example and extended it to more scale (“GPT-8”) to sketch my own point, that I expect capabilities that can be elicited to emerge much later than the scale where they can be merely elicited (with finetuning on a tiny amount of data). It wasn’t my intent to meaningfully gesture at particular scales with respect to particular capabilities.