IIRC, “let’s think step-by-step” showed up in benchmark performance basically immediately, and that’s the core of it.
It’s not central to the phenomenon I’m using as an example of a nontrivially elicited capability. There, the central thing is efficient CDCL-like in-context search that enumerates possibilities while generalizing from blind alleys, so that similar blind alleys are explored less within the same reasoning trace, which can get about as long as the whole effective context (on the order of 100K tokens). Prompted (as opposed to elicited-by-tuning) CoT won’t scale to arbitrarily long reasoning traces by appending “Wait” at the end of a reasoning trace either (Figure 3 of the s1 paper). Quantitatively, this manifests as scaling of benchmark outcomes with test-time compute that’s dramatically more efficient per token (Figure 4b of the s1 paper) than parallel scaling methods such as consensus/majority and best-of-N, or even PRM-based methods (Figure 3 of this Aug 2024 paper).
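To make the CDCL analogy concrete, here is a toy sketch of the underlying idea: when search hits a blind alley, it learns a “nogood” (a small conflicting subset of the current assignment) and prunes later branches containing that subset, so similar blind alleys are explored less. The problem setup (graph 3-coloring) and all names are illustrative assumptions, not anything from the comment above.

```python
def find_conflict(assignment, edges):
    """Return a minimal conflicting subset (a violated edge), or None."""
    for u, v in edges:
        if u in assignment and v in assignment and assignment[u] == assignment[v]:
            return frozenset([(u, assignment[u]), (v, assignment[v])])
    return None

def solve(variables, domains, edges, assignment=None, nogoods=None):
    """Backtracking search that learns nogoods from blind alleys."""
    assignment = assignment if assignment is not None else {}
    nogoods = nogoods if nogoods is not None else set()
    if len(assignment) == len(variables):
        return dict(assignment)
    var = variables[len(assignment)]
    for value in domains[var]:
        assignment[var] = value
        items = set(assignment.items())
        # Prune any branch that contains a previously learned blind alley.
        if any(ng <= items for ng in nogoods):
            del assignment[var]
            continue
        conflict = find_conflict(assignment, edges)
        if conflict is not None:
            nogoods.add(conflict)  # learn from the blind alley
            del assignment[var]
            continue
        result = solve(variables, domains, edges, assignment, nogoods)
        if result is not None:
            return result
        del assignment[var]
    return None

# Toy usage: 3-color a 4-node graph.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
variables = ["a", "b", "c", "d"]
domains = {v: [0, 1, 2] for v in variables}
print(solve(variables, domains, edges))  # → {'a': 0, 'b': 1, 'c': 2, 'd': 0}
```

In this toy the learned nogoods merely mirror the edge constraints; a real CDCL solver derives novel implied clauses from the conflict graph, which is where the power to generalize across distinct blind alleys comes from.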
If you agree that it can spontaneously emerge at a sufficiently big scale, why would you assume this scale is GPT-8, not GPT-5?
I was just anchoring to the example I was replying to, where you sketch a stand-in capability (“paperclipping”) that doesn’t spontaneously emerge in “GPT-5/6” (i.e. with mere prompting). I took that framing as given in your example and extended it to more scale (“GPT-8”) to sketch my own point: I expect capabilities to emerge spontaneously much later than the scale at which they can already be merely elicited (with finetuning on a tiny amount of data). It wasn’t my intent to meaningfully gesture at particular scales with respect to particular capabilities.