I think the counterpoint basically makes the paper ~0 evidence for the claim that large latent reasoners will exist within the next year. More generally, generic task improvements matter more than specialized task improvements, due to the messiness and complexity of reality. One of my updates over the past 2 years is that RL/inference/pre-training scaling dwarfs scaffolding improvements by such large margins that scaffolding quickly becomes worthless, so I no longer consider scaffolded LLMs a relevant concern/threat.
I’d update back to your prior belief on how likely LLMs are to become latent reasoners/have something like neuralese.

I’d also be substantially worried about data leakage here.
I’m permanently retracting the claim that scaffolding doesn’t matter (though admittedly I was biased by things like AutoGPT no longer being talked about, presumably because newer LLMs have completely obsoleted its scaffolding).
Edit: Apparently current RL is mostly just the good version of the scaffolding people envisioned in 2023, if you believe the paper here.
> I no longer consider scaffolded LLMs a relevant concern/threat.
I am extremely surprised to see you say that, to the point that I think I must be misinterpreting you. Which tools an LLM is able to use seems to have huge effects on what it can accomplish.
Concretely, Claude 3.5 Sonnet can do far more useful coding tasks with a single tool to execute bash commands on a VM than Claude 4.5 Sonnet can in the absence of that tool. Or is “while loop plus tools” not the type of scaffolding you’re referring to?
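(To make "while loop plus tools" concrete, here is a minimal sketch of the kind of scaffold I mean. `query_llm` is a hypothetical stand-in for whatever model API is in use, and its message format is invented for the example; only the bash execution is real.)

```python
import json
import subprocess

def query_llm(messages):
    """Hypothetical stand-in for a chat-model API call.

    Assumed to return either {"type": "text", "text": ...} for a final
    answer, or {"type": "tool_call", "command": ...} when the model
    wants a bash command executed. Plug in a real API here.
    """
    raise NotImplementedError

def run_bash(command, timeout=60):
    """Execute a bash command (ideally inside a disposable VM/sandbox)
    and capture combined stdout/stderr for the model to read."""
    result = subprocess.run(
        ["bash", "-c", command],
        capture_output=True, text=True, timeout=timeout,
    )
    return result.stdout + result.stderr

def agent_loop(task, max_steps=20):
    """The 'while loop plus tools' scaffold: call the model, run any
    command it requests, feed the output back, repeat until it answers."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = query_llm(messages)
        if reply["type"] == "text":
            return reply["text"]  # model produced a final answer
        output = run_bash(reply["command"])  # model requested a tool call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Tool output:\n{output}"})
    return None  # gave up after max_steps
```

The point is that the entire "scaffold" here is ~20 lines of glue; the capability difference I'm describing comes from whether the model gets the bash tool at all, not from anything clever in the loop.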