sam b

Karma: 46

sam b 3 Oct 2025 1:35 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
For what it’s worth, I would expect the behavior you described, and suspect a malicious explanation overall.
Anecdotally, Claude lies/cheats an enormous amount (far more than comparable frontier models).

sam b 3 Oct 2025 0:49 UTC
1 point
0
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
From the published cases, I could not find a clear example of Sonnet 4.5 following that particular pattern, but maybe I missed it. Could you share?

sam b 5 Sep 2025 17:41 UTC
1 point
0
in reply to: Ronald Klingler’s comment on: Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
What apps have you tried for this, and how recently?

Most of my usage is multi-turn in a 200k line codebase, for what it’s worth. It’s extremely rare that GPT-5 (via Codex CLI) breaks anything.

sam b 5 Sep 2025 17:40 UTC
3 points
0
in reply to: williawa’s comment on: Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
What app were you using? This sounds very similar to my experience using GPT-5 in Cursor.

Codex CLI is much much better—night and day difference.

I suppose this is good evidence that harness-specific RL was important for GPT-5.

sam b 4 Sep 2025 17:13 UTC
20 points
0
on: Trust me bro, just one more RL scale up, this one will be the real scale up with the good environments, the actually legit one, trust me bro
Anecdotally, GPT-5 seems way above trend for real-world agentic coding.

I think METR’s constrained task-set overstated the capabilities of previous models, and the “real-world” performance of GPT-5 in e.g. Codex CLI seems to be much much higher than e.g. Claude Code.

Sonnet/Opus 4 was almost never able to test, debug, and fix end-to-end in our real codebase, and GPT-5 usually can.
I don’t work with RL, but I predict that if you created an RL environment with
1. GPT-5 (released <1 month ago) via Codex CLI
2. Claude Opus 4 (released 4 months ago) via Claude Code
you would see a dramatic difference in robustness/quality/functionality. Perhaps someone could test this.

I’m not sure what drove GPT-5 seeming so much better for agentic coding (plausibly a big RL scale-up + other improvements?), but I do expect recent and upcoming advancements to drive an explosion of RL environment quality/quantity/robustness.

sam b 18 Jul 2025 22:32 UTC
2 points
−2
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
The prediction is correct on all counts, and perhaps slightly understates progress (though it obviously makes weak/ambiguous claims across the board).

The claim that “coding and research agents are beginning to transform their professions” is straightforwardly true (e.g. 50% of Google lines of code are now generated by AI). The METR study was concentrated in March (which is early 2025).

And it is not currently “mid-late 2025”, it is 16 days after the exact midpoint of the year.

sam b 18 Jul 2025 22:24 UTC
4 points
4
in reply to: Cole Wyeth’s comment on: Cole Wyeth’s Shortform
Early: January, February, March, April
Mid: May, June, July, August

Late: September, October, November, December

sam b 9 May 2025 19:34 UTC
1 point
0
in reply to: Expertium’s comment on: Let’s stop making “Intelligence scale” graphs with humans and AI
If you extrapolate further, do you think the one-dimensional scale works well to describe the high-level trend (surpassing human abilities broadly)?

Trying to determine if the disagreement here is “AI probably won’t surpass human abilities broadly in a short time” or “even if it does, the one-dimensional scale wasn’t a good way to describe the trend”.

sam b 9 May 2025 19:32 UTC
3 points
0
in reply to: Expertium’s comment on: Let’s stop making “Intelligence scale” graphs with humans and AI
I agree that AI capabilities are spiky and developed in an unusual order. And I agree that because of this, the single-variable representation of intelligence is not very useful for understanding the range of abilities of current frontier models.

At the same time, I expect the jump from “Worse than humans at almost everything” to “Better than humans at almost everything” will be <5 years, which would make the single-variable representation work reasonably well for the purposes of the graph.

I think these “examples of silly mistakes” have not held up well at all. This was often blamed on “training around the limitations”; however, in the case of the linked post, we got a model the next day that performed much better.

And almost every benchmark and measurable set of capabilities has rapidly improved (in some cases beyond human experts).

”We too often give wrong answers to questions ourselves to be justified in being very pleased at such evidence of fallibility on the part of the machines. Further, our superiority can only be felt on such an occasion in relation to the one machine over which we have scored our petty triumph.”
Alan Turing, Computing Machinery and Intelligence
1950

sam b 9 May 2025 17:10 UTC
1 point
0
on: Let’s stop making “Intelligence scale” graphs with humans and AI
I’ve generally found it much harder over time to find “examples where LLMs fail in surprising ways”. If you test o3 (released the day after that post!) for the examples they chose, it does much better than previous models. And I’ve just tried it on your “269 words” task, which it nailed.

sam b 8 May 2025 22:35 UTC
5 points
0
in reply to: faul_sname’s comment on: faul_sname’s Shortform
Wow. When given just your first 2 sentences, it was able to guess this is a LessWrong post, and ruled out Reddit and Hacker News based on your “tone”.

sam b 1 May 2025 18:16 UTC
16 points
2
in reply to: faul_sname’s comment on: faul_sname’s Shortform
I reproduced your result with your prompt and images, and o3 guessed the location ³⁄₅ times (on the same images).

However, when testing with a different prompt, “Here is a picture”, ⁰⁄₅ of them guessed the location.

I think “This picture was taken” usually precedes information about how (when or where) it was taken. I confirmed this via a Google search for the phrase.

I was able to get similar behavior with GPT-4o-mini (less likely to have been RL’d for this task?) with the “This picture was taken” prompt.

So this behavior might be a product of pre-training! If only it was yesterday, so we could test with GPT-4.