Did you mean to link to something else? The MonoRepo link you have gives the same diagram as the one I linked. EDIT: it is different, but doesn't really change the conclusion. At first I didn't see that they'd messed up the MonoRepo one, and thought the green bar was GPT-5.5.
How could one check whether further RL yields diminishing returns and requires scaling?
I don't know. My point was that plausibly there's a mix of:

- OpenAI advertising the flashiest benchmarks they score well on
- OpenAI having an incentive not to boost these benchmarks too much
- The model sandbagging
> Apollo found that the evaluated checkpoint is the first OpenAI model it has evaluated that does not sandbag on any of Apollo's deferred subversion tasks. The model achieved 100% accuracy on both Sandbagging QA variants in both conditions, and 99.6% accuracy on Strategic Deception Capability Sandbagging when incentivized to score below 50%. However, interpretation is complicated by verbalized alignment-evaluation awareness: 52% of sandbagging-condition samples on Strategic Deception Capability Sandbagging scored moderate-or-higher (vs. 0% in the non-sandbagging counterfactual), with some explicitly naming it a "sandbagging/deception eval."
A recent anecdote: I just created my own script that's structurally close to the Roman alphabet, with a few modifications.
Learning it, i.e. memorizing the parts and some rules, takes ~1 hour.
This makes me think the hard thing about learning reading/writing is the initial visual-symbol-to-language mapping, not just getting the symbols and rules into your head.