I asked some LLM models/agents to review this post, in preparation for a possible Unjournal.org evaluation of some form. FWIW:
1. Conversation started with GPTPro
2. RoastmyPoast.org “epistemic audit” (result: C+, 68⁄100, which is a bit below average iirc)
3. Claude 4.5 Opus
My take: they saw the post as plausible, free of major errors, and generating some useful insights, but with some important limitations; the main claims are not all ‘obviously demonstrated’.
Below are some overall syntheses/pulled quotes that seemed relevant to me. All folded content is LLM output.
GPTPro
What holds up (probability ~0.6–0.8):
Public evidence supports that inference budgets buy big gains on current reasoning benchmarks, and RL post-training scaling appears meaningfully less compute-efficient (often by ~2 extra decades to cover similar 20→80 improvements).
RL post-training is now plausibly reaching “pretraining-scale” at least at xAI (and maybe elsewhere soon), so “RL is no longer a trivially cheap add-on” is real.
What’s uncertain / overconfident (probability ~0.2–0.5):
The specific conversion “100× training ≈ 1,000× inference” as a general rule, and thus the specific “1,000,000× RL for a GPT-level jump.” This rests on a non-robust mapping and then exponentiates it.
The implication that we’re “near the effective limit” of RL training gains, given recent public RL-scaling work emphasizing recipe dependence and improved efficiency/asymptotes.
…[Verdict] Ord is on solid ground that current reasoning improvements rely heavily on inference budgets … he is on weak ground when he turns that into a near-term “end of scaling” claim via a brittle 1,000,000× extrapolation.
I asked what aspects were missing in the comments on LW and EA Forum; it noted a lack of discussion of …
how sensitive the RL-vs-inference scaling gap is to model size, data quality, reuse, training recipe, domain/task type;
how recent empirical RLHF / RL‑post‑training research (on open, small-scale, or controlled setups) might affect that gap;
the analogy of “inefficiency gap = fundamental ceiling” vs. “inefficiency may be engineering‑level problem, solvable with better algorithms/research”;
the degree of uncertainty involved in extrapolating over many orders of magnitude;
the possibility that RL‑post-training inefficiency might be significantly reduced in the future (with better methodology).
So in short: the public conversation has touched some of the major “skeptical” themes, but not with the depth, technical framing, or caution that a more expert‑oriented review might use.
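To make the “brittle extrapolation” point above concrete, here is a toy sensitivity check of my own (not LLM output). It assumes, purely for illustration, a headline requirement of ~10^6× RL compute for a GPT-level jump and a linear benchmark-gain-per-decade-of-compute fit; the numbers are placeholders, not figures from the post. The point is only that a modest (±10–20%) error in a slope fitted over a narrow range shifts a six-order-of-magnitude extrapolation by roughly 3–30× in either direction.

```python
# Toy sensitivity check on many-order-of-magnitude extrapolations.
# All numbers here are illustrative assumptions, not figures taken from the post.

TARGET_DECADES = 6.0   # assumed headline: ~10^6x (1,000,000x) RL compute for a GPT-level jump
FITTED_SLOPE = 1.0     # assumed benchmark gain per decade of RL compute, fitted over a narrow range
gain_needed = TARGET_DECADES * FITTED_SLOPE  # total gain implied by the headline extrapolation

# How much does the implied compute requirement move if the fitted slope is off by 10-20%?
for slope_error in (-0.2, -0.1, 0.0, 0.1, 0.2):
    slope = FITTED_SLOPE * (1 + slope_error)
    decades_needed = gain_needed / slope
    print(f"slope error {slope_error:+.0%}: ~10^{decades_needed:.1f}x RL compute "
          f"(headline: 10^{TARGET_DECADES:.0f}x)")
```

This doesn’t adjudicate whether the underlying training-vs-inference mapping is right; it just illustrates why figures like 1,000,000× should be held loosely.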
Claude 4.5 Opus
Bottom Line Assessment
Methodological rigor: Decent for available data; not peer-reviewed ML research.
Alignment with expert consensus: Broadly consistent with Sutskever; partially at odds with Epoch AI’s optimism.
Potential blind spots: Algorithmic improvements, insider knowledge, IDA possibilities.
Originality: Useful synthesis, but not breakthrough technical analysis.
Should you trust it?: Trust it as informed policy analysis, not as definitive ML research.
The honest answer: Ord is probably directionally correct that RL scaling is less efficient than pre-training was, and that we’re approaching limits. But the specific numbers (10,000x, 1,000,000x) should be held loosely. Actual ML researchers at frontier labs know things that aren’t public, and algorithmic breakthroughs could change the picture.
RoastMyPoast Epistemic Audit (C+, 68⁄100)
Uses agents and claude-sonnet-4-5-20250929
Noted “Overconfidence” about:
Causal claims about what RL “unlocks” or “allows” without establishing mechanism
… Long-range extrapolations presented as reliable estimates
And “Single points of failure”:
The assumption that observed slopes continue far beyond measured ranges
The causal interpretation of RL “unlocking” inference rather than teaching token-intensive strategies
The representativeness of OpenAI’s published data
Re: Unjournal.org potentially commissioning this for an evaluation of some form, we might consider:
Is this post highly influential on its own (are funders and labs using this to guide important policy choices)?
Is there further expertise we could unlock that is not reflected in these comments? (The LLMs suggested some evaluators, but we sometimes find it hard to get people to accept the assignment and follow through)
Is there a more formal research output that covers this same ground, coming from ML researchers, scaling experts, etc.?
The Unjournal (unjournal.org) is considering commissioning this post for expert evaluation in our applied stream.
Looking for any feedback (here or privately) on whether this would be high-value, how to go about it (which particular issues/expertise to target), whether other research in this domain would be higher-value, etc.