METR discovered an issue with their task horizon modeling. Fixing it reduces some recent models’ 50%-success horizons by 10-20% but also increases the 80% horizon a bit. For example, Opus 4.6 dropped from 14.5 hours to 12 hours.
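For intuition on why the two horizons can move in opposite directions: the horizon numbers are read off a fitted curve of success probability versus task length, so changing the task mix can tilt the curve, lowering where it crosses 50% while raising where it crosses 80%. Here’s a toy sketch under my assumption of a logistic-in-log-length fit (roughly METR’s published approach); the data points are made up, not METR’s:

```python
# Toy sketch of horizon estimation: fit p(success) = sigmoid(a + b*log2(length))
# by gradient descent, then read off the length where predicted success hits p.
# This is an illustrative assumption of the method, not METR's actual code.
import math

def fit_logistic(lengths_min, successes, lr=0.1, steps=20000):
    """Fit a + b*log2(length) logistic model to (length, success) pairs."""
    a, b = 0.0, 0.0
    xs = [math.log2(t) for t in lengths_min]
    for _ in range(steps):
        ga = gb = 0.0
        for x, y in zip(xs, successes):
            p = 1 / (1 + math.exp(-(a + b * x)))
            ga += (p - y)
            gb += (p - y) * x
        a -= lr * ga / len(xs)
        b -= lr * gb / len(xs)
    return a, b

def horizon(a, b, p):
    """Task length (minutes) at which predicted success equals p."""
    logit = math.log(p / (1 - p))
    return 2 ** ((logit - a) / b)

# Hypothetical runs: (task length in minutes, 1 = model succeeded).
data = [(1, 1), (2, 1), (4, 1), (8, 1), (15, 1), (30, 1),
        (30, 0), (60, 1), (60, 0), (120, 1), (120, 0), (240, 0), (480, 0)]
a, b = fit_logistic([t for t, _ in data], [s for _, s in data])
print(f"50% horizon: {horizon(a, b, 0.5):.0f} min")
print(f"80% horizon: {horizon(a, b, 0.8):.0f} min")
```

Because the 80% horizon sits on the steep upper part of the curve, a small change in the fitted slope can move it in a different direction than the 50% crossing.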
tdko
It’s not just your friend; some METR staff themselves think the same. And it’s not just Opus 4.6, but “Similar deal for other recent models” too.
Likely just the result of noise but extraordinarily funny METR results today.
Opus 4.6 got nearly triple the time horizon of Opus 4.5, immediately followed by GPT-5.3 getting a worse time horizon than GPT-5.2.
It’s profoundly misleading framing which absolutely qualifies it as a form of misinformation. It’s superficially neutral, yes, but a layman reading the article would come out with an entirely different take on RFK’s vaccine policy than someone who read the Wikipedia article. The Grokipedia article steelmans all of RFK’s positions to an unreasonable degree and does almost everything it can to downplay that 95% of his vaccine policy is pseudoscientific bunk without losing its “neutral” cred.
>reconstituting the CDC’s Advisory Committee on Immunization Practices (ACIP) on June 9, 2025, by replacing all members to eliminate pharmaceutical financial ties and prioritize safety data
He purged the board and stacked it with vaccine-skeptical ideological allies.
>The new committee voted on July 23, 2025, to remove thimerosal—a mercury-based preservative—from U.S. influenza vaccines, invoking precautionary principles despite earlier findings of safety in low doses
This is the least problematic sentence since it contains the (somewhat limp) note about thimerosal’s safety, but again presents the vote as neutral scientific judgment when it was the foreseeable result of (the again unmentioned) ideological stacking.
>Kennedy called this a move to safer options, while manufacturers cautioned about potential supply disruptions
The “safety” concern here being the nonexistent risk of autism, and it frames the opposing view as corporations less worried about safety than about traditional business interests.
If anything, it sounds like the inconvenient truth here that Grok would rather avoid is that vaccines are actually very safe and the political correctness at play is sanewashing an insane HHS chief.
Do you think the Grokipedia’s coverage of RFK Jr’s vaccine policy is accurate and responsible?
METR has updated their task horizon methodology with more tasks which leads to different task horizon scores.
Claude Sonnet 4.5’s 50% task horizon is 1 hr 53 min, putting it slightly behind GPT-5’s 2 hr 15 min score.
Note that they also claimed Opus 4 worked for over seven hours but only scored 1h20m on the METR task suite. I wouldn’t be surprised to see Sonnet 4.5 get a strong METR score but 30 hours definitely isn’t likely.
Surprised that this hasn’t gotten more discussion. There’s some potentially big implications for the time horizons study, which has become fairly load-bearing in timelines discourse.
METR’s task-horizon score on GPT-5 is 2h17m @ 50% success. For comparison, o3 was 1h32m and Grok 4 (prior SOTA) was 1h50m. The 80% success score is 25m; prior SOTA was 20m from both o3 and Claude 4 Opus.
https://metr.github.io/autonomy-evals-guide/gpt-5-report/
Did we ever get any clarification as to whether Grok 4 did in fact use as much compute on posttraining as pretraining?
METR has finally tested Gemini 2.5 Pro (June Preview) and found its 50% success task horizon is only 39 minutes, far worse than o3 or Opus 4 which are at 90 and 80 minutes respectively. Probably shouldn’t be a gigantic update given 2.5 Pro never scored amazingly at SWE-Bench, but still worse than I expected given how good the model is otherwise.
I feel like looking at unreleased models for doubling time mucks things up a bit. For instance, I’m assuming the unreleased o3 model from December had a significantly longer time horizon in math than the released o3, given its much higher benchmarks on FrontierMath, etc.
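For reference, the doubling-time arithmetic the trend extrapolations rest on is simple. A back-of-envelope sketch using the o3 and GPT-5 50%-horizons from upthread (92 and 137 minutes); the release dates here are my rough approximations, and METR’s actual trend is fit across many models, not two points:

```python
# Back-of-envelope doubling time from two (release date, horizon) points,
# assuming exponential growth. Dates are approximate; horizons are the
# 92-minute (o3) and 137-minute (GPT-5) figures quoted above.
from datetime import date
import math

def doubling_time_months(d1, h1, d2, h2):
    """Months for the horizon to double, given two (date, horizon) points."""
    months = (d2 - d1).days / 30.44  # average month length
    return months / math.log2(h2 / h1)

dt = doubling_time_months(date(2025, 4, 16), 92, date(2025, 8, 7), 137)
print(f"implied doubling time: {dt:.1f} months")
```

An unreleased model with an inflated horizon at an early date would shrink the numerator and stretch the implied doubling time, which is the distortion worried about above.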
Worth noting this year’s p3 was really easy; Gemini 2.5 Pro even got it some of the time, and Grok 4 Heavy and Gemini Deep Think solved problems rated as harder. Still an achievement, though.
From the author of the epoch article:
https://x.com/GregHBurnham/status/1946655635400950211
https://x.com/GregHBurnham/status/1946725960557949227
https://x.com/GregHBurnham/status/1946567312850530522
METR’s task length horizon analysis for Claude 4 Opus is out. The 50%-success horizon is 80 minutes, slightly worse than o3’s 90 minutes; the 80%-success horizon is tied with o3 at 20 minutes.
Note it was forecast for May 2030 exactly one year ago. It’s been fluctuating from 2030-2034 ever since GPT-4 was dropped almost exactly 3 years ago, with a few extended periods closer to both the high and low ends. I think it’s mostly noise.