I don’t think there’s enough evidence to draw hard conclusions about this section’s accuracy in either direction, but I would err on the side of thinking ai-2027’s description is correct.
Footnote 10, visible in your screenshot, reads:
For example, we think coding agents will move towards functioning like Devin. We forecast that mid-2025 agents will score 85% on SWEBench-Verified.
Current SOTA models score:
• 83.86% (codex-1, pass@8)
• 80.2% (Sonnet 4, pass@several, unclear how many)
• 79.4% (Opus 4, pass@several)
(Is it fair to allow pass@k? This Manifold Market doesn’t allow it for its own resolution, but here I think it’s okay, given that the footnote above makes claims about ‘coding agents’, which presumably allow iteration at test time.)
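For what it’s worth, if “pass@k” in those numbers means the usual thing (a problem counts as solved when any of k sampled attempts passes), the standard unbiased estimator from the HumanEval paper is easy to sketch; the n, c, k values below are purely illustrative and not taken from any of the reported runs.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total attempts sampled per task
    c: number of those attempts that passed
    k: attempt budget being evaluated
    Returns the probability that at least one of k attempts,
    drawn without replacement from the n samples, passes.
    """
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative only: a task where 2 of 16 sampled attempts passed.
print(round(pass_at_k(n=16, c=2, k=1), 3))  # 0.125
print(round(pass_at_k(n=16, c=2, k=8), 3))  # 0.767
```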
Also, note the following paragraph immediately after your screenshot:
The agents are impressive in theory (and in cherry-picked examples), but in practice unreliable. AI twitter is full of stories about tasks bungled in some particularly hilarious way. The better agents are also expensive; you get what you pay for, and the best performance costs hundreds of dollars a month.[11] Still, many companies find ways to fit AI agents into their workflows.[12]
AI twitter sure is full of both impressive cherry-picked examples and stories about bungled tasks. I also agree that the claim about “find[ing] ways to fit AI agents into their workflows” is exceedingly weak. But it’s certainly happening: a quick Google for “AI agent integration” turns up this article from IBM, where agents are diffusing across multiple levels of the company.
Sorry, this is the most annoying kind of nitpicking on my part, but I guess it’s probably relevant here (and for your other comment responding to Stanislav down below): the midpoint of the year falls on July 2, 2025, so we’re roughly two weeks past the absolute midpoint, about 54.4% of the way through the year.
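For the curious, a minimal sketch of the date arithmetic (the mid-July date in the example is hypothetical, just to show the calculation):

```python
from datetime import datetime

def fraction_of_year_elapsed(moment: datetime) -> float:
    """Fraction of the calendar year that has elapsed at `moment`."""
    start = datetime(moment.year, 1, 1)
    end = datetime(moment.year + 1, 1, 1)
    return (moment - start) / (end - start)

# 2025 is not a leap year, so the exact midpoint falls at noon on July 2.
print(fraction_of_year_elapsed(datetime(2025, 7, 2, 12)))        # 0.5
# A hypothetical mid-July date, a couple of weeks past the midpoint:
print(round(fraction_of_year_elapsed(datetime(2025, 7, 18)), 3))  # 0.542
```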
Also, the codex-1 benchmark numbers were released on May 16, while Claude 4’s were announced on May 22 (both certainly before the midpoint).