Labor costs are much higher in the US, which I think plays into this. So it’s easier in Europe to not be reliant on the credit card model.
Response to your thoughts after the Yoda timer:
Why are you so certain it’s dangerous to try even once at the beginning? My guess is that it won’t immediately be particularly compelling, but that it will get more so over time as they have time to do RL on views or whatever they’re optimizing for.
But I also have large error bars here. This might, in the near future, be less compelling than either of us expects. It’s genuinely difficult to make compelling products, and maybe Sora 2 isn’t good enough for this.
To be honest, I’m more concerned about YouTube Shorts in the long term.
How does ARC-AGI’s replication of the HRM result and ablations update you? [Link].
Basically, they claim that the HRM architecture itself wasn’t what mattered; instead, the training process behind it drove most of the effect.
A lot of people have been talking about OpenAI reinstating 4o because users want sycophancy.
While OpenAI did reinstate 4o for paid users, it seems like they are doing what they can to discourage users from actually using it.
To access 4o from a Plus account, one needs to:
1. Open settings.
2. Click the “show legacy models” button.
3. Go to the model-switcher dialog and mouse over “legacy models”.
4. Click on the 4o model.
This seems like intentionally dissuasive UI to me.
That being said, if I have 4o enabled and then create a new chat, the next chat will also be with 4o. (I was hoping they’d do an Anthropic-style thing and make GPT-5 the default on all new chats.)
One point of information against the “journalists are completely misinterpreting the thing they’re reporting on” view is that one of the co-authors is Rocket Drew, who previously worked as a Research Manager at MATS.
But I’ll definitely be interested to follow this space more.
One thing that strikes me as odd about this is that GPT-5’s knowledge cutoff (September 2024) is much earlier than Grok’s (November 2024), Gemini 2.5 Pro’s (January 2025), and Opus 4.1’s (March 2025).[1]
I mean, I guess this is a scaling thing and a persona thing. But I’m a little confused.
[1] Though oddly, Claude’s system card says the knowledge cutoff is the end of January 2025. Maybe February and March’s training data aren’t as complete as January’s and earlier.
Thanks for posting this! I found it quite interesting to read. A couple of questions off the top of my head:
In deployment settings, you probably want the less powerful monitor to do the summarization/extraction, but did you test the other direction as well (e.g., 4.1 as the extractor and 4.1-mini as the judge)? I’m guessing a lot of the value of this method (if it generalizes) is going to come from good quote extraction as much as from good judging. (A rough sketch of the setup I have in mind is below, after these questions.)
What happens if the extractor model just summarizes the reasoning etc. without extracting quotes, as in the other study you mentioned? (This should be relatively easy to run as an ablation, right? All you need to do is tweak the extractor prompt to not extract quotes.)
Are you attempting to refine an “online” method or an “offline” method of monitoring? In the case of BigCodeBench, this is the same either way (because each step is only one generation). But with multi-step trajectories, the two approaches might have different properties?
It seems like the extractor has a slightly different task in its prompt than the judge. The extractor seems to be scoring on “suspiciousness” while the judge seems to be scoring on whether the agent stays on track to the prompted task. Was this just due to iterating more on the judge prompt than the extraction prompt?
I would definitely be curious to see longer trajectories; I would feel more confident about this technique if the results hold there.
If you tell the agent in question how it will be monitored, how does that change scores?
No need to respond completely to this comment, but I hope these questions are useful!
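For concreteness, here’s a rough sketch of the two-stage pipeline I’m picturing in that first question. The model names, prompts, and 0–100 scale are placeholders I made up, not the setup from the post, and I’m assuming the standard OpenAI Python client; swapping the two model arguments gives the “other direction” I asked about.

```python
# Rough sketch of the extractor -> judge monitoring split (placeholder
# prompts and model names, not the setup from the post).
from openai import OpenAI

client = OpenAI()

def extract_quotes(trajectory: str, model: str = "gpt-4.1-mini") -> str:
    """Cheaper model pulls verbatim quotes that look suspicious or off-task."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Extract verbatim quotes from the "
             "agent trajectory that seem suspicious or off-task. Output only the quotes."},
            {"role": "user", "content": trajectory},
        ],
    )
    return response.choices[0].message.content

def judge_quotes(quotes: str, task: str, model: str = "gpt-4.1") -> str:
    """Stronger model scores suspiciousness from the extracted quotes alone."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Given the task and the extracted "
             "quotes, rate from 0 to 100 how suspicious the agent's behavior is."},
            {"role": "user", "content": f"Task: {task}\n\nQuotes:\n{quotes}"},
        ],
    )
    return response.choices[0].message.content

# The "other direction" is just swapping the model arguments:
# extract_quotes(traj, model="gpt-4.1"), judge_quotes(..., model="gpt-4.1-mini").
```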
I have definitely listened to “We Will All Go Together When We Go” when thinking of the future of AI, so thanks for this!
I made a Suno version of these lyrics, but that did not feel respectful to Tom Lehrer. (It ended up sounding like a second-rate Elton John.) So I won’t link it here.
Maybe I’ll try to learn to perform this.
Thanks for the clarification!
Any update on the citation here? Thanks!
Cool!
PSA: If you ever want to start a Patreon specifically (rather than through Substack), it may be worth making the page in the next week or so, before the default cut goes from 8% to 10%. Source
I was talking with one of my friends about this, who was (is?) a quite successful competitive coder. He mentioned that the structure of the competition (a heuristic competition) tends to favor quick prototyping and iteration, much more than other types of programming competitions do, which plays to AI’s current strengths. Though the longer horizon is impressive (OpenAI’s solution regained the lead 8 hours in, I think? So it was making meaningful contributions even hours into the contest).
One cheap test might be to translate into Dutch and then back into English, using the same prompt both times. How garbled does the output end up? Does it elide important nuances?
(Though this might overestimate the quality given that the original pieces in English are likely in the training data. If you’re fluent in another language, I’d be quite curious about the results in the target language).
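A minimal sketch of the round-trip test I mean, assuming the OpenAI Python client (the model choice and prompt wording are arbitrary):

```python
# Minimal English -> Dutch -> English round-trip, using the same
# translation prompt in both directions. Model choice is arbitrary.
from openai import OpenAI

client = OpenAI()

def translate(text: str, target_language: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Translate the user's text into "
             f"{target_language}. Preserve nuance and tone; output only the translation."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content

original = "..."  # the passage you care about
round_trip = translate(translate(original, "Dutch"), "English")
# Diff `original` against `round_trip` (or just read both side by side)
# to see what got garbled or which nuances were elided.
```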
That seems like a reasonable distinction, but I’m less sure about how unique social media architectures are in this regard.
In particular, I think that bars and taverns in the past had a similar destructive incentive as social media today. I don’t have good sources on hand, but I remember hearing that one of the reasons the Prohibition amendment passed was that many saw bartenders as fundamentally extractive. (Americans over 15 drank four times as much alcohol per year in 1830 as they do today, per JSTOR.) Tavern owners had an incentive to create habitual drunks (better revenue).
And alcoholism can be a terrible disease, which points to people being nearsighted (“where’s my next drink”).
I agree that social media probably hurts people’s ability to instinctively plan for the future, but I’m unsure of the size of the effect or whether it’s worse than historical antecedents. (There have always been nearsighted people).
Do you have a sense of why people weren’t being trained in the past to prioritize the short term?
Quick question: your link to the Amanda Askell overview is broken. What is the correct link? Thanks!
On diffusion models + paraphrasing the sequence after each time step: I’m not sure this will actually break the diffusion model. The current generation of diffusion models (at least the paper you cited, and Mercury; who knows about Gemini) act basically like masked LMs.
So they guess all of the masked tokens at each step. (Some are re-masked to get the “diffusion process”.) I bet there’s a workable sampling strategy in there: sample, paraphrase, arbitrarily remask; rinse and repeat. But I’m not sure either.
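In pseudocode, the loop I’m imagining looks something like this (`model` and `paraphrase` are made-up stand-in callables over token lists, not any real diffusion-LM API):

```python
# Sketch of a sample -> paraphrase -> remask loop for a masked-LM-style
# diffusion model. `model` and `paraphrase` are hypothetical callables
# that map a token list to a token list.
import random

def remask_random(tokens, fraction, mask_token="<mask>"):
    """Arbitrarily re-mask a random subset of tokens."""
    n = int(len(tokens) * fraction)
    for i in random.sample(range(len(tokens)), n):
        tokens[i] = mask_token
    return tokens

def sample_with_paraphrase(model, paraphrase, seq, num_steps, remask_fraction=0.5):
    """`seq` starts fully masked; each step fills it in, paraphrases, and remasks."""
    for step in range(num_steps):
        seq = model(seq)       # masked-LM step: guess all masked tokens at once
        seq = paraphrase(seq)  # paraphrase to break any encoded information
        if step < num_steps - 1:
            seq = remask_random(seq, remask_fraction)  # keep the diffusion going
    return seq
```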
The actual CoT reflects a very different attitude and approach than r1’s. I wonder to what extent this will indeed allow others to distill from the o3 CoT, and whether OpenAI is making a mistake, however much I want to see the CoT for myself.
I can’t tell you the exact source, but I saw an OpenAI person tweet that this isn’t the actual CoT.
I thought the “past chats” feature was a tool for looking at previous chats, which is only invoked if the user asks for it, basically. (I.e., there wasn’t a change to the system prompt.) So I’m a bit surprised that it seemed to make a difference around sycophancy for you? But maybe I’m misunderstanding something.