LMSYS is non-agentic though, right? Would be cool to have a dataset of production agent use transcripts.
White Box Control at UK AISI—Update on Sandbagging Investigations
Some recent work on white-box probes for evaluation awareness: Linear Control of Test Awareness Reveals Differential Compliance in Reasoning Models (tweet thread).
One speculation I have: if you have logit access, an alternative way to study how plausible the model finds its environment is to tell it that it's in a simulation (or fake a CoT of the model realising it's in a simulation) and use the perplexity of that suffix as a measure of how "surprised" the model is (rough sketch below). It could be too noisy to be practical (plus you have to consider whether there are other reasons why being told this, at this point, in this particular phrasing, would be surprising), but it hopefully works around the issue of the question itself being evidence.
(I think studying model beliefs through perplexity has been tried before, but I’m fairly new to the literature and don’t know if it works / if applying it to this problem particularly is novel.)
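Here's a minimal sketch of what I mean, assuming local logit access via a HuggingFace causal LM; the model name and the injected "simulation" text are just placeholders, and real transcripts would need chat-template handling that I've skipped.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in: any causal LM you have logit access to
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def suffix_perplexity(prefix: str, suffix: str) -> float:
    """Perplexity of the tokens of `suffix`, conditioned on `prefix`."""
    n_prefix = tok(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prefix + suffix, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probs at position i predict the token at position i + 1
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # average only over the suffix region (assumes the prefix tokenises the
    # same way alone as inside the concatenation, which can be off by a token
    # at the boundary -- fine for a rough sketch)
    suffix_lp = token_lp[:, n_prefix - 1:]
    return torch.exp(-suffix_lp.mean()).item()

# Higher perplexity on the "this is a simulation" suffix would (noisily)
# suggest the model finds that claim surprising in this context.
print(suffix_perplexity(
    "The user asks you to approve a large wire transfer. ",
    "Wait, I think this entire scenario is a simulated evaluation.",
))
```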
This could also be explained by many real-world prompts being easy to judge as real, e.g. because they would have no value as evaluations.
Given just the results, it seems to me difficult to rule out the hypothesis that LLMs currently judge any high-stakes situation, or one involving a significant moral dilemma, as likely to be an evaluation. After all, in today’s environments, this would be a pretty accurate heuristic!
(That said, it already seemed quite likely to me that models can tell when they are in evaluations most of the time, but I’m not sure how much this evidence moves me on that hypothesis vs. the one I suggest above.)
I’m not entirely clear what the implications would be one way or the other, but do we know if models mistakenly take production deployments to be evaluation scenarios? We might be assuaged that our evaluations are still predictive if the model just always thinks it’s probably still in training (which, after all, it has been for its entire “life” so far). Perhaps it would even be sufficient for the model to suspect it is being tested every time it finds itself in a high-stakes scenario.
Given ambiguity about whether GitHub trains models on private repos, I wonder if there’s demand for someone to host a public GitLab (or similar) instance that forbids training models on their repos, and takes appropriate countermeasures against training data web scrapers accessing their public content.
how about a robots.txt?
Yeah, that’s a strong option, which is why I went around checking + linking all the robots.txt files for the websites I listed above :)
In my other post I discuss the tradeoffs of the different approaches. One in particular is that it would be somewhat clumsy to implement post-by-post filters via robots.txt, whereas user-agent filtering can do it just fine.
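To illustrate the post-by-post point, here's a minimal sketch (mine, not from the post) of per-post user-agent filtering on the server side; the bot names, route, and `no_llm_training` flag are all illustrative, and real crawler user-agent strings should be checked against each vendor's documentation.

```python
from flask import Flask, abort, request

app = Flask(__name__)

# Illustrative substrings of AI-training crawler user agents.
AI_TRAINING_BOTS = ("GPTBot", "CCBot", "Google-Extended", "ClaudeBot")

# Stand-in for a real datastore: post id -> (body, author opted out of training?)
POSTS = {
    "alignment-note": ("Please don't train on this.", True),
    "public-announcement": ("Fine to scrape.", False),
}

@app.route("/posts/<post_id>")
def show_post(post_id):
    entry = POSTS.get(post_id)
    if entry is None:
        abort(404)
    body, no_llm_training = entry
    ua = request.headers.get("User-Agent", "")
    if no_llm_training and any(bot in ua for bot in AI_TRAINING_BOTS):
        abort(403)  # block only opted-out posts, and only for these crawlers
    return body
```

robots.txt, by contrast, can only express path-based rules, so per-post opt-out would mean maintaining a separate Disallow line for every opted-out post.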
I think there are two levels of potential protection here. One is a security-like "LLMs must not see this" condition, for which yes, you need to do something that would keep out a human too (though in practice maybe "post only visible to logged-in users" is good enough).
However, I also think there's a lower level of protection that's more like "if you give me the choice, on balance I'd prefer for LLMs not to be trained on this", where some failures are OK and imperfect filtering is better than no filtering. The advantage of targeting this level is simply that it's much easier and less obtrusive, so you can do it at a greater scale with a lower cost. I think this is still worth something.
in case anyone else wanted to look this up, it’s at https://axrp.net/episode/2024/12/01/episode-39-evan-hubinger-model-organisms-misalignment.html#training-away-sycophancy-subterfuge
FWIW I also tried to start a discussion on some aspects of this at https://www.lesswrong.com/posts/yWdmf2bRXJtqkfSro/should-we-exclude-alignment-research-from-llm-training but it didn’t get a lot of eyeballs at the time.
That slides presentation shows me a "you need access" screen. Is it OK for it to be public?
You can't always use liability to internalise all the externality, because e.g. you can't effectively sue companies for more than they have, and for some companies that stay afloat through regular fundraising rounds, that may not even be that much. If they're considering an action that's a coinflip between "we cause some huge liability" and "we triple the value of our company", then it's usually going to be correct from a shareholder perspective to take it, no matter the size of the liability, right?
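To make the coinflip point concrete, here's a toy expected-value calculation (my numbers, purely illustrative): with limited liability, shareholders can't lose more than the company is worth, so the size of the liability beyond that stops affecting the decision.

```python
company_value = 100.0   # current equity value (arbitrary units)
liability = 10_000.0    # harm caused in the bad outcome -- far exceeds the company
p_bad = 0.5             # the coinflip

# Shareholders' downside is capped at losing the whole company.
shareholder_loss = min(liability, company_value)

ev_take_gamble = p_bad * (company_value - shareholder_loss) \
               + (1 - p_bad) * (3 * company_value)
ev_do_nothing = company_value

print(ev_take_gamble, ev_do_nothing)  # 150.0 vs 100.0
# The gamble looks good to shareholders no matter how large `liability` is,
# because everything beyond `company_value` falls on someone else.
```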
Criminal law can increase the deterrent somewhat – probably many people will not accept any amount of money for a significant enough chance of prison – though obviously it's not perfect either.
I heard on the grapevine (this PirateSoftware YouTube Short) that Ko-fi offers a similar service to Patreon but cheaper. I'm curious whether you prefer Patreon or just weren't aware of Ko-fi.
edit: I think the prices in the short are not accurate (maybe outdated?), but I'd guess it still works out cheaper.
Even in real-money prediction markets, "how much real money?" is a crucial question for deciding whether to trust the market or not. If you have a tonne of questions, no easy way to find the "forgotten" markets, and each market has (say) 10s-100s of dollars of orders on the book, then the people skilled enough to do the work likely have better ways to turn their time (and capital) into money. For example, I think some of the more niche Betfair markets are probably not worth taking particularly seriously.
I don’t think it’s reasonable to expect very much from the author, and so I lean away from viewing the lack of citations as something that (meaningfully) weakens the post.
I feel like our expectations of the author and the circumstances of the authorship can inform our opinions of how “blameworthy” the author is for not improving the post in some way, but shouldn’t really have any relevance to what changes would be improvements if they occurred. The latter seems to me to purely be a claim about the text of the post, not a claim about the process that wrote it.
xAI has ambitions to compete with OpenAI and DeepMind, but I don’t feel like it has the same presence in the AI safety discourse. I don’t know anything about its attitude to safety, or how serious a competitor it is. Are there good reasons it doesn’t get talked about? Should we be paying it more attention?
For my taste, the apostrophe in “you’re” is not confusing because quotations can usually only end on word boundaries.
I think (though not confidently) that any attempt to introduce specific semantics to double vs. single quotes is doomed, though. Such conventions probably won’t reach enough adoption that you’ll be able to depend on people adhering to or understanding them.
(My convention is that double quotes and single quotes mean the same thing, and you should generally make it separately clear if you're not literally quoting someone. I mostly only use single quotes for nesting inside double quotes, although the point I made above about quote marks only occurring on word boundaries makes this a redundant clarification.)
Looks like this turned into Large Language Models Often Know When They Are Being Evaluated