lennie

Karma: 197

Toy Models of Initialisation Effects on RL Dynamics

Edward James Young and lennie

14 Jul 2026 7:04 UTC

78 points

2 comments, 13 min read

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

lennie, joanv, Shi and Jacob Pfau

2 Jul 2026 17:42 UTC

77 points

4 comments, 21 min read

Whack-a-mole: generalisation resistance could be facilitated by training-distribution imprintation

lennie

13 Dec 2025 17:46 UTC

27 points

0 comments, 14 min read

lennie 13 Dec 2025 16:44 UTC
2 points
0
on: The behavioral selection model for predicting AI motivations
Hi Alex—great post—thanks so much!
I’m intrigued by your thoughts on the list of different ‘priors’. I actually tried to explain some of these ideas in a lecture earlier this year, largely drawing from Evan’s presentation in ‘How likely is deceptive alignment?’; the notion of ‘prior’ here is clearly important, but I found the topic awkward to talk about since I had near-zero intuition for which arguments were more or less relevant to the current LLM (or near-future AI) paradigm.
Your section mostly refers to Joe Carlsmith’s ‘Will AIs fake alignment...’ paper from 2023, which has really nice explanations of Joe’s PoV from then, and outlines some directions for empirical research.
Are you (or anyone else) aware of any more recent work on the matter?
(I’d be interested to know both about empirics, and conceptual/heuristic takes/syntheses).
Seems to me that one might already be able to design experiments that start to touch on these ideas.
Would be very interested to discuss possible experimental ideas if this inspires any!

Which differences between sandbagging evaluations and sandbagging safety research are important for control?

lennie

6 Oct 2025 18:20 UTC

6 points

0 comments, 11 min read

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

lennie

6 Oct 2025 14:00 UTC

8 points

0 comments, 8 min read

[Question] Feedback request: Is the time right for an AI Safety stack exchange?

lennie

26 Sep 2025 9:14 UTC

22 points

0 comments, 4 min read

lennie 24 Sep 2025 13:30 UTC
1 point
0
in reply to: Ruby’s comment on: LessWrong FAQ
Cool! Thanks Ruby!
Yes, would be keen to play around with that!
What sort of tech stack are you using for this?

Your idea of Claude-integration was super interesting—and got me thinking about how best to let arbitrary LLMs interface with LW/AF content.
So I asked Claude about it—see this chat.
Claude suggested that building a custom MCP server might be ‘straightforward’, and that it would allow anyone using the Claude API to immediately access LW/AF content in a structured form.
It would be ideal to have this as a default/optional feature of the Claude web-interface, but that would require Anthropic’s buy-in.
How excited are you about these directions?

Re making exporting easy: I think a ‘paste to markdown’ button would still/also be helpful—and that I’d use this a lot if available, even without LLM integrations. Do you think others would also be interested? / Is anyone else also interested?

lennie 15 Sep 2025 14:50 UTC
1 point
0
on: LessWrong FAQ
I’m wondering about best practices for pasting content from LW into LLM context.
I would like to have a Q&A session with Claude about a set of posts, and am looking for a good text representation to paste into context. Web search APIs still seem a bit hit and miss, and I think a custom solution could have value. Something like a ‘paste to markdown’ button would be ideal.
I think this code from 2022 might be a good first pass, which I found linked from this comment thread.

Does this sound like a good idea?
If so, might Lightcone Infrastructure be interested in creating such a feature?
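
To make the ‘paste to markdown’ idea concrete, here is a minimal sketch of what such an export could do, using only the Python standard library. It is not the 2022 code linked above and not anything LW actually ships; the names `MarkdownExporter` and `html_to_markdown` are hypothetical, and it assumes the post body is already available as an HTML string. It handles only headings, paragraphs, emphasis, links and lists (ordered lists are flattened to bullets).

```python
import re
from html.parser import HTMLParser


class MarkdownExporter(HTMLParser):
    """Hypothetical helper: converts a small subset of HTML to Markdown."""

    HEADINGS = {"h1": "# ", "h2": "## ", "h3": "### "}
    INLINE = {"em": "*", "i": "*", "strong": "**", "b": "**"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # output fragments, joined at the end
        self.href = None  # target of the link currently being read

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.parts.append("\n\n" + self.HEADINGS[tag])
        elif tag == "p":
            self.parts.append("\n\n")
        elif tag in ("ul", "ol"):
            # Blank line before a list; ordered lists become bullets here.
            self.parts.append("\n")
        elif tag == "li":
            self.parts.append("\n- ")
        elif tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.href = dict(attrs).get("href")
            self.parts.append("[")

    def handle_endtag(self, tag):
        if tag in self.INLINE:
            self.parts.append(self.INLINE[tag])
        elif tag == "a":
            self.parts.append(f"]({self.href or ''})")
            self.href = None

    def handle_data(self, data):
        # Collapse whitespace runs, but keep single spaces between words.
        self.parts.append(re.sub(r"\s+", " ", data))


def html_to_markdown(html: str) -> str:
    """Return a Markdown rendering of `html` (supported subset only)."""
    exporter = MarkdownExporter()
    exporter.feed(html)
    exporter.close()
    text = "".join(exporter.parts)
    text = re.sub(r"[ \t]*\n[ \t]*", "\n", text)  # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)        # at most one blank line
    return text.strip()


sample = (
    "<h1>Title</h1>\n"
    '<p>Hello <em>world</em>, see <a href="https://example.com">this</a>.</p>\n'
    "<ul><li>one</li><li>two</li></ul>"
)
print(html_to_markdown(sample))
```

A real feature would also need to handle block quotes, code blocks, footnotes and LaTeX, which this sketch ignores.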