lennie

Karma: 197

Toy Models of Initialisation Effects on RL Dynamics

Edward James Young and lennie

14 Jul 2026 7:04 UTC

78 points

2 comments, 13 min read

Research update: RL on Debate Games shows Proposal Accuracy uplift alongside Judge Hacking

lennie, joanv, Shi and Jacob Pfau

2 Jul 2026 17:42 UTC

77 points

4 comments, 21 min read

Whack-a-mole: generalisation resistance could be facilitated by training-distribution imprintation

lennie

13 Dec 2025 17:46 UTC

27 points

0 comments, 14 min read

Which differences between sandbagging evaluations and sandbagging safety research are important for control?

lennie

6 Oct 2025 18:20 UTC

7 points

0 comments, 11 min read

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

lennie

6 Oct 2025 14:00 UTC

8 points

0 comments, 8 min read

[Question] Feedback request: Is the time right for an AI Safety stack exchange?

lennie

26 Sep 2025 9:14 UTC

22 points

0 comments, 4 min read