lennie

Karma: 53

Whack-a-mole: generalisation resistance could be facilitated by training-distribution imprintation

lennie · 13 Dec 2025 17:46 UTC
21 points
0 comments · 14 min read · LW link

Which differences between sandbagging evaluations and sandbagging safety research are important for control?

lennie · 6 Oct 2025 18:20 UTC
6 points
0 comments · 11 min read · LW link

Sandbagging: distinguishing detection of underperformance from incrimination, and the implications for downstream interventions.

lennie · 6 Oct 2025 14:00 UTC
8 points
0 comments · 8 min read · LW link

[Question] Feedback request: Is the time right for an AI Safety stack exchange?

lennie · 26 Sep 2025 9:14 UTC
22 points
0 comments · 4 min read · LW link