Simon Lermen

Karma: 890

Twitter: @SimonLermenAI

Simon Lermen’s Shortform

Simon Lermen, 6 Oct 2025 15:04 UTC
5 points
1 comment, 1 min read

Why I don’t believe Superalignment will work

Simon Lermen, 22 Sep 2025 17:10 UTC
44 points
6 comments, 5 min read

Human study on AI spear phishing campaigns

3 Jan 2025 15:11 UTC
81 points
8 comments, 5 min read

Current safety training techniques do not fully transfer to the agent setting

3 Nov 2024 19:24 UTC
160 points
9 comments, 5 min read

Deceptive agents can collude to hide dangerous features in SAEs

15 Jul 2024 17:07 UTC
33 points
2 comments, 7 min read

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen, 11 May 2024 0:08 UTC
51 points
14 comments, 7 min read

Creating unrestricted AI Agents with Command R+

Simon Lermen, 16 Apr 2024 14:52 UTC
77 points
13 comments, 5 min read

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments, 20 min read

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments, 14 min read

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments, 9 min read

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments, 13 min read