Simon Lermen

Karma: 1,472

Substack: https://substack.com/@simonlermen

Twitter: @SimonLermenAI

Will We Get Alignment by Default? — with Adrià Garriga-Alonso

27 Nov 2025 19:19 UTC
41 points
3 comments · 1 min read · LW link
(simonlermen.substack.com)

Comment on Natural Emergent Misalignment Paper by Anthropic

Simon Lermen · 23 Nov 2025 4:21 UTC
20 points
0 comments · 4 min read · LW link

Jailbreaking AI models to Phish Elderly Victims

18 Nov 2025 23:17 UTC
17 points
0 comments · 2 min read · LW link
(simonlermen.substack.com)

AI 2025 - Last Shipmas

Simon Lermen · 17 Nov 2025 19:39 UTC
55 points
5 comments · 7 min read · LW link

Universal Basic Income in an AGI Future

Simon Lermen · 11 Nov 2025 2:26 UTC
21 points
1 comment · 2 min read · LW link
(simonlermen.substack.com)

Anthropic & Dario’s dream

Simon Lermen · 8 Nov 2025 1:19 UTC
54 points
1 comment · 5 min read · LW link

Comparative advantage & AI

Simon Lermen · 3 Nov 2025 21:50 UTC
113 points
28 comments · 4 min read · LW link

Model welfare and open source

Simon Lermen · 2 Nov 2025 2:29 UTC
15 points
1 comment · 5 min read · LW link

Simon Lermen’s Shortform

Simon Lermen · 6 Oct 2025 15:04 UTC
5 points
43 comments · 1 min read · LW link

Why I don’t believe Superalignment will work

Simon Lermen · 22 Sep 2025 17:10 UTC
47 points
6 comments · 5 min read · LW link

Human study on AI spear phishing campaigns

3 Jan 2025 15:11 UTC
81 points
8 comments · 5 min read · LW link

Current safety training techniques do not fully transfer to the agent setting

3 Nov 2024 19:24 UTC
162 points
9 comments · 5 min read · LW link

Deceptive agents can collude to hide dangerous features in SAEs

15 Jul 2024 17:07 UTC
33 points
2 comments · 7 min read · LW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen · 11 May 2024 0:08 UTC
51 points
14 comments · 7 min read · LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen · 16 Apr 2024 14:52 UTC
77 points
13 comments · 5 min read · LW link

unRLHF—Efficiently undoing LLM safeguards

12 Oct 2023 19:58 UTC
117 points
15 comments · 20 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
151 points
29 comments · 14 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
47 points
5 comments · 9 min read · LW link

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
26 points
0 comments · 13 min read · LW link