Simon Lermen

Karma: 442

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

16 May 2023 10:53 UTC
22 points
0 comments · 13 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

15 Jul 2023 19:12 UTC
44 points
5 comments · 9 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

12 Oct 2023 19:58 UTC
148 points
29 comments · 14 min read · LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen · 16 Apr 2024 14:52 UTC
70 points
12 comments · 5 min read · LW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen · 11 May 2024 0:08 UTC
41 points
7 comments · 7 min read · LW link