Simon Lermen

Karma: 442

Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios

Simon Lermen, Teun van der Weij and Leon Lang

16 May 2023 10:53 UTC

22 points

0 comments · 13 min read · LW link

Robustness of Model-Graded Evaluations and Automated Interpretability

Simon Lermen and viluon

15 Jul 2023 19:12 UTC

44 points

5 comments · 9 min read · LW link

LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B

Simon Lermen and Jeffrey Ladish

12 Oct 2023 19:58 UTC

148 points

29 comments · 14 min read · LW link

Creating unrestricted AI Agents with Command R+

Simon Lermen

16 Apr 2024 14:52 UTC

70 points

12 comments · 5 min read · LW link

Applying refusal-vector ablation to a Llama 3 70B agent

Simon Lermen

11 May 2024 0:08 UTC

41 points

7 comments · 7 min read · LW link