Simon Lermen
Karma: 442
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
Simon Lermen, Teun van der Weij and Leon Lang
16 May 2023 10:53 UTC, 22 points, 0 comments, 13 min read, LW link
Robustness of Model-Graded Evaluations and Automated Interpretability
Simon Lermen and viluon
15 Jul 2023 19:12 UTC, 44 points, 5 comments, 9 min read, LW link
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen and Jeffrey Ladish
12 Oct 2023 19:58 UTC, 148 points, 29 comments, 14 min read, LW link
Creating unrestricted AI Agents with Command R+
Simon Lermen
16 Apr 2024 14:52 UTC, 70 points, 12 comments, 5 min read, LW link
Applying refusal-vector ablation to a Llama 3 70B agent
Simon Lermen
11 May 2024 0:08 UTC, 41 points, 7 comments, 7 min read, LW link