Simon Lermen
Karma: 399
Creating unrestricted AI Agents with Command R+
Simon Lermen | 16 Apr 2024 14:52 UTC | 72 points | 12 comments | 5 min read
unRLHF—Efficiently undoing LLM safeguards
Pranav Gade, Jeffrey Ladish and Simon Lermen | 12 Oct 2023 19:58 UTC | 117 points | 15 comments | 20 min read
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen and Jeffrey Ladish | 12 Oct 2023 19:58 UTC | 148 points | 29 comments | 14 min read
Robustness of Model-Graded Evaluations and Automated Interpretability
Simon Lermen and viluon | 15 Jul 2023 19:12 UTC | 44 points | 5 comments | 9 min read
Evaluating Language Model Behaviours for Shutdown Avoidance in Textual Scenarios
Simon Lermen, Teun van der Weij and Leon Lang | 16 May 2023 10:53 UTC | 22 points | 0 comments | 13 min read