rusheb
Karma: 94
A starter guide for evals
Marius Hobbhahn, Jérémy Scheurer, Mikita Balesni, rusheb and AlexMeinke
8 Jan 2024 18:24 UTC, 44 points, 2 comments, 12 min read (www.apolloresearch.ai)
Scalable And Transferable Black-Box Jailbreaks For Language Models Via Persona Modulation
Soroush Pour, rusheb, Quentin FEUILLADE--MONTIXI, Arush and scasper
7 Nov 2023 17:59 UTC, 36 points, 2 comments, 2 min read (arxiv.org)
Understanding mesa-optimization using toy models
tilmanr, rusheb, Guillaume Corlouer, Dan Valentine, afspies, mivanitskiy and Can
7 May 2023 17:00 UTC, 42 points, 2 comments, 10 min read