Nicholas Schiefer
Karma: 667
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
evhub, Carson Denison, Meg, Monte M, David Duvenaud, Nicholas Schiefer and Ethan Perez
Jan 12, 2024, 7:51 PM, 305 points, 95 comments, 3 min read (arxiv.org)
Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research
evhub, Nicholas Schiefer, Carson Denison and Ethan Perez
Aug 8, 2023, 1:30 AM, 318 points, 30 comments, 18 min read, 1 review
Engineering Monosemanticity in Toy Models
Adam Jermyn, evhub and Nicholas Schiefer
Nov 18, 2022, 1:43 AM, 75 points, 7 comments, 3 min read (arxiv.org)
ELK Proposal—Make the Reporter care about the Predictor’s beliefs
Adam Jermyn and Nicholas Schiefer
Jun 11, 2022, 10:53 PM, 8 points, 0 comments, 6 min read