James Hoffend
Karma: 92
Alignment Faking is a Linear Feature in Anthropic’s Hughes Model (Edited 1/11/26)
James Hoffend
9 Jan 2026 12:03 UTC
34
points
4
comments
4
min read
From Drift to Snap: Instruction Violation as a Phase Transition
James Hoffend
1 Jan 2026 10:44 UTC
8
points
0
comments
3
min read
Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B
James Hoffend
27 Dec 2025 12:39 UTC
13
points
0
comments
8
min read
When Are Concealment Features Learned? And Does the Model Know Who’s Watching?
James Hoffend
19 Dec 2025 8:19 UTC
13
points
1
comment
6
min read
43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism
James Hoffend
17 Dec 2025 1:40 UTC
12
points
0
comments
4
min read
[Question]
Could you guys help me figure out what I stumbled across? It may be big? Chat inside!
James Hoffend
17 Dec 2025 1:40 UTC
1
point
0
comments
1
min read
[Question]
I think I found something on alignment but i dont know. Please Read! Chat Inside!
James Hoffend
17 Dec 2025 1:40 UTC
1
point
0
comments
1
min read