James Hoffend

Karma: 92

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model (Edited 1/11/26)

James Hoffend · 9 Jan 2026 12:03 UTC

34 points

4 comments · 4 min read · LW link

From Drift to Snap: Instruction Violation as a Phase Transition

James Hoffend · 1 Jan 2026 10:44 UTC

8 points

0 comments · 3 min read · LW link

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B

James Hoffend · 27 Dec 2025 12:39 UTC

13 points

0 comments · 8 min read · LW link

When Are Concealment Features Learned? And Does the Model Know Who’s Watching?

James Hoffend · 19 Dec 2025 8:19 UTC

13 points

1 comment · 6 min read · LW link

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism

James Hoffend · 17 Dec 2025 1:40 UTC

12 points

0 comments · 4 min read · LW link

[Question] Could you guys help me figure out what I stumbled across? It may be big? Chat inside!

James Hoffend · 17 Dec 2025 1:40 UTC

1 point

0 comments · 1 min read · LW link

[Question] I think I found something on alignment but i dont know. Please Read! Chat Inside!

James Hoffend · 17 Dec 2025 1:40 UTC

1 point

0 comments · 1 min read · LW link