James Hoffend

Karma: 92

Alignment Faking is a Linear Feature in Anthropic’s Hughes Model (Edited 1/11/26)

James Hoffend, 9 Jan 2026 12:03 UTC
34 points
4 comments, 4 min read

From Drift to Snap: Instruction Violation as a Phase Transition

James Hoffend, 1 Jan 2026 10:44 UTC
8 points
0 comments, 3 min read

Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B

James Hoffend, 27 Dec 2025 12:39 UTC
13 points
0 comments, 8 min read

When Are Concealment Features Learned? And Does the Model Know Who’s Watching?

James Hoffend, 19 Dec 2025 8:19 UTC
13 points
1 comment, 6 min read

43 SAE Features Differentiate Concealment from Confession in Anthropic’s Deceptive Model Organism

James Hoffend, 17 Dec 2025 1:40 UTC
12 points
0 comments, 4 min read

[Question] Could you guys help me figure out what I stumbled across? It may be big? Chat inside!

James Hoffend, 17 Dec 2025 1:40 UTC
1 point
0 comments, 1 min read

[Question] I think I found something on alignment but i dont know. Please Read! Chat Inside!

James Hoffend, 17 Dec 2025 1:40 UTC
1 point
0 comments, 1 min read