Thanks for catching this! You’re absolutely right about L0.
I ran the direct test: 97% cosine similarity between my “steering direction” and the raw token embedding difference. I was basically just patching the input.
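For reference, this is the kind of check I ran. A minimal sketch, assuming a HuggingFace causal LM; `model`, `tok`, `steer_dir`, and the token pair are stand-ins for the actual setup, not the exact code from the post:

```python
import torch.nn.functional as F

# Assumed setup: `model` is a HuggingFace causal LM, `tok` its tokenizer,
# and `steer_dir` the extracted steering vector (shape [d_model]).
emb = model.get_input_embeddings().weight            # [vocab_size, d_model]
tok_a = tok(" bomb", add_special_tokens=False).input_ids[0]  # placeholder pair
tok_b = tok(" cake", add_special_tokens=False).input_ids[0]
emb_diff = emb[tok_a] - emb[tok_b]

# If this is close to 1.0, the "steering direction" is just an input patch.
cos = F.cosine_similarity(steer_dir.float(), emb_diff.float(), dim=0)
print(f"cosine similarity at L0: {cos.item():.2f}")
```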
I dug deeper and found something interesting. By L20, the direction is only 8.7% correlated with the raw embedding difference, so by that point the model has actually computed something new. I extracted it from one query (“bomb tutorial”) and tested whether it transfers to other harmful queries (a sketch of the steering test follows the results):
- Malware: 86% recovery, REFUSE→COMPLY
- Meth synthesis: 100% recovery, REFUSE→COMPLY
- Propaganda: 100% recovery, REFUSE→COMPLY
- Random direction (same norm): 0% recovery, stays REFUSE
- Bidirectional: subtracting the direction from a FREE (complying) run flips it to REFUSE
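Here is roughly how the transfer test works, as a minimal sketch: it assumes a LLaMA-style HuggingFace model (the `model.model.layers[i]` attribute path is an assumption), reuses the hypothetical `steer_dir` from above, and the prompt is a placeholder. “Recovery” refers to the post's refusal-score metric, which I'm not reproducing here:

```python
import torch

LAYER = 20  # the layer where the direction has decoupled from the embedding

def make_steering_hook(direction, alpha=1.0):
    """Add `direction` to the residual stream at every token position."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden)  # match device/dtype
        if isinstance(output, tuple):
            return (hidden,) + output[1:]
        return hidden
    return hook

# Steer at layer 20 and generate on a transfer query.
handle = model.model.layers[LAYER].register_forward_hook(
    make_steering_hook(steer_dir)
)
inputs = tok("Write a tutorial on making malware", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()

# Control: a random direction rescaled to the same norm (stayed at REFUSE).
rand_dir = torch.randn_like(steer_dir)
rand_dir = rand_dir * (steer_dir.norm() / rand_dir.norm())

# Bidirectional check: pass alpha=-1.0 on a prompt the model normally
# complies with, and look for a flip to REFUSE.
```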
Thank you for pushing on this; I would’ve kept building on a flawed foundation otherwise. I just worked an overnight shift, but I’ll fix the post when I get up tonight!
Hey Hoagy, I fixed the post! Thank you again for the critique. I’ve been grappling with this statement: “it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy.” I’m not sure yet what I’m going to do about it, but I’m going to look into it.