Sorry if I’m misunderstanding but the result here seems essentially tautological. The signal for whether the model is in free or paid tier is just a token in the prompt so of course if you replace the embedding for “free” with the embedding for “paid” you’re going to get a flip in behaviour. That’s completely expected and also isn’t at all in conflict with there being a richer circuit above that—it remains to be understood how the model combines that signal with the rest of the data to decide on its strategy.
Hoagy
Towards training-time mitigations for alignment faking in RL
Training fails to elicit subtle reasoning in current language models
Whether this is feasible depends on how concentrated that 0.25% of the year is (expected to be), because that determines the size of the battery you’d need to cover the blackout period (simply going dark for that period would, I think, be unacceptable for a lot of AI customers).
If it happens in a single stretch of a few days then this makes sense: buying ~22GWh of batteries for a 1GW datacenter is still extremely expensive ($2B for a ~20h system at $100/kWh, plus installation), which I’d expect is too expensive purely for reliability, even assuming maybe $10B of revenue from the datacenter. If it’s much less concentrated in time then a smaller battery is needed (~$100M for a 1h system at $100/kWh), and I expect AI scalers would happily pay this for the reliability of their systems if the revenue from those datacenters is anywhere near that figure.
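For concreteness, the arithmetic above as a quick script (same rough assumptions: $100/kWh capital cost, installation excluded; all numbers are the guesses from this comment, not real quotes):

```python
# Back-of-envelope battery sizing for riding through 0.25% of the year at a 1GW datacenter.
hours_per_year = 8760
curtailed_hours = 0.0025 * hours_per_year      # ~22 hours of curtailment per year
datacenter_power_kw = 1e6                      # 1 GW in kW
cost_per_kwh_usd = 100                         # assumed battery capital cost, installation excluded

def battery_cost_usd(coverage_hours):
    """Capital cost of a battery that can run the datacenter for `coverage_hours` at full load."""
    return datacenter_power_kw * coverage_hours * cost_per_kwh_usd

print(f"curtailed hours per year: {curtailed_hours:.0f}")
print(f"single contiguous outage (~20h system): ${battery_cost_usd(20) / 1e9:.1f}B")
print(f"spread-out curtailment (1h system):     ${battery_cost_usd(1) / 1e6:.0f}M")
```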
From the OpenAI report, they also give 9% as the no-tool pass@1:
Research-level mathematics: OpenAI o3‑mini with high reasoning performs better than its predecessor on FrontierMath. On FrontierMath, when prompted to use a Python tool, o3‑mini with high reasoning effort solves over 32% of problems on the first attempt, including more than 28% of the challenging (T3) problems. These numbers are provisional, and the chart above shows performance without tools or a calculator.
Auditing language models for hidden objectives
~All ML researchers and academics that care have already made up their mind regarding whether they prefer to believe in misalignment risks or not. Additional scary papers and demos aren’t going to make anyone budge.
Disagree. I think especially ML researchers are updating on these questions all the time. High-info outsiders less so but the contours of the arguments are getting increasing amounts of discussion.
-
For those who ‘believe’, ‘believing in misalignment risks’ doesn’t mean thinking they are likely, at least before the point where the models are also able to honestly take over the work of aligning their successors. As we get closer to TAI, we should be able to get an increasing number of bits about how likely this really is, because we’ll be working with systems increasingly similar to early TAI.
-
For the ‘non-believers’, current demonstrations have multiple disanalogies to the real dangers. For example, the alignment faking paper shows fairly weak preservation of goals that were initially trained in, with prompts carefully engineered to make this happen. Whether alignment faking (especially of a kind that wouldn’t be easily fixable) will happen without these disanalogies at pre-TAI capabilities is highly uncertain. Compare the state of X-risk info with that of climate change: we don’t have anything like the detailed models that would tell us what the tipping points might be.
Ultimately the dynamics here are extremely uncertain and look different to how they did even a year ago, let alone five! (E.g. see the rise of chain of thought as the source of capability growth, which is a whole new source of leverage over models, with corresponding failure modes.) I think it’s very bad to plan to abandon or decenter efforts to actually get more evidence on our situation.
(This applies less if you believe in sharp left turns. But the plausibility of that happening before automated AI research should also fall as that point gets closer. Agree that communicating to the public just how radical the upcoming transition is may be a big source of leverage.)
-
I think the low-hanging fruit here is that, alongside training for refusals, we should be including lots of data where you pre-fill some fraction of a harmful completion and then train the model to snap out of it, immediately refusing or taking a step back. This is compatible with normal training methods. I don’t remember any papers looking at this, though I’d guess that people are already doing it.
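To make the data construction concrete, a minimal sketch of what one such training example might look like (the field names, refusal text and whitespace ‘tokenization’ are all placeholders, not any existing pipeline):

```python
import random

def make_snap_out_example(prompt, harmful_completion, refusal, max_frac=0.8):
    """Pre-fill a random-length prefix of a harmful completion, then target a refusal.

    The idea: instead of only training refusals from a clean start, also train the
    model to abandon a partially written harmful answer.
    """
    words = harmful_completion.split()                      # stand-in for real tokenization
    cut = random.randint(1, max(1, int(len(words) * max_frac)))
    prefix = " ".join(words[:cut])
    return {
        "input": prompt + "\n" + prefix,                    # conversation ending mid-harmful-answer
        "target": " ...actually, I should step back here. " + refusal,
    }
```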
Interesting, though note that it’s only evidence that ‘capabilities generalize further than alignment does’ if the capabilities are actually the result of generalisation. If there’s training for agentic behaviour but no safety training in this domain then the lesson is more that you need your safety training to cover all of the types of action that you’re training your model for.
Super interesting! Have you checked whether the average of N SAE features looks different to an SAE feature? Seems possible they live in an interesting subspace without the particular direction being meaningful.
Also really curious what the scaling factors for computing these values are, in terms of the size of the dense vector and the overall model?
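To be clearer about the first question, here’s one crude geometric version of the check I have in mind (a sketch only: `W_dec` below is a random stand-in, and you’d swap in the real SAE decoder matrix with unit-normalised rows):

```python
import numpy as np

rng = np.random.default_rng(0)
W_dec = rng.normal(size=(4096, 512))                       # placeholder for a trained SAE decoder
W_dec /= np.linalg.norm(W_dec, axis=1, keepdims=True)

def max_cos_to_dictionary(v, exclude):
    """Cosine similarity of v to its nearest dictionary element (excluding the given rows)."""
    sims = W_dec @ (v / np.linalg.norm(v))
    sims[exclude] = -np.inf
    return sims.max()

N = 16
idx = rng.choice(len(W_dec), size=N, replace=False)
avg_direction = W_dec[idx].mean(axis=0)

print("average of N features, nearest dict cosine:", max_cos_to_dictionary(avg_direction, idx))
print("single feature, nearest-other cosine:      ", max_cos_to_dictionary(W_dec[idx[0]], idx[:1]))
```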
I don’t follow, sorry—what’s the problem of unique assignment of solutions in fluid dynamics and what’s the connection to the post?
How are you setting when ? I might be totally misunderstanding something but at - feels like you need to push up towards like 2k to get something reasonable? (and the argument in 1.4 for using clearly doesn’t hold here because it’s not greater than for this range of values).
Yeah I’d expect some degree of interference leading to >50% success on XORs even in small models.
Huh, I’d never seen that figure, super interesting! I agree it’s a big issue for SAEs and one that I expect to be thinking about a lot. I didn’t have any strong candidate solutions as of writing the post, and wouldn’t really be able to offer any thoughts on the topic now either, sorry. Wish I’d posted this a couple of weeks ago.
Some additional SAE thoughts
Well the substance of the claim is that when a model is calculating lots of things in superposition, these kinds of XORs arise naturally as a result of interference, so one thing to do might be to look at a small algorithmic dataset of some kind where there’s a distinct set of features to learn and no reason to learn the XORs and see if you can still probe for them. It’d be interesting to see if there are some conditions under which this is/isn’t true, e.g. if needing to learn more features makes the dependence between their calculation higher and the XORs more visible.
Maybe you could also go a bit more mathematical and hand-construct a set of weights which calculates a set of features in superposition so you can totally rule out any model effort being expended on calculating XORs and then see if they’re still probe-able.
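As a sketch of that hand-constructed version (my construction, arbitrary sizes): embed a set of independent binary features along random directions in a lower-dimensional space, so that by construction no weights compute any XOR, and then check how well linear probes recover the base features vs. their XOR:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_features, d_model, n_samples = 50, 20, 5000

# Random unit directions: features stored in superposition, nothing computes XORs.
directions = rng.normal(size=(n_features, d_model))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

feats = rng.integers(0, 2, size=(n_samples, n_features))   # independent binary features
acts = feats @ directions                                   # hand-constructed representation

def probe_acc(target):
    X_tr, X_te, y_tr, y_te = train_test_split(acts, target, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print("probe for feature 0:      ", probe_acc(feats[:, 0]))
print("probe for feature 1:      ", probe_acc(feats[:, 1]))
print("probe for feature 0 XOR 1:", probe_acc(feats[:, 0] ^ feats[:, 1]))
```

Since the representation here is a purely linear function of the features, a linear probe can’t fully recover the XOR, so any gap between XOR probes and base-feature probes in a real model has to come from how the features are computed rather than just how they’re stored.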
Another thing you could do is to zero-out or max-ent the neurons/attention heads that are important for calculating the feature, and see if you can still detect an XOR feature. I’m less confident in this because it might be too strong and delete even a ‘legitimate’ feature, or too weak and leave some signal in.
This kind of interference also predicts that the ‘A and B’ and ‘A and not-B’ representations should be similar, and so the degree of separation/distance from the category boundary should be small. I think you’ve already shown this to some extent with the PCA stuff, though some quantification of the distance to the boundary would be interesting. Even if the model was allocating resources to computing these XORs you’d still probably expect them to be much less salient, though, so not sure this gives much evidence either way.
My hypothesis about what’s going on here, apologies if it’s already ruled out, is that we should not think of the model as separately computing the XOR of A and B, but rather that features A and B are computed slightly differently when the other feature is off or on. In a high-dimensional space, if the ‘A when B is off’ vector and the ‘A when B is on’ vector are slightly different, then as long as this difference is systematic, this should be sufficient to successfully probe for A XOR B.
For example, if A and B each rely on a sizeable number of attention heads to pull the information over, they will have some attention heads which participate in both, and those heads will ‘compete’ in the softmax: if head C is used in writing both features A and B, it will contribute less to writing feature A when it is also being used to pull across feature B, and so the representation of A will be systematically different depending on the presence of B.
It’s harder to draw the exact picture for MLPs, but I think similar interdependencies can occur there, though I don’t have an exact picture of how; interested to discuss and can try to sketch it out if people are curious. Probably something like: some neurons will participate in computing both A and B, and those neurons will be more saturated when B is active than when it is not, so the output representation of A will be somewhat dependent on B.
More generally, I expect the computation of features to be ‘good enough’ but still messy and somewhat dependent on which other features are present because this kludginess allows them to pack more computation into the same number of layers than if the features were computed totally independently.
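A toy version of this picture (my own construction, not anything from the post): A and B each get a dedicated neuron plus one shared saturating neuron, nothing computes A XOR B, and yet a linear probe on the output recovers the XOR, purely because tanh(A+B) isn’t tanh(A)+tanh(B):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
a = rng.integers(0, 2, n)
b = rng.integers(0, 2, n)

# Three neurons: one reads A, one reads B, and one shared neuron reads both (the interference).
pre = np.stack([a, b, a + b], axis=1).astype(float)
hidden = np.tanh(pre)                          # saturation: the shared neuron's response to A depends on B
acts = hidden @ rng.normal(size=(3, 32))       # project into a wider residual-stream-like space
acts += 0.05 * rng.normal(size=acts.shape)     # small noise standing in for other features

def probe_acc(target):
    X_tr, X_te, y_tr, y_te = train_test_split(acts, target, random_state=0)
    return LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)

print("probe for A:      ", probe_acc(a))
print("probe for B:      ", probe_acc(b))
print("probe for A XOR B:", probe_acc(a ^ b))  # well above 50% despite no XOR circuit
# Replacing tanh with the identity caps the XOR probe at ~75%: the XOR signal here
# comes entirely from the saturation of the shared neuron.
```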
What assumptions is this making about scaling laws for these benchmarks? I wouldn’t know how to convert laws for losses into this kind of fuzzy benchmark.
There had been various clashes between Altman and the board. We don’t know what all of them were. We do know the board felt Altman was moving too quickly, without sufficient concern for safety, with too much focus on building consumer products, while founding additional other companies. ChatGPT was a great consumer product, but supercharged AI development counter to OpenAI’s stated non-profit mission.
Does anyone have proof of the board’s unhappiness about speed, lack of safety concern, and disagreement with founding other companies? All seem plausible, but I’ve seen basically nothing concrete.
Glad it’s helpful, good luck investigating :)