How do you know?
Sparse Autoencoders Find Highly Interpretable Directions in Language Models
Seems like it’d be better formatted as a nested list, given the volume of text.
Why would we expect the expected level of danger from a model of a certain size to rise as the set of potential solutions grows?
I think both Leap Labs and Apollo Research (both fairly new orgs) are trying to position themselves as offering model auditing services in the way you suggest.
A useful model for why it’s both appealing and difficult to say ‘Doomers and Realists are both against dangerous AI and for safety—let’s work together!’.
Try decomposing the residual stream activations over a batch of inputs somehow (e.g. PCA). Using the principal directions as activation addition directions, do they seem to capture something meaningful?
It’s not PCA but we’ve been using sparse coding to find important directions in activation space (see original sparse coding post, quantitative results, qualitative results).
We’ve found that they’re on average more interpretable than neurons, and I understand that @Logan Riggs and Julie Steele have found some effect using them as directions for activation patching, e.g. using a “this direction activates on curse words” direction to make text more aggressive. If people are interested in exploring this further, let me know: say hi at our EleutherAI channel or check out the repo :)
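For anyone who wants to try the quoted PCA version, here’s a minimal sketch (model, layer, and prompts are arbitrary placeholders of mine, and this isn’t our sparse coding setup):

```python
# Sketch: PCA over residual-stream activations, then treating a principal
# direction as a candidate activation-addition direction.
import torch
from sklearn.decomposition import PCA
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary choice of residual stream position

prompts = [
    "The cat sat on the mat",
    "I really enjoyed that film",
    "The economy shrank last quarter",
    "She slammed the door angrily",
]

acts = []
for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states[LAYER]
    acts.append(hs[0])                    # (seq_len, d_model) activations
acts = torch.cat(acts).numpy()            # pool token positions across the batch

pca = PCA(n_components=10).fit(acts)
direction = torch.tensor(pca.components_[0])   # first principal direction

# To test whether the direction "captures something meaningful", add
# alpha * direction to the residual stream at this layer during generation
# (e.g. via a forward hook) and read off how the completions change.
```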
Hi, nice work! You mentioned the possibility of neurons being the wrong unit. I think that this is the case and that our current best guess for the right unit is directions in the output space, ie linear combinations of neurons.
We’ve done some work using dictionary learning to find these directions (see original post, recent results) and find that with sparse coding we can find dictionaries of features that are more interpretable than the neuron basis (though they don’t explain 100% of the variance).
We’d be really interested to see how this compares to neurons in a test like this and could get a sparse-coded breakdown of gpt2-small layer 6 if you’re interested.
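To give a sense of the dictionary learning setup, here’s a minimal sketch of the general idea using scikit-learn rather than our actual training code (the activation array and hyperparameters below are stand-ins):

```python
# Sketch of sparse coding over activations: learn an overcomplete dictionary and
# treat its atoms as candidate interpretable directions. `acts` is a stand-in
# for real MLP/residual activations; the dictionary ratio and alpha are guesses.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

rng = np.random.default_rng(0)
acts = rng.standard_normal((2000, 64)).astype(np.float32)  # (n_samples, d_model)

d_model = acts.shape[1]
dict_size = 4 * d_model                      # overcomplete dictionary

learner = MiniBatchDictionaryLearning(
    n_components=dict_size,
    alpha=1.0,                               # sparsity penalty
    transform_algorithm="lasso_lars",
)
codes = learner.fit_transform(acts)          # sparse coefficients per sample
directions = learner.components_             # rows = candidate feature directions

# A feature's most-activating samples are the ones with the largest codes;
# those are what you'd read to judge whether the direction is interpretable.
top_for_feature_0 = np.argsort(codes[:, 0])[-10:]
```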
Link at the top doesn’t work for me
I still don’t quite see the connection—if it turns out that LLFC holds between different fine-tuned models to some degree, how will this help us interpolate between different simulacra?
Is the idea that we could fine-tune models to only instantiate certain kinds of behaviour and then use LLFC to interpolate between (and maybe even extrapolate between?) different kinds of behaviour?
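To make the question concrete, here’s a toy sketch of the kind of interpolation experiment I’m imagining (the helper and toy models are placeholders, not a claim about how LLFC is actually tested in the paper):

```python
# Toy sketch: linearly interpolate the weights of two fine-tuned copies of a
# base model and check how the behaviour of the intermediate models changes.
import copy
import torch

def interpolate_state_dicts(model_a, model_b, t):
    """Return a model whose weights are (1 - t) * A + t * B."""
    mixed = copy.deepcopy(model_a)
    sd_a, sd_b = model_a.state_dict(), model_b.state_dict()
    mixed.load_state_dict({k: (1 - t) * sd_a[k] + t * sd_b[k] for k in sd_a})
    return mixed

base = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.ReLU(), torch.nn.Linear(8, 2))
model_a, model_b = copy.deepcopy(base), copy.deepcopy(base)
# (imagine model_a and model_b were each fine-tuned on a different behaviour)

for t in (0.0, 0.5, 1.0, 1.5):   # t > 1 is the "extrapolate" case
    mixed = interpolate_state_dicts(model_a, model_b, t)
    # ...evaluate `mixed` on prompts probing each of the two behaviours...
```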
For the avoidance of doubt, this accounting should recursively aggregate transitive inputs.
What does this mean?
Importantly, this policy would naturally be highly specialized to a specific reward function. Naively, you can’t change the reward function and expect the policy to instantly adapt; instead you would have to retrain the network from scratch.
I don’t understand why standard RL algorithms in the basal ganglia wouldn’t work. Like, most RL problems have elements that can be viewed as homeostatic—if you’re playing boxcart then you need to go left/right depending on position. Why can’t that generalise to seeking food iff stomach is empty? Optimizing for a specific reward function doesn’t seem to preclude that function itself being a function of other things (which just makes it a more complex function).
What am I missing?
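To illustrate the point about the reward function itself being a function of other things, here’s a toy example (all names and numbers invented):

```python
# A single fixed reward function can still be "homeostatic" if the agent's
# internal state is part of what the function reads.
from dataclasses import dataclass

@dataclass
class State:
    stomach_fullness: float  # 0 = starving, 1 = full
    ate_food: bool           # did the agent just eat?

def reward(state: State) -> float:
    # Eating is rewarding only when the stomach is empty; the reward function
    # is fixed, but its value depends on the internal homeostatic variable.
    if state.ate_food:
        return 1.0 - state.stomach_fullness
    return 0.0

print(reward(State(stomach_fullness=0.1, ate_food=True)))  # ~0.9: hungry, ate
print(reward(State(stomach_fullness=0.9, ate_food=True)))  # ~0.1: full, ate
```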
AutoInterpretation Finds Sparse Coding Beats Alternatives
On first glance I thought this was too abstract to be a useful plan, but coming back to it I think it’s promising as a form of automated training for an aligned agent, given that you have an agent that is excellent at evaluating small logic chains, along the lines of Constitutional AI or training for consistency. You could have training loops using synthetic data that train for all of these forms of consistency, probably implementable in an MVP with current systems.
The main unknown would be detecting when you’re confident enough that its stated values are aligned with human values to start moving down the causal chain towards fitting actions to values, since that step is clearly strongly capabilities-enhancing.
Perhaps you could at least get a measure by looking at comparisons that require multiple steps (human value → value → belief, etc.) and asking which step is the bottleneck to reaching the conclusion that humans would want. Positing that the agent is capable of this might be assuming away a lot of the problem, though.
[Replication] Conjecture’s Sparse Coding in Small Transformers
[Replication] Conjecture’s Sparse Coding in Toy Models
Do you have a writeup of the other ways of performing these edits that you tried and why you chose the one you did?
In particular, I’m surprised by the chosen method of adding the activations, because the tokens of the different prompts don’t line up with each other in the way I would have thought was necessary for this approach to work; it’s super interesting to me that it does.
If I were to try to reinvent the system after just reading the first paragraph or two, I would have done something like the following (rough sketch below):
Take multiple pairs of prompts that differ primarily in the property we’re trying to capture.
Take the difference in the residual stream at the next token.
Take the average difference vector, and add that to every position in the new generated text.
I’d love to know which parts were chosen among many as the ones which worked best and which were just the first/only things tried.
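For reference, here’s a rough sketch of that difference-vector version (the model, layer, prompts, and hook details are all placeholder choices of mine, not the method from the post):

```python
# Sketch: average the residual-stream differences of contrasting prompt pairs,
# then add that vector at every position while generating.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
LAYER = 6  # arbitrary layer choice

def resid_at_last_token(prompt):
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hs = model(ids, output_hidden_states=True).hidden_states
    return hs[LAYER][0, -1]                  # residual stream at the final position

# 1. Pairs of prompts differing mainly in the property of interest (toy examples).
pairs = [("I love you", "I hate you"), ("That was wonderful", "That was awful")]

# 2./3. Difference at the next-token position, averaged over pairs.
steer = torch.stack([resid_at_last_token(a) - resid_at_last_token(b)
                     for a, b in pairs]).mean(0)

# 4. Add the averaged vector at every position during generation, via a hook on
#    the block whose output matches hidden_states[LAYER].
def add_steer(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + steer
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steer)
ids = tok("I think that you", return_tensors="pt").input_ids
print(tok.decode(model.generate(ids, max_new_tokens=20)[0]))
handle.remove()
```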
eedly → feedly
Yeah, I agree it’s not in human brains; I’m not really disagreeing with the bulk of the argument re brains, just about whether it does much to reduce foom %. Maybe it constrains the ultra-fast scenarios a bit, but not much more imo.
“Small” (ie << 6 OOM) jump in underlying brain function from current paradigm AI → Gigantic shift in tech frontier rate of change → Exotic tech becomes quickly reachable → YudFoom
Hi Charlie, yep it’s in the paper—but I should say that we did not find a working CUDA-compatible version and used the scikit version you mention. This meant that the data volumes used are somewhat limited—still on the order of a million examples but 10-50x less than went into the autoencoders.
It’s not clear whether the extra data would provide much signal, since ICA can’t learn an overcomplete basis and so has no way of learning rare features, but it might still outperform our ICA baseline presented here. So if you wanted to give someone the project of making a CUDA-compatible version available, I’d be interested to see it!
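For anyone picking this up, here’s roughly what the scikit-learn baseline looks like (the shapes and settings below are placeholders rather than the exact ones from the paper):

```python
# Minimal sketch of a scikit-learn ICA baseline over activations. Note that
# n_components can be at most d_model, so unlike the autoencoders it cannot
# learn an overcomplete basis. The activation array is a stand-in for real data.
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
acts = rng.standard_normal((20_000, 256)).astype(np.float32)  # (n_samples, d_model)

ica = FastICA(n_components=acts.shape[1], max_iter=500)
codes = ica.fit_transform(acts)        # per-example coefficients
directions = ica.components_           # candidate feature directions (complete basis)
```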