Adam Jermyn

Karma: 1,712

Adam Jermyn 2 Apr 2025 15:48 UTC
3 points
1
in reply to: Annapurna’s comment on: Tracing the Thoughts of a Large Language Model
As long as you make it clear at the header that it’s your unofficial translation, go for it!

Adam Jermyn 28 Mar 2025 16:36 UTC
7 points
3
in reply to: derek shiller’s comment on: Tracing the Thoughts of a Large Language Model
I would guess that models plan in this style much more generally. It’s just useful in so many contexts. For instance, if you’re trying to choose what article goes in front of a word, and that word is fixed by other constraints, you need a plan of what that word is (“an astronomer” not “a astronomer”). Or you might be writing code and have to know the type of the return value of a function before you’ve written the body of the function, since Python type annotations come at the start of the function in the signature. Etc. This sort of thing just comes up all over the place.

Adam Jermyn 28 Mar 2025 16:33 UTC
8 points
0
in reply to: Archimedes’s comment on: Tracing the Thoughts of a Large Language Model
It’s not so much that we didn’t think models plan ahead in general, as that we had various hypotheses (including “unknown unknowns”) and this kind of planning in poetry wasn’t obviously the best one until we saw the evidence.

[More generally: in Interpretability we often have the experience of being surprised by the specific mechanism a model is using, even though with the benefit of hindsight it seems obvious. E.g. when we did the work for Towards Monosemanticity we were initially quite surprised to see the “the in <context>” features, thought they were indicative of a bug in our setup, and had to spend a while thinking about them and poking around before we realized why the model wanted them (which now feels obvious).]

Tracing the Thoughts of a Large Language Model

Adam Jermyn 27 Mar 2025 17:20 UTC

308 points

23 comments · 10 min read · LW link

(www.anthropic.com)

Auditing language models for hidden objectives

Sam Marks, Johannes Treutlein, dmz, Sam Bowman, Hoagy, Carson Denison, Kei Nishimura-Gasparian, 7vik, Akbir Khan, Austin Meek, Euan Ong, Christopher Olah, Fabien Roger, jeanne_, Meg, Drake Thomas, Adam Jermyn, Monte M and evhub

13 Mar 2025 19:18 UTC

155 points

15 comments · 13 min read · LW link

Adam Jermyn 30 Dec 2024 4:59 UTC
10 points
5
in reply to: habryka’s comment on: evhub’s Shortform
I can also confirm (I have a 3:1 match).

Adam Jermyn 10 Nov 2024 21:09 UTC
13 points
6
in reply to: Milan W’s comment on: Personal AI Planning
Unless we build more land (either in the ocean or in space)?

Adam Jermyn 29 Oct 2024 0:14 UTC
8 points
4
in reply to: Ben Pace’s comment on: Dario Amodei — Machines of Loving Grace
There is Dario’s written testimony before Congress, which mentions existential risk as a serious possibility: https://www.judiciary.senate.gov/imo/media/doc/2023-07-26_-_testimony_-_amodei.pdf
He also signed the CAIS statement on x-risk: https://www.safe.ai/work/statement-on-ai-risk

Adam Jermyn 16 Oct 2024 0:11 UTC
9 points
7
in reply to: Ben Pace’s comment on: Dario Amodei — Machines of Loving Grace
He does start out by saying he thinks & worries a lot about the risks (first paragraph):

“I think and talk a lot about the risks of powerful AI. The company I’m the CEO of, Anthropic, does a lot of research on how to reduce these risks… I think that most people are underestimating just how radical the upside of AI could be, just as I think most people are underestimating how bad the risks could be.”

He then explains (second paragraph) that the essay is meant to sketch out what things could look like if things go well:

“In this essay I try to sketch out what that upside might look like—what a world with powerful AI might look like if everything goes right.”

I think this is a coherent thing to do?

Adam Jermyn 16 Oct 2024 0:06 UTC
13 points
0
in reply to: ryan_greenblatt’s comment on: Dario Amodei — Machines of Loving Grace
I get 1e7 using 16 bit-flips per bfloat16 operation, a 300 K operating temperature, and 312 Tflop/s (from Nvidia’s spec sheet). My guess is that this is a little high because a float multiplication involves more operations than just flipping 16 bits, but it’s the right order of magnitude.
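A rough reconstruction of this estimate, as a sketch only. It assumes the 1e7 figure is the ratio between a GPU's actual power draw and the Landauer-limit power (k_B T ln 2 per bit flip) for the same throughput. The 400 W device power is an assumption that does not appear in the comment, and the function and constant names are illustrative.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact SI value)


def landauer_power_watts(ops_per_second, bit_flips_per_op, temperature_kelvin):
    """Minimum power if every bit flip costs k_B * T * ln(2) joules."""
    energy_per_bit = K_B * temperature_kelvin * math.log(2)
    return ops_per_second * bit_flips_per_op * energy_per_bit


# Inputs quoted in the comment.
OPS_PER_SECOND = 312e12   # 312 Tflop/s
BIT_FLIPS_PER_OP = 16     # one flip per bit of a bfloat16 value
TEMPERATURE_K = 300.0     # operating temperature

# ASSUMPTION: device power draw, not stated in the comment.
ASSUMED_DEVICE_POWER_W = 400.0

minimum_power = landauer_power_watts(OPS_PER_SECOND, BIT_FLIPS_PER_OP, TEMPERATURE_K)
ratio = ASSUMED_DEVICE_POWER_W / minimum_power

print(f"Landauer-limit power: {minimum_power:.2e} W")
print(f"Assumed actual power / Landauer power: {ratio:.1e}")
```

Under these assumptions the Landauer-limit power comes out near 1.4e-5 W and the ratio lands at a few times 1e7, consistent with the order of magnitude quoted above. A different assumed power draw shifts the ratio proportionally.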

Adam Jermyn 15 Jun 2024 17:57 UTC
6 points
2
on: Yann LeCun: We only design machines that minimize costs [therefore they are safe]
Another objection is that you can minimize the wrong cost function. Making “cost” go to zero could mean making “the thing we actually care about” go to (negative huge number).

Adam Jermyn 1 Jun 2024 1:37 UTC
25 points
16
in reply to: TurnTrout’s comment on: MIRI 2024 Communications Strategy
One day a mathematician doesn’t know a thing. The next day they do. In between they made no observations with their senses of the world.

It’s possible to make progress through theoretical reasoning. It’s not my preferred approach to the problem (I work on a heavily empirical team at a heavily empirical lab) but it’s not an invalid approach.

Adam Jermyn 6 Dec 2023 4:51 UTC
2 points
0
in reply to: Raemon’s comment on: The LessWrong 2022 Review
I’m guessing that the sales numbers aren’t high enough to make $200k if sold at plausible markups?

Adam Jermyn 3 Dec 2023 3:38 UTC
7 points
0
in reply to: Sam Marks’s comment on: How useful is mechanistic interpretability?
In Towards Monosemanticity we also did a version of this experiment, and found that the SAE was much less interpretable when the transformer weights were randomized (https://transformer-circuits.pub/2023/monosemantic-features/index.html#appendix-automated-randomized).

Adam Jermyn 14 Oct 2023 12:43 UTC
LW: 7 AF: 4
4
AF
in reply to: Zvi’s comment on: RSPs are pauses done right
Anthropic’s RSP includes evals after every 4x increase in effective compute and after every 3 months, whichever comes sooner, even if this happens during training, and the policy says that these evaluations include fine-tuning.

Adam Jermyn 19 Jul 2023 23:05 UTC
LW: 16 AF: 9
3
AF
on: Alignment Grantmaking is Funding-Limited Right Now
This matches my impression. At EAG London I was really stunned (and heartened!) at how many skilled people are pivoting into interpretability from non-alignment fields.

Adam Jermyn 17 May 2023 8:01 UTC
LW: 3 AF: 2
0
AF
on: EIS IX: Interpretability and Adversaries

Second, the measure of “features per dimension” used by Elhage et al. (2022) might be misleading. See the paper for details of how they arrived at this quantity. But as shown in the figure above, “features per dimension” is defined as the Frobenius norm of the weight matrix before the layer divided by the number of neurons in the layer. But there is a simple sanity check that this doesn’t pass. In the case of a ReLU network without bias terms, multiplying a weight matrix by a constant factor will cause the “features per dimension” to be increased by that factor squared while leaving the activations in the forward pass unchanged up to linearity until a non-ReLU operation (like a softmax) is performed. And since each component of a softmax’s output is strictly increasing in that component of the input, scaling weight matrices will not affect the classification.

It’s worth noting that Elhage+2022 studied an autoencoder with tied weights and no softmax, so there isn’t actually freedom to rescale the weight matrix without affecting the loss in their model, making the scale of the weights meaningful. I agree that this measure doesn’t generalize to other models/tasks though.

They also define a more fine-grained measure (the dimensionality of each individual feature) in a way that is scale-invariant and which broadly agrees with their coarser measure...
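The sanity check in the quoted passage can be illustrated with a small sketch. It assumes a single bias-free ReLU layer with hand-picked weights, and takes the metric to be the squared Frobenius norm divided by the number of neurons (rows), which is what makes it grow with the scale factor squared. All names are hypothetical, and this is not the tied-weight autoencoder from Elhage+2022, where the weights cannot be rescaled freely.

```python
def relu_layer(weights, x):
    """Bias-free ReLU layer: max(0, W @ x), computed row by row."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in weights]


def squared_frobenius_per_neuron(weights):
    """Squared Frobenius norm of W divided by the number of rows (neurons)."""
    return sum(w * w for row in weights for w in row) / len(weights)


def scale(weights, c):
    """Multiply every weight by the constant c."""
    return [[c * w for w in row] for row in weights]


def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Hand-picked illustrative weights and input (assumptions, not from the paper).
W = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
x = [2.0, 1.0]
c = 3.0

base_out = relu_layer(W, x)
scaled_out = relu_layer(scale(W, c), x)

metric_ratio = squared_frobenius_per_neuron(scale(W, c)) / squared_frobenius_per_neuron(W)

print("outputs:", base_out, "->", scaled_out)
print("metric ratio:", metric_ratio, "vs c**2 =", c ** 2)
print("argmax unchanged:", argmax(base_out) == argmax(scaled_out))
```

For a positive scale factor the activations scale linearly and the argmax is unchanged, while the norm-based measure grows by c squared. That is the freedom the quoted objection relies on, and the freedom that tied weights and the absence of a softmax remove in the original setup.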

Conditioning Predictive Models: Open problems, Conclusion, and Appendix

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

10 Feb 2023 19:21 UTC

36 points

3 comments · 11 min read · LW link

Conditioning Predictive Models: Deployment strategy

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

9 Feb 2023 20:59 UTC

28 points

0 comments · 10 min read · LW link

Conditioning Predictive Models: Interactions with other approaches

evhub, Adam Jermyn, Johannes Treutlein, Rubi J. Hudson and kcwoolverton

8 Feb 2023 18:19 UTC

32 points

2 comments · 11 min read · LW link