ojorgensen

Karma: 198

AI Safety Researcher, my website is here.

ojorgensen Nov 30, 2024, 12:21 PM
10 points
8
on: You should consider applying to PhDs (soon!)
Strong upvote!

One thing I’d emphasise is that there’s a pretty big overhead to submitting a single application (getting recommendation letters, writing a generic statement of purpose), but it doesn’t take much effort to apply to more after that (you can rejig your SOP quite easily to fit different universities). Given the application process is noisy and competitive, if you’re submitting one application you should probably submit loads if you can afford the application costs. Good luck to everyone applying! :))

ojorgensen Jan 28, 2024, 7:29 PM
1 point
0
in reply to: Chris_Leong’s comment on: Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic
Yeah I think we have the same understanding here (in hindsight I should have made this more explicit in the post / title).

I would be excited to see someone empirically try to answer the question you mention at the end. In particular, given some direction $u$ and a LayerNormed vector $v$, one might try to quantify how smoothly rotating from $v$ towards $u$ changes the output of the MLP layer. This seems like a good test of whether the Polytope Lens is helpful / necessary for understanding the MLPs of Transformers (with smooth changes corresponding to your ‘random jostling cancels out’ picture, and hence to not needing to worry about Polytope Lens style issues).
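A minimal sketch of what such an experiment could look like, under loud assumptions: the MLP here is a random toy ReLU network rather than a GPT-2 layer, `layer_norm` omits the learned scale and bias, and $u$ is itself LayerNormed so that the rotation stays on the set of LayerNorm outputs. All function names are hypothetical.

```python
import numpy as np

def layer_norm(x):
    # LayerNorm without learned scale/bias: centre, then scale to unit variance.
    x = x - x.mean()
    return x / np.sqrt((x ** 2).mean() + 1e-5)

def rotate_towards(v, u, t):
    # Spherical interpolation from v (t=0) towards u (t=1), keeping the norm of v.
    v_hat, u_hat = v / np.linalg.norm(v), u / np.linalg.norm(u)
    omega = np.arccos(np.clip(v_hat @ u_hat, -1.0, 1.0))
    if omega < 1e-8:
        return v.copy()
    w = (np.sin((1 - t) * omega) * v_hat + np.sin(t * omega) * u_hat) / np.sin(omega)
    return np.linalg.norm(v) * w

def mlp(x, W_in, W_out):
    # Toy one-hidden-layer ReLU MLP (an assumption; GPT-2 uses GELU).
    return np.maximum(x @ W_in, 0.0) @ W_out

def rotation_profile(v, u, W_in, W_out, steps=50):
    # MLP outputs along the rotation, plus the size of each output step.
    ts = np.linspace(0.0, 1.0, steps + 1)
    outs = np.stack([mlp(rotate_towards(v, u, t), W_in, W_out) for t in ts])
    step_sizes = np.linalg.norm(np.diff(outs, axis=0), axis=1)
    return ts, outs, step_sizes

rng = np.random.default_rng(0)
d, h = 64, 256
W_in = rng.normal(size=(d, h)) / np.sqrt(d)
W_out = rng.normal(size=(h, d)) / np.sqrt(h)
v = layer_norm(rng.normal(size=d))
u = layer_norm(rng.normal(size=d))

ts, outs, step_sizes = rotation_profile(v, u, W_in, W_out)
# Crude smoothness score: largest output step relative to the average step.
roughness = step_sizes.max() / step_sizes.mean()
```

A `roughness` close to 1 would suggest the output changes evenly along the rotation; large spikes would point towards polytope boundaries mattering. Which statistic is the right one is itself an open choice.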

ojorgensen Dec 11, 2023, 12:54 PM
16 points
7
on: Open Thread – Winter 2023/2024
It would save me a fair amount of time if all lesswrong posts had an “export BibTex citation” button, exactly like the feature on arxiv. This would be particularly useful for alignment forum posts!

ojorgensen Nov 10, 2023, 3:52 PM
1 point
0
on: Against Almost Every Theory of Impact of Interpretability
One central criticism in this post is its pessimism towards enumerative safety (i.e. finding all features in the model, or at least all important features). I would be interested to hear how the author / others have updated on the potential of enumerative safety in light of recent progress on dictionary learning, and on finding features which appear to correspond to high-level concepts like truth, utility and sycophancy. It seems clear that there should be some positive update here, but I would be interested in understanding which issues these approaches will not help to solve.

ojorgensen Aug 29, 2023, 12:02 PM
6 points
2
on: Some ML-Related Math I Now Understand Better
But this does not hold for tiny cosine similarities (e.g. 0.01 for $n = 12288$, which gives a lower bound of 2 using the formula above). I’m not aware of a lower bound better than $n$ for tiny angles.
Unless I’m misunderstanding, a better lower bound for almost orthogonal vectors when cosine similarity is approximately $0$ is just $n$, by taking an orthogonal basis for the space.

My guess for why the formula doesn’t give this is that it is derived by covering a sphere with non-intersecting spherical caps, which is sufficient for almost orthogonality but not necessary. This is also why the lower bound of $2$ vectors makes sense when we require cosine similarity to be approximately $0$, since then the only way you can fit two spherical caps onto the surface of a sphere is by dividing it into $2$ hemispheres.
This doesn’t change the headline result (there is still exponentially much room for almost orthogonal vectors), but the actual numbers might be substantially larger, since almost orthogonality is a weaker condition than spherical cap packing.
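A quick numerical illustration of the point, as a sketch only: the dimension `n = 256` and the choice of `2 * n` Gaussian vectors are illustrative assumptions, not numbers from the post. The orthogonal basis gives $n$ vectors with cosine similarity exactly $0$, and random vectors show that more than $n$ directions can still be pairwise close to orthogonal.

```python
import numpy as np

def max_abs_cosine(vectors):
    # Largest absolute cosine similarity between any two distinct rows.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    gram = unit @ unit.T
    np.fill_diagonal(gram, 0.0)  # ignore each vector's similarity with itself
    return np.abs(gram).max()

rng = np.random.default_rng(0)
n = 256

# n exactly orthogonal vectors: the standard basis.
basis = np.eye(n)
basis_max = max_abs_cosine(basis)

# 2n random Gaussian vectors: not exactly orthogonal, but pairwise close to it.
random_vectors = rng.normal(size=(2 * n, n))
random_max = max_abs_cosine(random_vectors)
```

Here `basis_max` is exactly zero, while `random_max` is small but nonzero, and how small it needs to be is what determines how many vectors fit.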

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen Aug 17, 2023, 1:53 PM

21 points

0 comments 14 min read LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen Jul 28, 2023, 7:43 PM

13 points

3 comments 13 min read LW link

ojorgensen Jul 17, 2023, 12:08 PM
2 points
0
on: Mech Interp Puzzle 1: Suspiciously Similar Embeddings in GPT-Neo
(Potential spoilers!)

There is some relevant literature which explores this phenomenon, also looking at the cosine similarity between words across layers of transformers. I think the most relevant is (Cai et al., 2021), where they also find this higher than expected cosine similarity between residual stream vectors in some layers for BERT, D-BERT, and GPT. (Note that they use somewhat confusing terminology: they define inter-type cosine similarity to be cosine similarity between embeddings of different tokens in the same input, and intra-type cosine similarity to be cosine similarity of the same token in different inputs. Inter-type cosine similarity is the one that is most relevant here.)

They find that the residual stream vectors for GPT-2 small tend to lie in two distinct clusters. Once you re-centre these clusters, the average cosine similarity between residual stream vectors falls to close to 0 throughout the layers of the model, as you might expect.
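The re-centring effect can be reproduced on synthetic data. This is a sketch under stated assumptions: the vectors below are made up (two Gaussian clusters sharing a large common offset), the cluster labels are taken as known, and nothing here comes from the paper's data or code.

```python
import numpy as np

def mean_pairwise_cosine(x):
    # Average cosine similarity over all pairs of distinct rows.
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    gram = unit @ unit.T
    n = len(x)
    return (gram.sum() - np.trace(gram)) / (n * (n - 1))

rng = np.random.default_rng(0)
d, per_cluster = 128, 200

# Two clusters whose centres share a large common component (assumed setup).
centre_a = 10.0 * rng.normal(size=d)
centre_b = centre_a + 3.0 * rng.normal(size=d)
cluster_a = centre_a + rng.normal(size=(per_cluster, d))
cluster_b = centre_b + rng.normal(size=(per_cluster, d))
x = np.vstack([cluster_a, cluster_b])
labels = np.repeat([0, 1], per_cluster)

before = mean_pairwise_cosine(x)

# Re-centre: subtract each cluster's own mean from its members.
recentred = x.copy()
for k in (0, 1):
    recentred[labels == k] -= x[labels == k].mean(axis=0)

after = mean_pairwise_cosine(recentred)
```

In this toy setup `before` is close to 1 because every vector is dominated by its cluster centre, and `after` is close to 0 once the centres are removed.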

UK Foundation Model Task Force—Expression of Interest

ojorgensen Jun 18, 2023, 9:43 AM

64 points

2 comments 1 min read LW link

(twitter.com)

ojorgensen May 4, 2023, 1:51 PM
1 point
0
on: ojorgensen’s Shortform
Problem: we want to make it hard for ML systems (trained via SGD) to perform naive gradient hacking. By naive gradient hacking, I mean “being able to keep some weights of the network constant for an arbitrary step of SGD”.
Solution: do “stochastic” regularisation, e.g. sample the amount of regularisation we perform randomly (could use quantum stuff if we want true randomness). This seems like it should make naive gradient hacking almost impossible: in order to keep some target weights unchanged, you’d have to match their +ve contribution to the loss to the degree of regularisation. If the degree of regularisation is stochastic, the loss contribution must also be stochastic, which is not possible with a deterministic NN!

The question of how useful this is for preventing any gradient hacking depends on how stable the loss landscape around some “deceptive / gradient hacking” minimum is. Seems possible the surrounding loss landscape could be pretty unstable to random perturbations?
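A minimal sketch of the idea, assuming plain SGD with an L2 penalty whose coefficient is drawn uniformly at random each step. The function names, the uniform distribution, and the numbers are all illustrative assumptions.

```python
import numpy as np

def sample_lambda(rng, max_lambda=0.1):
    # Regularisation strength drawn fresh for each SGD step.
    return rng.uniform(0.0, max_lambda)

def sgd_step(w, task_grad, lr, lam):
    # One SGD step on the loss: task_loss + (lam / 2) * ||w||^2.
    return w - lr * (task_grad + lam * w)

def freezing_grad(w, lam):
    # Task gradient that would exactly cancel the L2 term for a given lam.
    return -lam * w

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5])
lr = 0.1

# A deterministic network has to commit to one gradient, i.e. one guess for lam.
lam_guess = 0.05
committed_grad = freezing_grad(w, lam_guess)

# If the guess happens to be right, the weights stay put...
frozen = sgd_step(w, committed_grad, lr, lam_guess)

# ...but with a sampled lam the same committed gradient no longer cancels.
lam = sample_lambda(rng)
moved = sgd_step(w, committed_grad, lr, lam)
```

The weights in `moved` drift by `lr * (lam - lam_guess) * w`, so holding them fixed would require knowing the sampled value in advance.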

ojorgensen’s Shortform

ojorgensen May 4, 2023, 1:51 PM

2 points

1 comment LW link

ojorgensen Apr 4, 2023, 8:01 PM
5 points
1
on: Excessive AI growth-rate yields little socio-economic benefit.
Just a nit-pick, but to me “AI growth-rate” suggests economic growth due to progress in AI, as opposed to simply technical progress in AI. I think “Excessive AI progress yields little socio-economic benefit” would make the argument more immediately clear.

ojorgensen Mar 9, 2023, 11:00 AM
LW: 1 AF: 1
0
AF
on: EIS XI: Moving Forward
Rando et al. (2022)
This link is broken btw!

ojorgensen Mar 8, 2023, 6:21 PM
5 points
0
in reply to: Ben Pace’s comment on: Article about abuse in LessWrong and rationalist communities in Bloomberg News
Didn’t get that impression from your previous comment, but this seems like a good strategy!

ojorgensen Mar 8, 2023, 4:44 PM
12 points
7
in reply to: Ben Pace’s comment on: Article about abuse in LessWrong and rationalist communities in Bloomberg News
This seems like a bad rule of thumb. If your social circle is largely comprised of people who have chosen to remain within the community, ignoring information from “outsiders” seems like a bad strategy for understanding issues with the community.

ojorgensen Feb 15, 2023, 4:58 PM
3 points
2
in reply to: ChristianKl’s comment on: Bing Chat is blatantly, aggressively misaligned
Even if OpenAI don’t have the option to stop Bing Chat being released now, this would surely have been discussed during investment negotiations. It seems very unlikely this is being released without approval from decision-makers at OpenAI in the last month or so. If they somehow didn’t foresee that something could go wrong and had no mitigations in place in case Bing Chat started going weird, that’s pretty terrible planning.

ojorgensen Feb 1, 2023, 5:51 PM
LW: 7 AF: 4
1
AF
on: Causal Scrubbing: a method for rigorously testing interpretability hypotheses [Redwood Research]
This seems very similar to work that has recently come out of the Stanford AI Lab, linked to here.

ojorgensen Jan 25, 2023, 8:28 AM
2 points
0
on: Gradient hacking is extremely difficult
Great post! This helps to clarify and extend lots of fuzzy intuitions I had around gradient hacking, so thanks! If anyone is interested in a different perspective / set of intuitions for how some properties of gradient descent affect gradient hacking, I wrote a small post about this here: https://www.lesswrong.com/posts/Nnb5AqcunBwAZ4zac/extremely-naive-gradient-hacking-doesn-t-work

I’d expect this to mainly be of use if the properties of gradient descent labelled 1, 4, 5 were not immediately obvious to you.

ojorgensen Dec 29, 2022, 8:24 PM
2 points
0
in reply to: jacquesthibs’s comment on: Disagreements about Alignment: Why, and how, we should try to solve them
Hey! Not currently working on anything related to this, but would be excited to read anything related to this you are writing :))

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen Dec 20, 2022, 2:35 PM

17 points

0 comments 6 min read LW link
