AI Safety Researcher. My website is here.
ojorgensen
UK Foundation Model Task Force—Expression of Interest
Understanding Counterbalanced Subtractions for Better Activation Additions
Strange Loops—Self-Reference from Number Theory to AI
(Extremely) Naive Gradient Hacking Doesn’t Work
Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic
Disagreements about Alignment: Why, and how, we should try to solve them
This seems like a bad rule of thumb. If your social circle is largely composed of people who have chosen to remain within the community, ignoring information from “outsiders” is a poor strategy for understanding issues with the community.
This seems very similar to recent work out of the Stanford AI Lab, linked here.
But this does not hold for tiny cosine similarities (e.g. $0.01$ for $\epsilon$, which gives a lower bound of 2 using the formula above). I’m not aware of a lower bound better than $d$ for tiny angles.
Unless I’m misunderstanding, a better lower bound for almost orthogonal vectors when the cosine similarity is approximately $0$ is just $d$, by taking an orthonormal basis for the space.
My guess for why the formula doesn’t give this is that it is derived by covering a sphere with non-intersecting spherical caps, which is sufficient for almost orthogonality but not necessary. This is also why the lower bound of 2 vectors makes sense when we require the cosine similarity to be approximately $0$: the only way to fit two such spherical caps onto the surface of a sphere is by dividing it into hemispheres. This doesn’t change the headline result (there is still exponentially much room for almost orthogonal vectors), but the actual numbers might be substantially larger, since almost orthogonality is a weaker condition than spherical cap packing.
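For intuition, here is a minimal numpy sketch of the gap between the two conditions (the dimension and vector counts are arbitrary illustrative choices): an orthonormal basis gives exactly $d$ pairwise-orthogonal vectors, while far more than $d$ random unit vectors are already almost orthogonal at moderate $\epsilon$.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # ambient dimension (illustrative choice)

# An orthonormal basis achieves d vectors with pairwise cosine similarity exactly 0,
# so d is always attainable once we only require |cos sim| <= epsilon.
basis = np.eye(d)
print(f"orthonormal basis: {d} vectors, max off-diagonal |cos sim| = "
      f"{np.abs(basis @ basis.T - np.eye(d)).max():.2f}")

# Far more than d random unit vectors are already pairwise almost orthogonal
# at moderate epsilon (though not at tiny epsilon, which is the point above).
n = 2048
vecs = rng.standard_normal((n, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
sims = vecs @ vecs.T
max_sim = np.abs(sims[~np.eye(n, dtype=bool)]).max()
print(f"{n} random vectors in R^{d}: max pairwise |cos sim| ≈ {max_sim:.2f}")
```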
Just a nit-pick, but to me “AI growth-rate” suggests economic growth due to progress in AI, as opposed to simply technical progress in AI. I think “Excessive AI progress yields little socio-economic benefit” would make the argument more immediately clear.
Didn’t get that impression from your previous comment, but this seems like a good strategy!
[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?
Evaluating OpenAI’s alignment plans using training stories
Even if OpenAI don’t have the option to stop Bing Chat from being released now, this would surely have been discussed during investment negotiations. It seems very unlikely this is being released without approval from decision-makers at OpenAI in the last month or so. If they somehow didn’t foresee that something could go wrong and had no mitigations in place in case Bing Chat started going weird, that’s pretty terrible planning.
I went through the paper for a reading group the other day, and I think the video really helped me to understand what is going on in the paper. The parts I found most useful were the indications of which parts of the paper / maths were most important to understand, and which were not (tensor products).
I had made some effort to read the paper before with little success, but now feel like I understand the overall results of the paper pretty well. I’m very positive about this video, and about similar things being made in the future!
Personal context: I also found the intro to IB video series similarly useful. I’m an AI masters student who has some pre-existing knowledge about AI alignment. I have a maths background.
I found this post really interesting, thanks for sharing it!
It doesn’t seem obvious to me that the methods for understanding a model that suit a high path-dependence world become significantly less useful if we are in a low path-dependence world. I can see why low path-dependence would give us the opportunity to use different methods of analysis, but I don’t see why the high path-dependence ones would no longer be useful.
For example, here is the reasoning behind “how likely is deceptive alignment” in a high path-dependence world (quoted from the slide):
We start with a proxy-aligned model
In early training, SGD jointly focuses on improving the model’s understanding of the world along with improving its proxies
The model learns about the training process from its input data
SGD makes the model’s proxies into more long-term goals, resulting in it instrumentally optimizing for the training objective for the purposes of staying around
The model’s proxies “crystallize”, as they are no longer relevant to performance, and we reach an equilibrium
Let’s suppose that this reasoning, and the associated justification of why this is likely to arise due to SGD seeking the largest possible marginal performance improvements, are sound for a high path-dependence world. Why does it no longer hold in a low path-dependence world?
(Potential spoilers!)
There is some relevant literature which explores this phenomenon, also looking at the cosine similarity between words across layers of transformers. I think the most relevant is (Cai et al., 2021), where they also find this higher-than-expected cosine similarity between residual stream vectors in some layer for BERT, D-BERT, and GPT. (Note that they use somewhat confusing terminology: they define inter-type cosine similarity to be the cosine similarity between embeddings of different tokens in the same input, and intra-type cosine similarity to be the cosine similarity of the same token in different inputs. Inter-type cosine similarity is the one that is most relevant here.)
They find that the residual stream vectors for GPT-2 small tend to lie in two distinct clusters. Once you re-centre these clusters, the average cosine similarity between residual stream vectors falls to close to 0 throughout the layers of the model, as you might expect.
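As a toy illustration of that re-centring effect, here is a numpy sketch with synthetic activations standing in for the residual stream (a single shared-mean cluster, rather than the two clusters they actually find for GPT-2 small): a large common mean component inflates the average cosine similarity, and subtracting it brings the average back to roughly 0.

```python
import numpy as np

def mean_pairwise_cos_sim(X: np.ndarray) -> float:
    """Mean cosine similarity over all distinct pairs of rows of X."""
    U = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = U @ U.T
    n = len(U)
    return (S.sum() - n) / (n * (n - 1))  # drop the diagonal of ones

# Synthetic stand-in for residual stream activations at one layer: isotropic
# noise plus a large shared mean component, mimicking the anisotropy above.
rng = np.random.default_rng(0)
n, d = 500, 768
acts = rng.standard_normal((n, d)) + 5.0 * rng.standard_normal(d)

print(f"raw:        {mean_pairwise_cos_sim(acts):.3f}")                      # well above 0
print(f"re-centred: {mean_pairwise_cos_sim(acts - acts.mean(axis=0)):.3f}")  # close to 0
```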
Great post! This helps to clarify and extend lots of fuzzy intuitions I had around gradient hacking, so thanks! If anyone is interested in a different perspective / set of intuitions for how some properties of gradient descent affect gradient hacking, I wrote a small post about this here: https://www.lesswrong.com/posts/Nnb5AqcunBwAZ4zac/extremely-naive-gradient-hacking-doesn-t-work
I’d expect this to mainly be of use if the properties of gradient descent labelled 1, 4, 5 were not immediately obvious to you.
Hey! Not currently working on anything related to this, but I’d be excited to read anything you’re writing about it :))
It would save me a fair amount of time if all LessWrong posts had an “export BibTeX citation” button, exactly like the feature on arXiv. This would be particularly useful for Alignment Forum posts!
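For concreteness, such a button might emit something like the following hand-written entry for one of my posts (the citation key, year, and URL below are invented placeholders, not real metadata):

```bibtex
@misc{ojorgensen-layernorm,
  author       = {ojorgensen},
  title        = {Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic},
  howpublished = {LessWrong},
  year         = {2023},
  url          = {https://www.lesswrong.com/posts/...}
}
```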