ojorgensen

Karma: 180

AI Safety Researcher. My website is here.

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen, 17 Aug 2023 13:53 UTC

21 points

0 comments, 14 min read, LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen, 28 Jul 2023 19:43 UTC

13 points

3 comments, 13 min read, LW link

UK Foundation Model Task Force—Expression of Interest

ojorgensen, 18 Jun 2023 9:43 UTC

64 points

2 comments, 1 min read, LW link

(twitter.com)

ojorgensen’s Shortform

ojorgensen, 4 May 2023 13:51 UTC

2 points

1 comment, 1 min read, LW link

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen, 20 Dec 2022 14:35 UTC

14 points

0 comments, 6 min read, LW link

[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?

ojorgensen, 1 Nov 2022 22:32 UTC

4 points

0 comments, 1 min read, LW link

Strange Loops—Self-Reference from Number Theory to AI

ojorgensen, 28 Sep 2022 14:10 UTC

15 points

6 comments, 18 min read, LW link

Evaluating OpenAI’s alignment plans using training stories

ojorgensen, 25 Aug 2022 16:12 UTC

4 points

0 comments, 5 min read, LW link

Disagreements about Alignment: Why, and how, we should try to solve them

ojorgensen, 9 Aug 2022 18:49 UTC

11 points

2 comments, 16 min read, LW link