ojorgensen

Karma: 180

AI Safety Researcher. My website is here.

Understanding Counterbalanced Subtractions for Better Activation Additions

ojorgensen · 17 Aug 2023 13:53 UTC
21 points
0 comments · 14 min read · LW link

Because of LayerNorm, Directions in GPT-2 MLP Layers are Monosemantic

ojorgensen · 28 Jul 2023 19:43 UTC
13 points
3 comments · 13 min read · LW link

UK Foundation Model Task Force—Expression of Interest

ojorgensen · 18 Jun 2023 9:43 UTC
64 points
2 comments · 1 min read · LW link
(twitter.com)

ojorgensen’s Shortform

ojorgensen · 4 May 2023 13:51 UTC
2 points
1 comment · 1 min read · LW link

(Extremely) Naive Gradient Hacking Doesn’t Work

ojorgensen · 20 Dec 2022 14:35 UTC
14 points
0 comments · 6 min read · LW link

[Question] Which Issues in Conceptual Alignment have been Formalised or Observed (or not)?

ojorgensen · 1 Nov 2022 22:32 UTC
4 points
0 comments · 1 min read · LW link

Strange Loops—Self-Reference from Number Theory to AI

ojorgensen · 28 Sep 2022 14:10 UTC
15 points
6 comments · 18 min read · LW link

Evaluating OpenAI’s alignment plans using training stories

ojorgensen · 25 Aug 2022 16:12 UTC
4 points
0 comments · 5 min read · LW link

Disagreements about Alignment: Why, and how, we should try to solve them

ojorgensen · 9 Aug 2022 18:49 UTC
11 points
2 comments · 16 min read · LW link