TurnTrout

Karma: 17,742

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Steering GPT-2-XL by adding an activation vector

TurnTrout, Monte M, David Udell, lisathiergart and Ulisse Mini

13 May 2023 18:42 UTC

423 points

97 comments · 50 min read · LW link

Reward is not the optimization target

TurnTrout

25 Jul 2022 0:03 UTC

348 points

123 comments · 10 min read · LW link · 3 reviews

Lessons I’ve Learned from Self-Teaching

TurnTrout

23 Jan 2021 19:00 UTC

339 points

74 comments · 9 min read · LW link · 1 review

Looking back on my alignment PhD

TurnTrout

1 Jul 2022 3:19 UTC

318 points

63 comments · 11 min read · LW link

Understanding and controlling a maze-solving policy network

TurnTrout, peligrietzer, Ulisse Mini, Monte M and David Udell

11 Mar 2023 18:59 UTC

312 points

22 comments · 23 min read · LW link

AI presidents discuss AI alignment agendas

TurnTrout and Garrett Baker

9 Sep 2023 18:55 UTC

216 points

22 comments · 1 min read · LW link

(www.youtube.com)

Humans provide an untapped wealth of evidence about alignment

TurnTrout and Quintin Pope

14 Jul 2022 2:31 UTC

197 points

94 comments · 9 min read · LW link · 1 review

Do a cost-benefit analysis of your technology usage

TurnTrout

27 Mar 2022 23:09 UTC

191 points

53 comments · 13 min read · LW link

[April Fools’] Definitive confirmation of shard theory

TurnTrout

1 Apr 2023 7:27 UTC

166 points

7 comments · 2 min read · LW link

Parametrically retargetable decision-makers tend to seek power

TurnTrout

18 Feb 2023 18:41 UTC

166 points

9 comments · 2 min read · LW link

(arxiv.org)

Seeking Power is Often Convergently Instrumental in MDPs

TurnTrout and Logan Riggs

5 Dec 2019 2:33 UTC

162 points

39 comments · 17 min read · LW link · 2 reviews

(arxiv.org)

Many arguments for AI x-risk are wrong

TurnTrout

5 Mar 2024 2:31 UTC

153 points

76 comments · 12 min read · LW link

Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points

TurnTrout

10 Apr 2022 18:45 UTC

151 points

7 comments · 9 min read · LW link

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout

2 Dec 2022 2:43 UTC

139 points

22 comments · 47 min read · LW link · 3 reviews

Insights from Euclid’s ‘Elements’

TurnTrout

4 May 2020 15:45 UTC

126 points

17 comments · 4 min read · LW link

Think carefully before calling RL policies “agents”

TurnTrout

2 Jun 2023 3:46 UTC

124 points

35 comments · 4 min read · LW link

Transcript: “You Should Read HPMOR”

TurnTrout

2 Nov 2021 18:20 UTC

123 points

12 comments · 5 min read · LW link · 1 review

Problem relaxation as a tactic

TurnTrout

22 Apr 2020 23:44 UTC

119 points

8 comments · 7 min read · LW link

Predictions for shard theory mechanistic interpretability results

TurnTrout, Ulisse Mini and peligrietzer

1 Mar 2023 5:16 UTC

105 points

10 comments · 5 min read · LW link

TurnTrout 20 Jun 2022 16:45 UTC
103 points
44
in reply to: concernedcitizen64’s comment on: Where I agree and disagree with Eliezer
I don’t care what you think you’re saying—the primary operative takeaway for a large proportion of people, maybe everybody except recurring characters like Paul Christiano, is that even if their internal models say they have a solution, they should just shut up because they’re not you and can’t think correctly about these sorts of issues.
I think this is, unfortunately, true. One reason people might feel this way is that they view LessWrong posts through a social lens. Eliezer posts about how doomed alignment is and how stupid everyone else’s solution attempts are; that feels bad, you feel sheepish about disagreeing, etc.
But however understandable this reaction to the social dynamics is, the important part of the situation is not the social dynamics. It is about finding technical solutions to prevent utter ruination. When I notice the status-calculators in my brain starting to crunch and chew on Eliezer’s posts, I tell them to be quiet, that’s not important, who cares whether he thinks I’m a fool. I enter a frame in which Eliezer is a generator of claims and statements, and often those claims and statements are interesting and even true, so I do pay attention to that generator’s outputs, but it’s still up to me to evaluate those claims and statements, to think for myself.
If Eliezer says everyone’s ideas are awful, that’s another claim to be evaluated. If Eliezer says we are doomed, that’s another claim to be evaluated. The point is not to argue Eliezer into agreement, or to earn his respect. The point is to win in reality, and I’m not going to do that by constantly worrying about whether I should shut up.
If I’m wrong on an object-level point, I’m wrong, and I’ll change my mind, and then keep working. The rest is distraction.