TurnTrout

Karma: 17,369

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com

Steering GPT-2-XL by adding an activation vector

13 May 2023 18:42 UTC
416 points
97 comments, 50 min read

Reward is not the optimization target

TurnTrout, 25 Jul 2022 0:03 UTC
347 points
122 comments, 10 min read, 3 reviews

Lessons I’ve Learned from Self-Teaching

TurnTrout, 23 Jan 2021 19:00 UTC
337 points
74 comments, 9 min read, 1 review

Looking back on my alignment PhD

TurnTrout, 1 Jul 2022 3:19 UTC
319 points
63 comments, 11 min read

Understanding and controlling a maze-solving policy network

11 Mar 2023 18:59 UTC
312 points
22 comments, 23 min read

AI presidents discuss AI alignment agendas

9 Sep 2023 18:55 UTC
213 points
22 comments, 1 min read
(www.youtube.com)

Humans provide an untapped wealth of evidence about alignment

14 Jul 2022 2:31 UTC
196 points
94 comments, 9 min read, 1 review

Do a cost-benefit analysis of your technology usage

TurnTrout, 27 Mar 2022 23:09 UTC
191 points
53 comments, 13 min read

[April Fools’] Definitive confirmation of shard theory

TurnTrout, 1 Apr 2023 7:27 UTC
166 points
7 comments, 2 min read

Parametrically retargetable decision-makers tend to seek power

TurnTrout, 18 Feb 2023 18:41 UTC
166 points
9 comments, 2 min read
(arxiv.org)

Seeking Power is Often Convergently Instrumental in MDPs

5 Dec 2019 2:33 UTC
161 points
39 comments, 17 min read, 2 reviews
(arxiv.org)

Many arguments for AI x-risk are wrong

TurnTrout, 5 Mar 2024 2:31 UTC
151 points
68 comments, 12 min read

Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points

TurnTrout, 10 Apr 2022 18:45 UTC
151 points
7 comments, 9 min read

Inner and outer alignment decompose one hard problem into two extremely hard problems

TurnTrout, 2 Dec 2022 2:43 UTC
139 points
22 comments, 47 min read, 3 reviews

Insights from Euclid’s ‘Elements’

TurnTrout, 4 May 2020 15:45 UTC
126 points
17 comments, 4 min read

Think carefully before calling RL policies “agents”

TurnTrout, 2 Jun 2023 3:46 UTC
124 points
35 comments, 4 min read

Transcript: “You Should Read HPMOR”

TurnTrout, 2 Nov 2021 18:20 UTC
122 points
12 comments, 5 min read, 1 review

Problem relaxation as a tactic

TurnTrout, 22 Apr 2020 23:44 UTC
118 points
8 comments, 7 min read

Predictions for shard theory mechanistic interpretability results

1 Mar 2023 5:16 UTC
105 points
10 comments, 5 min read