
TurnTrout

Karma: 21,265

I don’t use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
127 points
45 comments, 4 min read

Optimizing The Final Output Can Obfuscate CoT (Research Note)

30 Jul 2025 21:26 UTC
195 points
22 comments, 6 min read

English writes numbers backwards

TurnTrout, 25 Jul 2025 23:00 UTC
8 points
23 comments, 12 min read
(turntrout.com)

We Built a Tool to Protect Your Dataset From Simple Scrapers

25 Jul 2025 5:44 UTC
55 points
9 comments, 3 min read

A Simple Explanation of AGI Risk

TurnTrout, 1 Jul 2025 16:18 UTC
66 points
4 comments, 5 min read
(turntrout.com)

Authors Have a Responsibility to Communicate Clearly

TurnTrout, 1 Jul 2025 15:41 UTC
125 points
29 comments, 6 min read
(turntrout.com)

Distillation Robustifies Unlearning

13 Jun 2025 13:45 UTC
232 points
43 comments, 8 min read
(arxiv.org)

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout, 2 Mar 2025 19:51 UTC
154 points
29 comments, 1 min read
(turntrout.com)

Steering Gemini with BiDPO

TurnTrout, 31 Jan 2025 2:37 UTC
104 points
5 comments, 1 min read
(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout, 24 Jan 2025 5:18 UTC
26 points
3 comments, 1 min read
(turntrout.com)

Deceptive Alignment and Homuncularity

16 Jan 2025 13:55 UTC
26 points
12 comments, 22 min read

Gaming TruthfulQA: Simple Heuristics Exposed Dataset Weaknesses

TurnTrout, 16 Jan 2025 2:14 UTC
64 points
3 comments, 1 min read
(turntrout.com)

Review: Breaking Free with Dr. Stone

TurnTrout, 18 Dec 2024 1:26 UTC
47 points
5 comments, 1 min read
(turntrout.com)

Gradient Routing: Masking Gradients to Localize Computation in Neural Networks

6 Dec 2024 22:19 UTC
169 points
12 comments, 11 min read
(arxiv.org)

Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models

3 Dec 2024 21:19 UTC
106 points
8 comments, 41 min read

Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake

TurnTrout, 19 Nov 2024 18:36 UTC
40 points
5 comments, 1 min read
(turntrout.com)

Announcing turntrout.com, my new digital home

TurnTrout, 17 Nov 2024 17:42 UTC
108 points
33 comments, 1 min read
(turntrout.com)

I found >800 orthogonal “write code” steering vectors

15 Jul 2024 19:06 UTC
104 points
19 comments, 7 min read
(jacobgw.com)

Mechanistically Eliciting Latent Behaviors in Language Models

30 Apr 2024 18:51 UTC
214 points
43 comments, 45 min read

Many arguments for AI x-risk are wrong

TurnTrout, 5 Mar 2024 2:31 UTC
168 points
87 comments, 12 min read