TurnTrout

Karma: 22,676

I don’t use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

punctilio: the best text prettifier

TurnTrout, 11 Feb 2026 4:49 UTC
24 points
0 comments, 5 min read, LW link
(github.com)

No instrumental convergence without AI psychology

TurnTrout, 20 Jan 2026 22:16 UTC
68 points
7 comments, 6 min read, LW link
(turntrout.com)

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

26 Dec 2025 17:20 UTC
41 points
0 comments, 2 min read, LW link
(turntrout.com)

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout, 19 Dec 2025 6:09 UTC
46 points
9 comments, 7 min read, LW link
(turntrout.com)

Automatic alt text generation

TurnTrout, 22 Nov 2025 17:57 UTC
27 points
1 comment, 1 min read, LW link
(turntrout.com)

[Paper] Output Supervision Can Obfuscate the CoT

20 Nov 2025 22:41 UTC
78 points
3 comments, 5 min read, LW link
(arxiv.org)

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

4 Nov 2025 16:25 UTC
53 points
2 comments, 6 min read, LW link
(arxiv.org)

An Opinionated Guide to Privacy Despite Authoritarianism

TurnTrout, 29 Oct 2025 20:32 UTC
181 points
30 comments, 4 min read, LW link
(turntrout.com)

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

14 Oct 2025 0:53 UTC
142 points
15 comments, 10 min read, LW link

Training a Reward Hacker Despite Perfect Labels

14 Aug 2025 23:57 UTC
139 points
45 comments, 4 min read, LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

30 Jul 2025 21:26 UTC
201 points
23 comments, 6 min read, LW link

English writes numbers backwards

TurnTrout, 25 Jul 2025 23:00 UTC
14 points
23 comments, 12 min read, LW link
(turntrout.com)

We Built a Tool to Protect Your Dataset From Simple Scrapers

25 Jul 2025 5:44 UTC
65 points
9 comments, 3 min read, LW link

A Simple Explanation of AGI Risk

TurnTrout, 1 Jul 2025 16:18 UTC
58 points
4 comments, 5 min read, LW link
(turntrout.com)

Authors Have a Responsibility to Communicate Clearly

TurnTrout, 1 Jul 2025 15:41 UTC
125 points
29 comments, 6 min read, LW link
(turntrout.com)

Distillation Robustifies Unlearning

13 Jun 2025 13:45 UTC
236 points
43 comments, 8 min read, LW link
(arxiv.org)

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout, 2 Mar 2025 19:51 UTC
162 points
29 comments, 1 min read, LW link
(turntrout.com)

Steering Gemini with BiDPO

TurnTrout, 31 Jan 2025 2:37 UTC
104 points
5 comments, 1 min read, LW link
(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout, 24 Jan 2025 5:18 UTC
27 points
3 comments, 1 min read, LW link
(turntrout.com)

Deceptive Alignment and Homuncularity

16 Jan 2025 13:55 UTC
26 points
12 comments, 22 min read, LW link