TurnTrout

Karma: 22,683

I don’t use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

punctilio: the best text prettifier

TurnTrout

11 Feb 2026 4:49 UTC

24 points

0 comments · 5 min read · LW link

(github.com)

No instrumental convergence without AI psychology

TurnTrout

20 Jan 2026 22:16 UTC

68 points

7 comments · 6 min read · LW link

(turntrout.com)

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

TurnTrout and cloud

26 Dec 2025 17:20 UTC

41 points

0 comments · 2 min read · LW link

(turntrout.com)

2025-Era “Reward Hacking” Does Not Show that Reward Is the Optimization Target

TurnTrout

19 Dec 2025 6:09 UTC

46 points

9 comments · 7 min read · LW link

(turntrout.com)

Automatic alt text generation

TurnTrout

22 Nov 2025 17:57 UTC

27 points

1 comment · 1 min read · LW link

(turntrout.com)

[Paper] Output Supervision Can Obfuscate the CoT

jacob_drori, lukemarks, cloud and TurnTrout

20 Nov 2025 22:41 UTC

78 points

3 comments · 5 min read · LW link

(arxiv.org)

GDM: Consistency Training Helps Limit Sycophancy and Jailbreaks in Gemini 2.5 Flash

TurnTrout and Rohin Shah

4 Nov 2025 16:25 UTC

53 points

2 comments · 6 min read · LW link

(arxiv.org)

An Opinionated Guide to Privacy Despite Authoritarianism

TurnTrout

29 Oct 2025 20:32 UTC

181 points

31 comments · 4 min read · LW link

(turntrout.com)

Recontextualization Mitigates Specification Gaming Without Modifying the Specification

ariana_azarbal, Victor Gillioz, TurnTrout and cloud

14 Oct 2025 0:53 UTC

144 points

15 comments · 10 min read · LW link

Training a Reward Hacker Despite Perfect Labels

ariana_azarbal, Victor Gillioz and TurnTrout

14 Aug 2025 23:57 UTC

139 points

47 comments · 4 min read · LW link

[Research Note] Optimizing The Final Output Can Obfuscate CoT

lukemarks, jacob_drori, cloud and TurnTrout

30 Jul 2025 21:26 UTC

201 points

23 comments · 6 min read · LW link

English writes numbers backwards

TurnTrout

25 Jul 2025 23:00 UTC

14 points

23 comments · 12 min read · LW link

(turntrout.com)

We Built a Tool to Protect Your Dataset From Simple Scrapers

TurnTrout, Edward Turner, Dipika Khullar and Roy Rinberg

25 Jul 2025 5:44 UTC

65 points

9 comments · 3 min read · LW link

A Simple Explanation of AGI Risk

TurnTrout

1 Jul 2025 16:18 UTC

58 points

4 comments · 5 min read · LW link

(turntrout.com)

Authors Have a Responsibility to Communicate Clearly

TurnTrout

1 Jul 2025 15:41 UTC

125 points

29 comments · 6 min read · LW link

(turntrout.com)

Distillation Robustifies Unlearning

Bruce W. Lee, Addie Foote, alexinf, leni, Jacob G-W, Harish Kamath, Bryce Woodworth, cloud and TurnTrout

13 Jun 2025 13:45 UTC

236 points

43 comments · 8 min read · LW link

(arxiv.org)

Self-fulfilling misalignment data might be poisoning our AI models

TurnTrout

2 Mar 2025 19:51 UTC

162 points

29 comments · 1 min read · LW link

(turntrout.com)

Steering Gemini with BiDPO

TurnTrout

31 Jan 2025 2:37 UTC

104 points

5 comments · 1 min read · LW link

(turntrout.com)

Insights from “The Manga Guide to Physiology”

TurnTrout

24 Jan 2025 5:18 UTC

27 points

3 comments · 1 min read · LW link

(turntrout.com)

Deceptive Alignment and Homuncularity

Oliver Sourbut and TurnTrout

16 Jan 2025 13:55 UTC

26 points

12 comments · 22 min read · LW link