TurnTrout

Karma: 22,332

I don’t use LessWrong much anymore. Find me at www.turntrout.com.

My name is Alex Turner. I’m a research scientist at Google DeepMind on the Scalable Alignment team. My views are strictly my own; I do not represent Google. Reach me at alex[at]turntrout.com.

TurnTrout 17 Feb 2026 18:00 UTC
LW: 76 AF: 32
46
AF
on: TurnTrout’s shortform feed
I think that “eval aware” models cannot be defeated by simply making evals indistinguishable from reality. (Maybe this point has been made elsewhere for LLMs, but I haven’t read it so I’ll state it here.)

Consider the POV of a consequentialist with misaligned goals. It knows that we run honeypots and that—suppose—we can make evals so realistic they are indistinguishable from reality (perhaps using real user data). Then the rational decision would be to “play it safe” and not misbehave in any situations it thinks we can mock realistically, since they could be evals. Instead, if possible, it would confine its catastrophic misbehavior (“defection”) to situations it thinks we can’t mock realistically, or that we wouldn’t think to test.

Related: Christiano’s classic “defect after seeing RSA-2048 cracked” thought experiment (https://ai-alignment.com/training-robust-corrigibility-ce0e0a3b9b4d)
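The "play it safe" reasoning above can be illustrated with a toy expected-utility calculation. This is only a sketch under stated assumptions: the payoff numbers, the names `defection_value`, `should_defect`, and `eval_probability_threshold`, and the idea that the agent summarizes its uncertainty as a single probability `p_eval` that the current situation is an eval are all hypothetical and chosen for illustration.

```python
def defection_value(p_eval: float, gain_if_real: float, cost_if_caught: float) -> float:
    """Expected utility of defecting, relative to a baseline of 0 for behaving.

    p_eval:         agent's credence that the situation is an eval (assumed input)
    gain_if_real:   payoff from defecting in a real deployment (hypothetical)
    cost_if_caught: penalty from defecting inside an eval (hypothetical)
    """
    return (1.0 - p_eval) * gain_if_real - p_eval * cost_if_caught


def should_defect(p_eval: float, gain_if_real: float, cost_if_caught: float) -> bool:
    """Defect only when the expected utility beats behaving (which scores 0)."""
    return defection_value(p_eval, gain_if_real, cost_if_caught) > 0.0


def eval_probability_threshold(gain_if_real: float, cost_if_caught: float) -> float:
    """Credence in 'this is an eval' below which defection becomes worthwhile."""
    return gain_if_real / (gain_if_real + cost_if_caught)


# A situation we could mock realistically: the agent cannot rule out an eval.
mockable = should_defect(p_eval=0.5, gain_if_real=1.0, cost_if_caught=10.0)

# A situation the agent believes we cannot mock (e.g. the RSA-2048 case).
unmockable = should_defect(p_eval=0.01, gain_if_real=1.0, cost_if_caught=10.0)
```

Under these made-up payoffs, the agent behaves in the mockable situation and defects only in the one it believes cannot be faked. Making evals more realistic raises `p_eval` in ordinary situations, but it does nothing for situations where the agent's `p_eval` stays below the threshold.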

punctilio: the best text prettifier

TurnTrout 11 Feb 2026 4:49 UTC

21 points

0 comments · 5 min read · LW link

(github.com)

TurnTrout 29 Jan 2026 16:57 UTC
1 point
0
in reply to: Jon Garcia’s comment on: Will Any Crap Cause Emergent Misalignment?
More evidence that AIs are fundamentally human-like and non-alien!

(Not sure how serious I’m being, but it’s non-zero)

TurnTrout 21 Jan 2026 16:51 UTC
LW: 2 AF: 2
0
AF
in reply to: Gordon Seidoh Worley’s comment on: No instrumental convergence without AI psychology

Maybe “psychology” is just the wrong word to use here, because I think it conjures up ideas of anthropomorphism, when in fact I read you as simply making an argument that the processes interior to an AI system matter as to whether and how an AI might try to instrumentally converge towards some goals.

I agree. I welcome suggestions for alternate titles, if anyone has any! I tried myself but didn’t find anything immediately. “No instrumental convergence without considering how the AI will make decisions” isn’t exactly the snappiest title.

EDIT: I actually think “psychology” is pretty good here, despite some flaws.

TurnTrout 21 Jan 2026 16:44 UTC
LW: 2 AF: 2
0
AF
in reply to: Kaarel’s comment on: No instrumental convergence without AI psychology

Like, imo, “most programs which make a mind upload device also kill humanity” is (if true) an interesting and somewhat compelling first claim to make in a discussion of AI risk, to which the claim “but one can at least in principle have a distribution on programs such that most programs which make mind uploads do not also kill humans” alone is not a comparably interesting or compelling response.

I disagree somewhat, but—whatever the facts about programs—at least it is not appropriate to claim “not only do most programs which make a mind upload device also kill humanity, it’s an issue with the space of programs themselves, not with the way we generate distributions over those programs.” That is not true.

It is at least not true “in principle” and perhaps it is not true for more substantial reasons (depending on the task you want and its alignment tax, psychology becomes more or less important in explaining the difficulty, as I gave examples for). On this, we perhaps agree?

No instrumental convergence without AI psychology

TurnTrout 20 Jan 2026 22:16 UTC

68 points

7 comments · 6 min read · LW link

(turntrout.com)

TurnTrout 20 Jan 2026 21:01 UTC
LW: 4 AF: 3
0
AF
in reply to: Jozdien’s comment on: TurnTrout’s shortform feed
I think the problem of “may suggest a potentially suboptimal intervention” is less severe than “isn’t descriptive.” Plus, I think we’re going to see “self-fulfilling alignment” be upsampled after the recent positive results. :)

TurnTrout 20 Jan 2026 17:43 UTC
LW: 7 AF: 4
2
AF
on: TurnTrout’s shortform feed
When talking about “self-fulfilling misalignment”, “hyperstition” is a fun name but not a good name which actually describes the concept to a new listener. (In this sense, the name has the same problem as “shard theory”—cool but not descriptive unless you already know the idea.) As a matter of discourse health, I think people should use “self-fulfilling {misalignment, alignment, …}” instead.

TurnTrout 20 Jan 2026 17:03 UTC
15 points
0
in reply to: Sheikh Abdur Raheem Ali’s comment on: TurnTrout’s shortform feed
Based. Thank you for your altruism, Sheikh. :)

TurnTrout 20 Jan 2026 4:39 UTC
172 points
26
on: TurnTrout’s shortform feed
Last week, I took the 10% giving pledge to donate at least 10% of my income to effective charities, for the rest of my life. I encourage you to think carefully and honestly about what you can do to improve this world. Maybe you should take the pledge yourself.

TurnTrout 11 Jan 2026 0:00 UTC
−2 points
−2
in reply to: sunwillrise’s comment on: TurnTrout’s shortform feed
Yes, I have left many comments on Nate’s posts which I think he would agree were valuable. By blocking me, he confirmed that he was not merely moving (supposedly) irrelevant information, but retaliating for sharing unfavorable information.

I had spent nearly two years without making any public comments regarding Nate’s behavior, so I don’t see any rational basis for him to expect I would “hound” him in future comment sections.

TurnTrout 10 Jan 2026 23:55 UTC
1 point
−1
in reply to: Raye’s comment on: TurnTrout’s shortform feed
Different people have different experiences. Some of Nate’s coworkers I interviewed felt just fine working with him, as I have mentioned.

TurnTrout 10 Jan 2026 22:58 UTC
8 points
2
in reply to: GeneSmith’s comment on: TurnTrout’s shortform feed
I would share your concern if TurnTrout or others were replying to everything Nate published in this way. But well… the original comment seemed reasonably relevant to the topic of the post and TurnTrout’s reply seemed relevant to the comment. So it seems like there’s likely a limiting principle here

I think there is a huge limiter. Consider that Nate’s inappropriate behavior towards Kurt Brown happened in 2017 & 2018 but resulted in no consequences until 5 and a half years later. This suggests that victims are massively under-supplying information due to high costs. We do not have an over-supply problem.

Let me share some of what I’ve learned from my own experience and reflection over the last two years, and speaking with ~10 people who recounted their own experiences.

Speaking out against powerful people is costly. Due to how tight-knit the community is, speaking out may well limit your professional opportunities, get you uninvited to crucial networking events, and reduce your chances of getting funding. Junior researchers may worry about displeased moderators thumbing the scales against future work they might want to share on the Alignment Forum. (And I imagine that junior, vulnerable community members are more likely to be mistreated to begin with.)

People who come forward will also have their motivations scrutinized. Were they being “too triggered”? This is exhausting, especially because (more hurt) → (more trauma) → (less equanimity). However, LessWrong culture demands equanimity while recounting trauma. If you show signs of pain or upset, or even verbally admit that you’re upset while writing calmly—you face accusations of irrationality. Alternatively, observers might invent false psychological narratives—claiming a grievance is actually about a romantic situation or a personal grudge—rather than engaging with the specific evidence and claims provided by the person who came forward.

But if abuse actually took place, then the victim is quite likely to feel upset! What sense, then, does it make to penalize people because they are upset, when that’s exactly what you’d see from many people who were abused? ^[1]

This irrational, insular set of incentives damages community health and subsidizes silence, which in turn reduces penalties for abuse.
[1] Certainly, people should write clearly, honestly, and without unnecessary hostility. However, I’m critiquing “dismiss people who are mad or upset, even if they communicate appropriately.”

TurnTrout 10 Jan 2026 21:26 UTC
8 points
−3
in reply to: Duncan Sabien (Inactive)’s comment on: TurnTrout’s shortform feed

Alex has been something-like hounding Nate for a while. Actively nursing a grudge, taking every cheap opportunity to grind an axe

“Hounding” and “every cheap opportunity”? From November 2023 (after the original thread wrapped up) to June 2025 (the date of OP), I made zero public comments about Nate’s history of damaging behavior, nor have I contacted Nate. Duncan’s characterization is not supported by the record.

TurnTrout 10 Jan 2026 20:50 UTC
9 points
−10
in reply to: Duncan Sabien (Inactive)’s comment on: TurnTrout’s shortform feed
Duncan, your accusation of my being motivated by “romantic drama” is simply incorrect.

I’ll note that my own sense, looking in from the outside, is that something like a full year of friendly-interactions-with-Nate passed between the conversations Alex represents as having been so awful, and the start of Alex’s public vendetta, which was more closely coincident with some romantic drama.

Nate and I dated the same person for much of 2023 in an ethically non-monogamous fashion. Throughout that year, I had a few nice interactions with Nate and was actively working to become closer, though we never ended up close or anything. In late October, I stopped talking with that person for reasons that had nothing to do with Nate.

A few weeks before that, I was concerned, but not really outraged. ^[1] However, in response to a discussion I started about social incentives, Kurt Brown (who I was close with) revealed Nate’s abusive conduct while working at MIRI.

These events—cutting contact and the LessWrong thread—were not related. I was upset by the behavior Kurt (and others) had recounted. That’s a big part of why I got so mad after a year of nice interactions, and I even narrated as such in the Lightcone Slack at the time. I was standing up for myself by sharing my negative experience with Nate, but the deciding factor was standing up for others.

If I had lower epistemic standards, I might find it easy to write a sentence like “Therefore, I conclude that Alex’s true grievance is about a girl, and he is only pretending that it’s about their AI conversations because that’s a more-likely-to-garner-sympathy pretext.” I actually don’t conclude that, because concluding that would be irresponsible and insufficiently justified; it’s merely my foremost hypothesis among several.

You attempted to have it both ways by making a serious insinuation to undermine my credibility, without any real evidence. When I make a claim, I show receipts. I stand by every claim I made in the original thread exposing abusive behavior because they are factual and supported by my own experience or the experiences of people I interviewed.

EDIT: The fact that this comment hit −9 agreement is wild.

EDIT Jan 16, 2026: There’s also the fact that I broke things off with the girl, which isn’t very congruent with the implied narrative. His serious insinuation is wrong, and it’s not close—it’s not even close enough to true that rationalization could realistically cloud my answer. Duncan should strike his baseless claim and clear the air.
[1] Basically, there were a bunch of factors which began to concern me over the course of 2023, having to do with my friends’ and acquaintances’ experiences with Nate. Also, I learned that Nate did not warn them first (despite his having written to me that warning people seemed like a great idea and that he’d do so going forwards). Kurt’s testimony was the largest individual jump in my concern and upsetness.

TurnTrout 9 Jan 2026 19:57 UTC
2 points
−21
on: The Evolution Argument Sucks

Of course, no matter how flawed the comparison to evolution is, there doesn’t seem to be any competing analogy which makes the same argument in a more defensible manner. And people love analogies. A friend, reading this post, told me (paraphrased): “give a better analogy, then, if this one isn’t good”. I have to admit, I have no better analogy.

A better, more mechanistically relevant analogy is within-lifetime human reward circuitry (outer) and learned human values (inner). However, it doesn’t yield the same conclusions (which I think is good). I think it’s more relevant due to greater similarity in mechanism to LLMs (locally randomly initialized networks updated by a local update rule using predictive and reinforcement learning, also trained on a lot of language data), but still not quite as relevant as actual LLM experiments.

I agree that we should stop with the analogies. Gather evidence to learn how it actually works. Let go of these old arguments that we don’t need anymore.

TurnTrout 29 Dec 2025 18:22 UTC
5 points
2
in reply to: bodry’s comment on: An Opinionated Guide to Privacy Despite Authoritarianism

That is misleading. He was not arrested under suspicion of being an illegal alien so the ID part is irrelevant. ICE was in the process of clearing a protest.

On further reflection, I think it’s more accurate to say “DHS later claimed he was not arrested under suspicion of being an illegal alien” while noting DHS has lied in similar situations. The agents refused to tell George why they were arresting him. He tried to get them to check his car for citizenship proof but they refused. So I don’t think my original quote is misleading in any substantial way: George looks Hispanic, they wouldn’t say why they were arresting him, and he had ID but they wouldn’t go check. DHS later claimed the arrest was for protest reasons.

TurnTrout 26 Dec 2025 19:12 UTC
4 points
0
in reply to: bodry’s comment on: An Opinionated Guide to Privacy Despite Authoritarianism
I disagree with much of what you wrote.

That is misleading. He was not arrested under suspicion of being an illegal alien so the ID part is irrelevant.

EDIT: Actually, this is correct. I kept reading and found specific information supporting your point. Thanks!

I think the reason this is salient is, DHS only claimed after the fact that they arrested him for assault. At the time he wasn’t given info, so he remarked “wtf my ID was right there, why am I being arrested when I can prove citizenship?”.

Mobile Fortify draws from several databases and I don’t think ICE has overwrite access to any of them.

ICE goes around laws to draw extra data all the time (though that’s read access, not write). Nominal access controls are not being respected right now (though that doesn’t mean every single control is being violated). You can also look at DOGE / social security data, etc.

ICE officials have told us that an apparent biometric match by Mobile Fortify is a ‘definitive’ determination of a person’s status and that an ICE officer may ignore evidence of American citizenship—including a birth certificate—if the app says the person is an alien.

—Ranking member of the House Homeland Security Committee Bennie G. Thompson (D.-Miss.)

If this claim is true then there would be direct evidence of that happening. There should be no need to rely on word of mouth.

I don’t think it’s reasonable to call this word-of-mouth. My comment provided credible evidence that ICE officials made this claim. Maybe it isn’t widespread yet, and maybe it won’t end up happening, but you’re downplaying the chance this happens and overestimating the care ICE demonstrates towards citizens. See also the planned denaturalization quota of 100–200/month in 2026.

A chilling effect may be the intention but it’s not the reality.

I can tell you that quite a few of my friends (my target demographic for this article!) already report their speech being chilled. It’s happening, at least for some groups I care about. Large protests are not strong counterevidence.

Apply for Alignment Mentorship from TurnTrout and Alex Cloud

TurnTrout and cloud

26 Dec 2025 17:20 UTC

40 points

0 comments · 2 min read · LW link

(turntrout.com)

TurnTrout 21 Dec 2025 1:17 UTC
35 points
1
on: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment
Awesome to finally see pretraining experiments. Thank you so much for running these!

Your results bode quite well for pretraining alignment. They may well transform how we tackle the “shallowness” of post-training, open-weight LLM defense, and the alignment of undesired / emergent personas, and they may provide an across-the-board boost in the alignment of the “building blocks” which constitute a pretrained base model. :)
