Was a philosophy PhD student, left to work at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Now executive director of the AI Futures Project. I subscribe to Crocker’s Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html
Some of my favorite memes:
[meme, by Rob Wiblin]
[xkcd comic]
My EA Journey, depicted on the whiteboard at CLR:
[whiteboard photo, h/t Scott Alexander]
Drive-by hot take; my apologies for not having read the actual post yet:
Desired behavior in training is itself just a proxy for what we actually care about: desired behavior in the high-stakes deployment cases.
So yeah, I don't think there's a strong principled reason to train on desired behavior but not on CoT or whatever. However, it's very important that there be SOME held-out test, so to speak, that lets you tell what the model is really thinking: something you don't train on, and don't train on anything remotely similar to either, so that you can make a decent case that the model hasn't learned to fool your test. And I think CoT is the best candidate to play that role, though I'm open to other suggestions. (And ideally we'd have multiple held-out test sets instead of just one, tbf.)
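To make the "hold CoT out of training" idea concrete, here's a minimal sketch (not from the post; it assumes a standard next-token-prediction setup, and `cot_mask` / `flag_fn` are hypothetical names): the loss is computed only on non-CoT token positions, so the chain of thought gets no direct optimization pressure, while a separate no-gradient monitor reads it for evaluation.

```python
import torch
import torch.nn.functional as F


def answer_only_loss(logits, tokens, cot_mask):
    """Cross-entropy over non-CoT tokens only.

    `cot_mask` (assumed: bool tensor, same shape as `tokens`) is True at
    chain-of-thought positions. Excluding those positions from the loss
    means the CoT receives no direct optimization pressure, so it can
    later serve as a held-out window into what the model is "really
    thinking".
    """
    # Shift for next-token prediction: position t predicts token t+1.
    shifted_logits = logits[:, :-1, :]
    targets = tokens[:, 1:]
    keep = ~cot_mask[:, 1:]  # train only where the *target* is not CoT

    loss = F.cross_entropy(
        shifted_logits.reshape(-1, shifted_logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    ).reshape(targets.shape)
    return (loss * keep).sum() / keep.sum().clamp(min=1)


@torch.no_grad()
def monitor_cot(tokens, cot_mask, flag_fn):
    """Evaluation only: read the CoT with no gradients flowing back.

    `flag_fn` is a hypothetical stand-in for whatever CoT monitor you
    use (a classifier, another model, keyword checks, ...).
    """
    return flag_fn(tokens[cot_mask])
```

The same caution applies one level up: if you start selecting models or tuning hyperparameters based on what the monitor flags, the CoT is no longer truly held out, which is one reason to want multiple independent held-out tests rather than just one.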