Daniel Kokotajlo

Karma: 18,808

Philosophy PhD student, worked at AI Impacts, then Center on Long-Term Risk, then OpenAI. Quit OpenAI due to losing confidence that it would behave responsibly around the time of AGI. Not sure what I’ll do next yet. Views are my own & do not represent those of my current or former employer(s). I subscribe to Crocker’s Rules and am especially interested to hear unsolicited constructive criticism. http://sl4.org/crocker.html

Some of my favorite memes:

(by Rob Wiblin)

Comic. Megan & Cueball show White Hat a graph of a line going up, not yet at, but heading towards, a threshold labelled "BAD". White Hat: "So things will be bad?" Megan: "Unless someone stops it." White Hat: "Will someone do that?" Megan: "We don

(xkcd)

My EA Journey, depicted on the whiteboard at CLR:

(h/t Scott Alexander)

Daniel Kokotajlo 12 May 2024 15:47 UTC
6 points
4
on: Applying refusal-vector ablation to a Llama 3 70B agent
In general, Llama 3 70B is a competent agent with appropriate scaffolding, and Llama 3 8B also has decent performance.

I’m curious about, and skeptical of, this claim. If you set it up in an Auto-GPT-esque scaffold with connections to the internet and the ability to edit docs, post forum comments, send emails, and so forth, and set it loose with some long-term goal like “accumulate money” or “befriend people” or whatever… does it actually chug along for hours and hours moving vaguely in the right direction, or does it e.g. get stuck pretty quickly or go into some sort of confused doom spiral?

Daniel Kokotajlo 10 May 2024 19:27 UTC
109 points
14
in reply to: robo’s comment on: simeon_c’s Shortform
The latter. Yeah idk whether the sacrifice was worth it but thanks for the support. Basically I wanted to retain my ability to criticize the company in the future. I’m not sure what I’d want to say yet though & I’m a bit scared of media attention.

Daniel Kokotajlo 10 May 2024 15:23 UTC
209 points
2
in reply to: habryka’s comment on: simeon_c’s Shortform
To clarify: I did sign something when I joined the company, so I’m still not completely free to speak (still under confidentiality obligations). But I didn’t take on any additional obligations when I left.

Unclear how to value the equity I gave up, but it probably would have been about 85% of my family’s net worth at least. But we are doing fine, please don’t worry about us.

Daniel Kokotajlo 8 May 2024 16:26 UTC
2 points
0
in reply to: Sammy Martin’s comment on: Uncontrollable Super-Powerful Explosives
Fair enough.

Daniel Kokotajlo 7 May 2024 23:52 UTC
2 points
1
in reply to: Jiro’s comment on: Q&A on Proposed SB 1047
What do you mean by “of this type”? Why not just say “laws” full stop? What type are you referring to?

Daniel Kokotajlo 7 May 2024 0:27 UTC
LW: 9 AF: 6
0
AF
on: Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Do you have any sense of whether or not the models thought they were in a simulation?

Daniel Kokotajlo 4 May 2024 20:50 UTC
21 points
10
in reply to: Akram Choudhary’s comment on: Please stop publishing ideas/insights/research about AI
I don’t think our chances of survival will increase if LessWrong becomes substantially more risk-averse about publishing research and musings about AI. I think they will decrease.

Daniel Kokotajlo 2 May 2024 18:55 UTC
32 points
28
on: Please stop publishing ideas/insights/research about AI
“If nobody publishes anything, how will alignment get solved?” — sure, it’s harder for alignment researchers to succeed if they don’t communicate publicly with one another — but it’s not impossible. That’s what dignity is about. A

Huh, I have the opposite intuition. I was about to cite that exact same “Death with dignity” post as an argument for why you are wrong: it’s undignified for us to stop trying to solve the alignment problem, and to stop publicly discussing the problem with each other, out of fear that some of our ideas might accidentally percolate into OpenAI and cause them to go slightly faster, and that this speedup might make the difference between victory and defeat. The dignified thing to do is think and talk about the problem.

Daniel Kokotajlo 1 May 2024 20:26 UTC
15 points
9
in reply to: cubefox’s comment on: TurnTrout’s shortform feed
I think this is also what I was confused about: TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts, e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but about the possibility of independent activation? But what does that mean? Perhaps it means: the shards aren’t always applied; rather, the circuitry fires only in some circumstances, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don’t understand this.
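
A toy Python sketch of that tentative reading may make the contrast concrete. It is an illustration under stated assumptions, not a definition taken from shard theory: the names `utility`, `Shard`, `is_active`, and `shard_evaluation` are hypothetical, and the weights reuse the +10 and +5 numbers from the example above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "world" is just a count of items present, e.g. {"ice_cream": 2, "cookies": 0}.
World = Dict[str, int]


def utility(world: World) -> int:
    """One utility function: every term is always summed, even when its count is zero."""
    return 10 * world.get("ice_cream", 0) + 5 * world.get("cookies", 0)


@dataclass
class Shard:
    """Hypothetical shard: it contributes only when its activation condition holds."""
    name: str
    is_active: Callable[[World], bool]
    value: Callable[[World], int]


def shard_evaluation(world: World, shards: List[Shard]) -> Dict[str, int]:
    """Return the contributions of the shards that fire in this world, and nothing else."""
    return {s.name: s.value(world) for s in shards if s.is_active(world)}


# Assumed activation rule: a shard fires only if its item is actually around.
SHARDS = [
    Shard("ice_cream", lambda w: w.get("ice_cream", 0) > 0, lambda w: 10 * w["ice_cream"]),
    Shard("cookies", lambda w: w.get("cookies", 0) > 0, lambda w: 5 * w["cookies"]),
]
```

On this reading, `utility({"ice_cream": 2, "cookies": 0})` still evaluates the cookie term (which happens to be zero), while `shard_evaluation` on the same world never runs the cookie shard at all. Whether that difference amounts to anything substantive is exactly the open question in the comment.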

Daniel Kokotajlo 1 May 2024 19:14 UTC
5 points
0
in reply to: Daniel Kokotajlo’s comment on: Shane Legg’s necessary properties for every AGI Safety plan
Did Shane leave a way for people who didn’t attend the talk, such as myself, to contact him? Do you think he would welcome such a thing?

Daniel Kokotajlo 1 May 2024 19:12 UTC
18 points
13
on: Shane Legg’s necessary properties for every AGI Safety plan
Thanks for sharing!

So, it seems like he is proposing something like AutoGPT except with a carefully thought-through prompt laying out what goals the system should pursue and constraints the system should obey. Probably he thinks it shouldn’t just use a base model but should include instruction-tuning at least, and maybe fancier training methods as well.

This is what labs are already doing? This is the standard story for what kind of system will be built, right? Probably 80%+ of the alignment literature is responding at least in part to this proposal. Some problems:
- How do we get a model that is genuinely robustly trying to obey the instruction text, instead of e.g. choosing actions on the basis of a bunch of shards of desire/drives that were historically reinforced, and which are not well-modelled as “it just tries to obey the instruction text”? Or e.g. trying to play the training game well? Or e.g. nonrobustly trying to obey the instruction text, such that in some future situation it might cease obeying and do something else? (See e.g. Ajeya Cotra’s training game report.)
- Suppose we do get a model that is genuinely robustly trying to obey the instruction text. How do we ensure that its concepts are sufficiently similar to ours, that it interprets the instructions in the ways we would have wanted them to be interpreted? (This part is fairly easy I think but probably not trivial and at least conceptually deserves mention)
- Suppose we solve both of the above problems. What exactly should the instruction text be? Normal computers ‘obey’ their ‘instructions’ (the code) perfectly, yet unintended catastrophic side-effects (bugs) are typical. Another analogy is to legal systems: laws generally end up with loopholes, bugs, unintended consequences, etc. (As above, this part seems to me to be probably fairly easy but nontrivial.)

Daniel Kokotajlo 27 Apr 2024 15:34 UTC
14 points
4
in reply to: Rudi C’s comment on: AI Regulation is Unsafe
I think most people pushing for a pause are trying to push against a ‘selective pause’ and for an actual pause that would apply to the big labs who are at the forefront of progress. I agree with you, however, that the current Overton window seems unfortunately centered around some combination of evals-and-mitigations that is, IMO, at high risk of regulatory capture (i.e. resulting in a selective pause that doesn’t apply to the big corporations that most need to pause!). My disillusionment about this is part of why I left OpenAI.

Daniel Kokotajlo 27 Apr 2024 15:00 UTC
4 points
2
in reply to: ryan_greenblatt’s comment on: AI Regulation is Unsafe
Good point, I guess I was thinking in that case about people who care a bunch about a smaller group of humans e.g. their family and friends.

Daniel Kokotajlo 27 Apr 2024 0:19 UTC
5 points
3
in reply to: Matthew Barnett’s comment on: AI Regulation is Unsafe
I agree that 0.7% is the number to beat for people who mostly focus on helping present humans and who don’t take acausal or simulation argument stuff or cryonics seriously. I think that even if I was much more optimistic about AI alignment, I’d still think that number would be fairly plausibly beaten by a 1-year pause that begins right around the time of AGI.

What are the mechanisms people have given and why are you skeptical of them?

Daniel Kokotajlo 26 Apr 2024 23:12 UTC
4 points
0
in reply to: Sammy Martin’s comment on: Uncontrollable Super-Powerful Explosives
Perhaps. I don’t know much about the yields and so forth at the time, nor about the specific plans if any that were made for nuclear combat.

But I’d speculate that dozens of kiloton-range fission bombs would have enabled the US and allies to win a war against the USSR. Perhaps by destroying dozens of cities, perhaps by preventing concentrations of defensive force sufficient to stop an armored thrust.

Daniel Kokotajlo 26 Apr 2024 23:07 UTC
2 points
0
in reply to: Matthew Barnett’s comment on: AI Regulation is Unsafe
OK, thanks for clarifying.

Personally I think a 1-year pause right around the time of AGI would give us something like 50% of the benefits of a 10-year pause. That’s just an initial guess, not super stable. And quantitatively I think it would improve overall chances of AGI going well by double-digit percentage points at least. Such that it makes sense to do a 1-year pause even for the sake of an elderly relative avoiding death from cancer, not to mention all the younger people alive today.

Daniel Kokotajlo 26 Apr 2024 14:52 UTC
4 points
2
in reply to: Wei Dai’s comment on: AI Regulation is Unsafe
Big +1 to that. Part of why I support (some kinds of) AI regulation is that I think they’ll reduce the risk of totalitarianism, not increase it.

Daniel Kokotajlo 26 Apr 2024 14:51 UTC
3 points
0
in reply to: Matthew Barnett’s comment on: AI Regulation is Unsafe
So, it sounds like you’d be in favor of a 1-year pause or slowdown then, but not a 10-year?

(Also, I object to your side-swipe at longtermism. According to Wikipedia, “Longtermism is the ethical view that positively influencing the long-term future is a key moral priority of our time.” “A key moral priority” doesn’t mean “the only thing that has substantial moral value.” If you had instead dunked on classic utilitarianism, I would have agreed.)

Daniel Kokotajlo 25 Apr 2024 18:10 UTC
8 points
3
in reply to: Richard_Ngo’s comment on: AI Regulation is Unsafe
Hmm, I’m a bit surprised to hear you say that. I feel like I myself brought up regulatory capture a bunch of times in our conversations over the last two years. I think I even said it was the most likely scenario, in fact, and that it was making me seriously question whether what we were doing was helpful. Is this not how you remember it? Wanna hop on a call to discuss?

As for arguments of that form… I didn’t say X is terrible, I said it often goes badly. If you round that off to “X is terrible” and fit it into the argument-form you are generally against, then I think to be consistent you’d have to give a similar treatment to a lot of common sense good things. Like e.g. doing surgery on a patient who seems likely to die absent treatment.

I also think we might be talking past each other re regulation. As I said elsewhere in this discussion (on the OP’s blog) I am not in favor of generically increasing the amount of AI regulation in the world. I would instead advocate for something more targeted—regulation that I actually think would really work, if implemented well. And I’m very concerned about the “if implemented well” part and have a lot to say about that too.

What are the regulations that you are concerned about, that (a) are being seriously advocated by people with P(doom) >80%, and (b) have a significant chance of causing x-risk via totalitarianism or technical-alignment-incompetence?

Daniel Kokotajlo 23 Apr 2024 19:23 UTC
4 points
0
in reply to: Davidmanheim’s comment on: Uncontrollable Super-Powerful Explosives
Also, the US did consider the possibility of waging a preemptive nuclear war on the USSR to prevent it from getting nukes. (von Neumann advocated for this I think?) If the US had been more of a warmonger, it might have done it, and then there would have been a more unambiguous world takeover.