Here’s my attempt to put your point in my words, such that I endorse it:
Philosophy hats on. What is the difference between a situation where you have to get it right on the first try, and a situation in which you can test in advance? In both cases you’ll be able to glean evidence from things that have happened in the past, including past tests. The difference is that in a situation worthy of the descriptor “you can test in advance,” the differences between the test environment and the high-stakes environment are unimportant. E.g. if a new car model is crash-tested a bunch, that’s considered strong evidence about the real-world safety of the car, because the real-world cars are basically exact copies of the crash-test cars. They probably aren’t literally exact copies, and moreover the crash-test environment is somewhat different from real crashes, but still.

In satellite design, the situation is more fraught—you can test every component in a vacuum chamber, for example, but even then there’s still gravity to contend with. Also, what about the different kinds of radiation and so forth that will be encountered in the void of space? Also, what about the mere passage of time—it’s entirely plausible that e.g. some component will break down after two years, or that an edge case will come up in the code after four years. So… operate an exact copy of the satellite in a vacuum chamber bombarded by various kinds of radiation for four years? That would be close, but still not a perfect test. But maybe it’s good enough in practice… most of the time. (Many satellites do in fact fail, though also, many succeed on the first try.)
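To make the “imperfect test” point concrete, here is a deliberately toy Monte Carlo sketch; the failure model and every number in it are made up purely for illustration. The point is just that if the test environment omits even one stressor that deployment has, the measured failure rate can look reassuring while the deployed one is not.

```python
import random

def component_fails(vacuum: bool, radiation: bool, years: float) -> bool:
    """Toy failure model with invented numbers: each stressor adds risk."""
    p = 0.01                          # baseline failure probability
    p += 0.02 if vacuum else 0.0      # vacuum-related wear
    p += 0.10 if radiation else 0.0   # radiation damage
    p += 0.03 * years                 # degradation over time
    return random.random() < min(p, 1.0)

def failure_rate(trials: int, **env) -> float:
    return sum(component_fails(**env) for _ in range(trials)) / trials

random.seed(0)
# Test environment: vacuum chamber, no radiation, short run.
print("test:  ", failure_rate(100_000, vacuum=True, radiation=False, years=0.1))
# Deployment: vacuum, radiation, four years in orbit.
print("deploy:", failure_rate(100_000, vacuum=True, radiation=True, years=4.0))
```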
Anyhow, now we ask: Does preventing ASI takeover involve any succeed-on-the-first-try situations?
We answer: Yes, because unlike basically every other technology or artifact, the ASI will be aware of whether it is faced with a genuine opportunity to take over or not. It’s like, imagine if your satellite had “Test Mode” and “Launch Mode” with significantly different codebases and a switch on the outside that determined which mode it was in, and for some reason you were legally obligated to only test it in Test Mode and only launch it in Launch Mode. It would be a nightmare; you’d be like “OK, we think we ironed out all the bugs… in Test Mode. Still have no idea what’ll happen when it switches to Launch Mode, but hopefully enough of the code is similar enough that it’ll still work… smh...”
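To caricature that in code: here is a minimal made-up sketch (obviously not how satellites, or AIs, are actually built) of why passing every Test Mode check tells you nothing about the code path that only ever runs in Launch Mode.

```python
from dataclasses import dataclass

@dataclass
class Satellite:
    """Cartoon system with a mode switch the code itself can read."""
    mode: str  # "test" or "launch"

    def deploy_antenna(self) -> str:
        if self.mode == "test":
            # Code path exercised by every pre-launch check.
            return "antenna deployed"
        # Code path that only ever runs after launch.
        return "antenna stuck"  # the bug nobody could observe in advance

# Every test passes...
assert Satellite(mode="test").deploy_antenna() == "antenna deployed"
# ...and none of that is evidence about this call:
print(Satellite(mode="launch").deploy_antenna())
```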
A valid counterargument to this would be “Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
Another valid counterargument to this would be “Before there is an opportunity to take over the whole world with high probability, there will be an opportunity to take over the world with low probability, such as 1%, and an AI system risk-seeking enough to go for it. And this will be enough to solve the problem, because something something it’ll keep happening and let us iterate until we get a system that doesn’t take the 1% chance despite being risk averse...” ok yeah maybe this one is worse.
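As an aside on the arithmetic of that second counterargument: letting systems repeatedly take 1% shots is not obviously survivable. A quick back-of-the-envelope, where the 1% figure comes from the hypothetical above and the attempt counts are my own choice:

```python
# Chance that at least one of n independent takeover attempts succeeds,
# if each attempt has a 1% success probability: 1 - 0.99**n.
# (Back-of-the-envelope only; the 1% is from the hypothetical above.)
for n in (1, 10, 50, 100, 300):
    print(n, round(1 - 0.99 ** n, 3))
# -> 1: 0.01, 10: 0.096, 50: 0.395, 100: 0.634, 300: 0.951
```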
Responding more directly to Buck’s comment, I disagree with this part:
If the capability level at which AIs start wanting to kill you is way lower than the capability level at which they are way better than you at everything, then, before AIs are dangerous, you have the opportunity to empirically investigate the phenomenon of AIs wanting to kill you. For example, you can try out your ideas for how to make them not want to kill you, and then observe whether those worked or not. If they’re way worse than you at stuff, you have a pretty good chance at figuring out when they’re trying to kill you.
...unless we lean into the “way” part of “way lower.” But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
“Ah, but we can construct extremely accurate honeypots / testing environments that simulate a real-world opportunity to take over, and then see what the ASI does.” Valid, but not sound, because we probably can’t actually do that.
I also think it’s important that you can do this with AIs weaker than the ASI, and iterate on alignment in that context.
But then I’d say there is a different important distribution shift, namely, the shift from AIs which are way lower capability, to the AIs which are high capability.
As with Eliezer, I think it’s important to clarify which capability you’re talking about; I think Eliezer’s argument totally conflates different capabilities.