While I think debate might be a useful prosaic method for some levels of capability, I have a variety of concerns about this approach yielding worst-case guarantees[1]:
Exploration hacking
I think the argument pushes large parts of the problem into requiring a universal solution to exploration hacking. Prosaically handling certain cases of exploration hacking for certain levels of capability seems very doable, but the approach discussed in the paper requires a very strong notion of solving exploration hacking (e.g., the AI would tell you any secret it might know via any mechanism you can verify, even if that secret would be arbitrarily hard to explore into, such as the AIs knowing some arbitrary password you can check).
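To make the password case concrete, here is a minimal toy calculation (hypothetical numbers, not from the paper) of why naive RL exploration can't surface a verifiable secret on its own: if reward only flows when the exact secret is emitted, random exploration essentially never hits it.

```python
import math

# Hypothetical toy numbers: the AI "knows" a 20-character alphanumeric secret,
# and training only rewards emitting the exact string. If the policy never
# chooses to reveal it, exploration has to stumble onto it by chance.
alphabet_size = 62            # a-z, A-Z, 0-9
secret_length = 20
num_rollouts = 10**12         # a very generous training budget

p_per_rollout = alphabet_size ** -secret_length
# Use log1p/expm1 to avoid floating-point underflow for tiny probabilities.
p_ever_found = -math.expm1(num_rollouts * math.log1p(-p_per_rollout))

print(f"P(one rollout guesses the secret): {p_per_rollout:.1e}")       # ~1.4e-36
print(f"P(found in {num_rollouts:.0e} rollouts): {p_ever_found:.1e}")  # ~1.4e-24
# Verifying a revealed secret is trivial; exploring into it without the
# AI's cooperation is hopeless.
```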
A universal solution to exploration hacking would itself require solving a particular alignment problem which IMO contains most of the difficulty of the overall alignment problem: fully solving exploration hacking effectively requires “we can ensure the AI will give us any knowledge that we can verify”, which is effectively the same as “we can align an AI to try to give us any knowledge that we can verify”, and this doesn’t obviously seem importantly easier than the overall alignment problem without further restrictions. Importantly, “giving knowledge” might itself be a difficult task due to ontology translation (see, e.g., the counterexamples in the ELK report). Another way to put this is that solving exploration hacking (in the worst case) seems almost as difficult as being able to elicit arbitrary knowledge.
What exactly is the limit that this approaches?
The paper discusses an assumption of “good human input” as being sufficient. However, the relevant question is whether the idealized system (the limit of a well-elicited debate) actually finds truth in a given case, and it seems very unclear whether it will find truth in the cases we care about. (E.g., does a reasonable human judge result in good behavior in the infinite structure, or does something crazy happen? Do we somehow do better due to approximation?)
Minimally, answering some questions might involve hard-to-justify heuristics, and there are issues with persuasion.
In some insane limit, anything is deductively verifiable (except the prior) using Solomonoff induction or similar, but surely this isn’t going to actually save us! (For instance, the AIs will be incapable of arguing about this, and there isn’t any reason why a plausible amount of training would elicit this without a magic solution to exploration hacking, which would put all the difficulty in that part. At this point, you could also just debate over purely logical formal statements going down to Solomonoff induction or some other intractable computational structure.)
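For reference on just how idealized this limit is, recall the standard definition of Solomonoff's universal prior over output strings (stated here as background, not as anything from the paper):

$$M(x) \;=\; \sum_{p\,:\,U(p)\,=\,x*} 2^{-|p|}$$

where $U$ is a universal (monotone) machine and the sum ranges over programs $p$ whose output begins with $x$. $M$ is lower-semicomputable but not computable, so no bounded debater or judge could actually evaluate it or faithfully argue from it.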
Mundane exploration difficulties might be fatal as the debate tree could be arbitrarily large
Putting aside exploration hacking, it’s unclear whether this yields reasonable results for more mundane reasons, like exploration being hard. And the level of exploration we need is extremely demanding, as we’re trying to elicit good performance over an exponentially large structure that we can’t exhaustively explore.
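To gesture at the scale with made-up but not unreasonable numbers (illustrative only, not from the paper): even a modest debate tree dwarfs any plausible RL sampling budget.

```python
# Illustrative, made-up numbers: a debate tree with branching factor b and depth d
# has b**d leaves; compare that to an optimistic count of training rollouts, where
# each rollout reaches at most one leaf.
branching_factor = 10     # responses considered at each step (assumed)
depth = 30                # recursion depth of the debate (assumed)
rollouts = 10**10         # generous number of training episodes (assumed)

num_leaves = branching_factor ** depth
leaf_coverage = rollouts / num_leaves   # upper bound on the fraction of leaves visited

print(f"leaves in the tree:    {num_leaves:.1e}")         # 1.0e+30
print(f"fraction ever visited: <= {leaf_coverage:.1e}")   # 1.0e-20
# Training touches a vanishing sliver of the structure whose equilibrium
# behavior the guarantee relies on.
```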
It also seems like you won’t know whether you’ve reached a sufficiently optimal level, as I expect you’ll always see returns in debater performance (as measured by, e.g., Elo) with further RL. So you’ll need some way of understanding what level of optimality suffices for eliciting the knowledge the AI already had.
[1] I’ve also discussed these concerns in a call where the authors were present, but I thought it might be helpful to quickly write them up.
For more discussion of the hard cases of exploration hacking, see the comments on this post.
I also recommend Buck’s and my recent post on exploration hacking for further reading :)