Fabien Roger comments on Davidad’s Bold Plan for Alignment: An In-Depth Explanation

Fabien Roger 10 Apr 2024 21:36 UTC
LW: 22 AF: 15
7
AF
I don’t think I understand what is meant by “a formal world model”.
For example, in the narrow context of “I want to have a screen on which I can see what python program is currently running on my machine”, I guess the formal world model should be able to detect if the model submits an action that exploits a zero-day that tampers with my ability to see what programs are running. Does that mean that the formal world model has to know all possible zero-days? Does that mean that the software and the hardware have to be formally verified? Are formally verified computers roughly as cheap as regular computers? If not, that would be a clear counter-argument to “Davidad agrees that this project would be one of humanity’s most significant science projects, but he believes it would still be less costly than the Large Hadron Collider.”
Or is the claim that it’s feasible to build a conservative world model that tells you “maybe a zero-day” very quickly once you start doing things not explicitly within a dumb world model?
I feel like this formally-verifiable computers claim is either a good counterexample to the main claims, or an example that would help me understand what the heck these people are talking about.
What links here?
- ryan_greenblatt's comment on Davidad’s Bold Plan for Alignment: An In-Depth Explanation by Charbel-Raphaël (13 Jan 2025 23:48 UTC; 33 points)
- ryan_greenblatt's comment on Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems by Gunnar_Zarncke (17 May 2024 4:08 UTC; 26 points)
- Paul W 16 May 2024 8:25 UTC
  3 points
  2
  Parent
  I believe that the current trends for formal verification, say, of traditional programs or small neural networks, are more about conservative overapproximations (called abstract interpretations). You might want to have a look at this: https://caterinaurban.github.io/pdf/survey.pdf
  To be more precise, it appears that so-called “incomplete formal methods” (3.1.1.2 in the survey I linked) are more computationally efficient, even though they can produce false negatives.
  Does that answer your question ?
  - Fabien Roger 17 May 2024 12:51 UTC
    2 points
    0
    Parent
    Not entirely. This makes me slightly more hopeful that we can have formal guarantees of computer systems, but is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase? (Given a budget of 1 LHC for both the codebase inefficiency tax + the time to build the formal guarantees for the codebase.)
    (And computer systems are easy mode, I don’t even know how you would start to build guarantees like “if you say X, then it’s proven that it doesn’t persuade humans of things in ways they would not have approved of beforehand.”)
    - Paul W 18 May 2024 16:04 UTC
      1 point
      2
      Parent
      Is the field advanced enough that it would be feasible to have a guaranteed no-zero-day evaluation and deployment codebase that is competitive with a regular codebase?
      As far as I know (I’m not an expert), such absolute guarantees are too hard right now, especially if the AI you’re trying to verify is arbitrarily complex. However, the training process ought to yield an AI with specific properties. I’m not entirely sure I got what you meant by “a guaranteed no-zero-day evaluation and deployment codebase”. Would you mind explaining more ?
      
      “Or is the claim that it’s feasible to build a conservative world model that tells you “maybe a zero-day” very quickly once you start doing things not explicitly within a dumb world model?”
      I think that’s closer to the idea: you {reject and penalize, during training} as soon as the AI tries something that might be “exploiting a zero-day”, in the sense that the world-model can’t rule out this possibility with high confidence^[1]. That way, the training process is expected to reward simpler, more easily verified actions.
      
      Then, a key question is “what else you do want from your AI ?”: of course, it is supposed to perform critical tasks, not just “let you see what program is running”^[2], so there is tension between the various specifications you enter. The question of how far you can actually go, how much you can actually ask for, is both crucial, and wide open, as far as I can tell.
      ^
      Some of the uncertainty lies in how accurate and how conservative the world-model is; you won’t get a “100% guarantee” anyway, especially since you’re only aiming for probabilistic bounds within the model.
      ^
      Otherwise, a sponge would do.
      - Fabien Roger 20 May 2024 12:06 UTC
        2 points
        0
        Parent
        I was thinking that the formal guarantees would be about state evaluations (i.e. state → badness bounds) - which would require sth like “showing there is no zero-day” (since “a code-base with a zero-day” might be catastrophically bad if no constraints are put on actions). Thanks for pointing out they can be about action (i.e. (state, action) → badness bounds), which seem intuitively easier to get good bounds for (you don’t need to show there are no zero-days, just that the currently considered action is extremely unlikely to exploit a potential zero-day).
        I’d be curious to know what kind of formal process could prove that (codebase, codebase-interaction) pairs are provably not-bad (with high probability, and with a false positive rate low enough if you trained an AI to minimize it). My guess is that there is nothing like that on the horizon (that could become competitive at all), but I could be wrong.
        (“let you see what program is running” was an example of a very minimal safety guarantee I would like to have, not a representative safety guarantee. My point is that I’d be surprised if people got even such a simple and easy safety guarantee anytime soon, using formal methods to check AI actions that actually do useful stuff.)