I was also tripped up when I read this part. Here’s my best steelman; please let me know if it’s right @Cole Wyeth. (Note: I actually wrote most of this yesterday and forgot to send it, so apologies if it doesn’t address other relevant points you’ve made in the comments.)
One kind of system that seems quite safe would be an oracle that can write a proof for any provable statement in Lean, connected to a proof assistant that checks the proof and tells you only whether it succeeds. Assuming this system has no other way to exert influence on the world, it seems pretty clear that it would be quite difficult for the oracle to find a strategy to harm you, even if for some reason it wanted to.
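Concretely, the setup I have in mind looks something like the sketch below (Python; `oracle.prove` is a hypothetical stand-in for the untrusted model, and a real deployment would need actual sandboxing and network isolation, not just this wrapper):

```python
import subprocess
import tempfile

def check_proof(lean_source: str) -> bool:
    """Check candidate Lean source with the proof assistant.

    Only the success bit leaves this function: the proof text itself is
    discarded, so the checker's verdict is the oracle's entire output
    channel to us.
    """
    with tempfile.NamedTemporaryFile(mode="w", suffix=".lean") as f:
        f.write(lean_source)
        f.flush()
        try:
            # `lean` exits nonzero if the file fails to type-check.
            result = subprocess.run(
                ["lean", f.name],
                capture_output=True,  # never surface the oracle's text to a human
                timeout=600,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a stalled proof attempt as a failure
    return result.returncode == 0

def ask(statement: str, oracle) -> bool:
    """Ask the untrusted oracle for a proof of `statement`; report only
    whether the checker accepted it."""
    candidate = oracle.prove(statement)  # hypothetical untrusted call
    return check_proof(candidate)
```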
Caveats to the above:
It’s totally possible that you could in fact be harmed by these proofs. For example, maybe you use this system to help you discover a proof that P=NP, which leads to hackers breaking into a bunch of systems and stealing your data. However, the harm here is not a consequence of the AI being adversarial. It’s like hurting yourself by mishandling a power tool: the fault is yours, not the tool’s.
The oracle could intentionally harm you, since it can repeatedly transmit one bit of information (whether or not it succeeded at the proof). Maybe it uses this channel by selectively proving just the statements (like P=NP) whose solutions will collectively throw the world into chaos, or its pattern of successes and failures somehow spells out a message on your monitoring dashboard that hacks the mind of anyone who looks at it. We can try to mitigate this by querying the oracle only, say, 100 times a year, but who’s to say 100 bits isn’t enough to transmit a mind-hacking message? At a certain point we’ll have to fall back on informal arguments that the AI can’t harm us, but I feel like this is fine as long as those arguments are very strong.
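For what it’s worth, the query cap itself is trivial to enforce mechanically; here’s a minimal sketch assuming the `ask` function from the sketch above (all names illustrative, and this is the easy part of the problem, not a safety argument):

```python
import time

class RateLimitedOracle:
    """Wrapper enforcing a hard cap on success-bits released per year.

    With a budget of 100 queries/year, the oracle's output channel is at
    most 100 bits/year, regardless of which statements we choose to ask.
    """

    SECONDS_PER_YEAR = 365 * 24 * 3600

    def __init__(self, ask_fn, budget_per_year: int = 100):
        self.ask_fn = ask_fn
        self.budget = budget_per_year
        self.window_start = time.time()
        self.used = 0

    def ask(self, statement: str) -> bool:
        now = time.time()
        if now - self.window_start >= self.SECONDS_PER_YEAR:
            self.window_start, self.used = now, 0  # start a fresh budget window
        if self.used >= self.budget:
            raise RuntimeError("annual query budget exhausted")
        self.used += 1
        return self.ask_fn(statement)
```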
This is right, with the additional intuition that the oracle would seemingly have to be much, much smarter than us to use those 100 bits against us, even if that’s possible at all.
And also: I would like to generalize this argument beyond rigorous mathematical proofs.
Security thinking:

We have a communication channel to a dangerously expansionist and militaristic alien civilization, and excellent surveillance of them (we managed to obtain a copy of their Internet archive), so we know a lot about the current state of their culture. We can send them messages, but since they are paranoid they will basically always disregard these unless they’re valid, checkable mathematical proofs. We’re pretty sure that if we let them expand they will start an interstellar war and destroy us, so we need to crash their civilization by sending them mathematical proofs. What do we send them? Assume our math is a millennium ahead of theirs, and theirs is roughly present-day.
There are way more bits available to the aliens/AI if they are allowed to choose what mathematical proofs to send. In my hypothetical, the only choice they can make is whether to fail to produce a valid proof. We don’t even see the proof itself, since we just run it through the proof assistant and discard it.
I’d missed that, and I agree it makes a huge difference.
However, I don’t think a culture that isn’t willing to pause AI development entirely would accept your proposal.
Yeah, I made the most conservative possible proposal to make a point, but there’s probably some politically viable middle ground somewhere.
Yes, I thought I specified this in the post, but maybe it is not clear.