The protocol is based on a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor.
This sounds to me like the hard part. “Harmed” has to be measured by human values, which are messy, complex, and fragile (and get modified under reflection), so not amenable to mathematical descriptions or guarantees. Probably the most compact description of human values theoretically possible is the entire human genome, which at nearly a gigabyte is quite complex. Making useful statements about this such as “guarantees the principal cannot be harmed by any advice that it chooses to consult” is going to require processing that into a more usable form, which is going to make it a lot larger. That is far more data and complexity than we (or any AI comparable to us) can currently handle by any approach that can fairly be described as mathematical or that yields guarantees. I think you should stop looking for mathematical guarantees of anything around alignment, and start looking at approaches that are more statistical, data-science, or engineering, and that might actually scale to a problem of this complexity.
Or, if you don’t think society should build ASI before we have provable mathematical guarantees that it’s safe, then perhaps you should be working on how to persuade people and countries that we need to pause AGI and ASI development?
At a minimum, I think you need to justify in detail why you believe that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor” is something that we can practically create — I think many readers are going to consider that to be something that would be lovely to have but is completely impractical.
You don’t have to specify human values; you only have to prove that the principal will not be harmed according to its values. The proof should go through for arbitrary principals.
As I made clear in the rest of the post, I am hoping for statistical guarantees with proofs or at least strong heuristic arguments. That is, I am interested in whether the principal is harmed in expectation. Proving mathematical statements provided by the principal is only one example where I expect it to be possible to demonstrate the protocol is adversarially robust. I do not intend your interpretation where the principal effectively asks for a proof involving everything in the entire world.
That is: I’m not looking for a complete proof that the results of following advice deterministically result in some mathematical formalization of human value being satisfied. I’m looking for a mathematical model of which types of advice robustly improve our ability to pursue our goals rationally.
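To sketch the shape of claim I have in mind (the notation here is mine and purely illustrative, not a theorem from the post): write $U$ for the principal’s own utility function, whatever it happens to be. The target is roughly

$$\mathbb{E}\big[\,U \mid \text{principal follows the protocol with an untrusted advisor}\,\big] \;\ge\; \mathbb{E}\big[\,U \mid \text{principal acts alone}\,\big],$$

with the expectation taken over the environment and over the advisor’s (possibly adversarial) strategy, and with the inequality required to hold for arbitrary principals. Nothing in a statement of that form requires spelling out human values; it only quantifies over them.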
Essentially the entire post is already a response to your last paragraph.
The comment was written not long after I got to the paragraph that it comments on — I skimmed a few paragraphs past that point and then started writing that comment. So perhaps your arguments need to be reordered, because my response to that paragraph was “that’s obviously completely impractical”. At a minimum, perhaps you should add a forward reference along the lines of “I know this sounds hard, see below for an argument as to why I believe it’s actually feasible”. Anyway, I’m now intrigued, so clearly I should now read the rest of your post carefully, rather than just skimming a bit past that point and then switching to commenting…
…back, I’ve now read the rest of the post. I remain unconvinced that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor” is practicable, and I still think it’s an overstatement of what the rest of your post suggests might be possible — for example: statistical evidence suggesting that X is likely to happen is not a ‘guarantee’ of X, so I think you should rephrase that: I suspect I’m not going to be the only person to bounce off it. LessWrong has a long and storied history of people trying to generate solid mathematical proofs about the safety properties of things whose most compact descriptions are in the gigabytes, and (IMO) no-one has managed it yet. If that’s not in fact what you’re trying to attempt, I’d suggest not sounding like it is.
The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
I think you might get further if you tried approaching this problem from the other direction. If you were a smarter assistant, how could you demonstrate to the satisfaction of a dumber principal that they could safely trust you, that you will never give them any advice that could harm them, and that none of this is an elaborate trick that they’re too dumb to spot? I’d like to see at least a sketch of an argument for how that could be done.
I will change that one sentence you bounced off of by adding something like “in expectation.”
The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
This doesn’t sound like a description of ARAD at all. I don’t want the smart advisor to convince me to trust it. I want to combine cryptography and sequential decision theory to prove theorems that tell me which types of advice I can safely listen to from an untrusted advisor.
Then it appears I have misunderstood your arguments — whether that’s just a failing on my part, or suggests they could be better phrased/explained, I can’t tell you.
One other reaction of mine to your post: you repeatedly mention ‘protocols’ for the communication between principal and agent. This, to my ear, sounds a lot like cryptographic protocols, and I immediately want details and to do a mathematical analysis of what I believe about their security properties — but this post doesn’t actually provide any details of any protocols that I could find. I think that’s a major part of why I’m getting a sense that the argument contains elements of “now magic happens”.
Perhaps some simple, concrete examples would help here? Even a toy example. Or maybe the word protocol is somehow giving me the wrong expectations?
I seem to be bouncing off this proposal document — I’m wondering if there are unexplained assumptions, background, or parts of the argument that I’m missing?
Right—I wouldn’t describe it as magic, but the vast majority of the math still needs to be done, which includes protocol design. I did give explicit toy examples A) and B).
Clearly I didn’t read your post sufficiently carefully. Fair point: yes, you did address that, and I simply missed it somehow. Yes, you did mean cryptographic protocols, specifically ones of the Merlin-Arthur form.
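(For readers who, like me, hadn’t made that connection: the Merlin-Arthur form is roughly “a computationally bounded, randomized verifier checks a certificate supplied by an unboundedly clever, untrusted prover, and accepts only if the check passes.” A toy sketch of my own in Python, not one of the post’s examples A) or B):)

```python
import random

def is_probable_prime(n: int, rounds: int = 40) -> bool:
    """Miller-Rabin: Arthur's randomized check, cheap compared to factoring."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:
        d //= 2
        r += 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False
    return True

def arthur_accepts(n: int, claimed_prime_factors: list[int]) -> bool:
    """Merlin (untrusted, unboundedly clever) claims these are the prime
    factors of n. Arthur never trusts the claim; he re-multiplies and
    spot-checks primality, so a lying Merlin gains nothing."""
    product = 1
    for f in claimed_prime_factors:
        if not is_probable_prime(f):
            return False
        product *= f
    return product == n
```

If the check passes, Arthur has learned a factorization he could not have found himself; if Merlin lies, Arthur is no worse off than before. That “can benefit, can’t be hurt” shape is what I now understand the post to be reaching for.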
I suspect that your exposition could be made clearer, or could better motivate readers who are skimming LW posts to engage with it — but that’s a writing suggestion, not a critique of the ideas.
I agree. Being able to prove that a piece of real-world advice is harmless using math or logic means either:
Mathematically proving what will happen in the real world if that advice were carried out (which requires a mathematically perfect world model)
or
At least proving that the mind generating the advice has aligned goals, so that it’s unlikely to be harmful (but a provable proxy for alignedness is itself one of the hardest parts of solving alignment)
PS: I don’t want to pile on criticism, because I feel ambitious new solutions need to be encouraged! It’s worthwhile studying chains of weaker agents controlling stronger agents, and I actually love the idea of running various bureaucratic safety processes and distilling the output through imitation. Filtering dangerous advice through protocols seems very rational.
I was also tripped up when I read this part. Here’s my best steelman; please let me know if it’s right, @Cole Wyeth. (Note: I actually wrote most of this yesterday and forgot to send it; sorry it might not address any other relevant points you make in the comments.)
One kind of system that seems quite safe would be an oracle that can write a proof for any provable statement in Lean, connected to a proof assistant which runs the proof and tells you whether it succeeds. Assuming this system has no other way to exert change on the world, it seems pretty clear that it would be quite difficult for the oracle to find a strategy to harm you, if for some reason it wanted to.
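A minimal sketch of the harness I have in mind, in Python. `untrusted_oracle` is a hypothetical stand-in for the advisor, and I’m assuming a standalone `lean` binary on the PATH; a real harness would additionally need to sandbox the checker (Lean elaboration can execute code) and reject escape hatches like `sorry` or new axioms:

```python
import subprocess
import tempfile
import textwrap
from pathlib import Path

def untrusted_oracle(statement: str) -> str:
    """Hypothetical stand-in for the smarter advisor: returns a Lean 4
    tactic proof of `statement`. Its output is never trusted and never
    shown to a human in this sketch."""
    raise NotImplementedError

def oracle_proves(statement: str) -> bool:
    """The whole channel from oracle back to principal is one bit:
    did the proof check or not?"""
    candidate = untrusted_oracle(statement)
    # We state the theorem ourselves, so the oracle cannot quietly
    # prove a different, easier claim instead.
    source = (
        f"theorem goal : {statement} := by\n"
        + textwrap.indent(candidate, "  ")
        + "\n"
    )
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "goal.lean"
        path.write_text(source)
        try:
            # Assumes a plain `lean` binary with no external dependencies.
            result = subprocess.run(
                ["lean", str(path)], capture_output=True, timeout=600
            )
        except subprocess.TimeoutExpired:
            return False
    # stdout and stderr are discarded; only the exit status escapes.
    return result.returncode == 0
```

The design goal is that the only thing that ever flows from the oracle back to us is the exit status.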
Caveats to the above:
It’s totally possible that you could in fact be harmed by these proofs. For example, maybe you use this system to help you discover a proof that P=NP, which leads to hackers breaking into a bunch of systems and stealing your data. However, the fact that you were harmed is not a consequence of the AI being adversarial. This is like you handling a power tool improperly and hurting yourself; the harm is your fault and not the tool’s.
The oracle could intentionally harm you, since it can repeatedly transmit one bit of information (whether or not it succeeded at the proof). Maybe it can use this channel by selectively proving just the statements (like P=NP) whose solutions will collectively throw the world into chaos, or its choice of whether or not to solve a problem will somehow spell out a message in your monitoring dashboard, which will then hack the mind of anyone who looks at it. We can try to solve this by only querying the oracle e.g. 100 times a year, but who’s to say 100 bits isn’t enough to transmit a mind-hacking message? At a certain point, we’ll have to fall back to informal arguments that the AI can’t harm us, but I feel like this is fine as long as those arguments are very strong.
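To put the bit-budget worry in its simplest form (this is just standard information theory, nothing from the post): if we allow $N$ yes/no queries, the full transcript the oracle can influence is an element of $\{0,1\}^N$, so

$$H(\text{transcript}) \le \log_2 \left|\{0,1\}^N\right| = N \text{ bits}.$$

Rate-limiting to 100 queries a year therefore caps the oracle’s influence channel at 100 bits a year; the open question is whether 100 adversarially chosen bits are enough to matter.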
This is right, with the additional intuition that it seems the oracle would have to be much, much smarter than us to use those 100 bits against us, even if that’s possible.
And also: I would like to generalize this argument beyond rigorous mathematical proofs.
Security thinking:

We have a communication channel to a dangerously expansive and militaristic alien civilization, and excellent surveillance of them (we managed to obtain a copy of their Internet archive), so we know a lot about the current state of their culture. We can send them messages, but since they are paranoid they will basically always disregard these, unless they’re valid checkable mathematical proofs. We’re pretty sure that if we let them expand they will start an interstellar war and destroy us, so we need to crash their civilization by sending them mathematical proofs. What do we send them? Assume our math is a millennium ahead of theirs, and theirs is about current-day.
There are way more bits available to the aliens/AI if they are allowed to choose what mathematical proofs to send. In my hypothetical, the only choice they can make is whether to fail to produce a valid proof. We don’t even see the proof itself, since we just run it through the proof assistant and discard it.
I’d missed that, and I agree it makes a huge difference.
However, I don’t think a culture that isn’t willing to pause AI development entirely would accept your proposal.
Yeah, I made the most conservative possible proposal to make a point, but there’s probably some politically-viable middle ground somewhere.
Yes, I thought I specified this in the post, but maybe it is not clear.