The comment was written not long after I got to the paragraph that it comments on — I skimmed a few paragraphs past that point and then started writing that comment. So perhaps your arguments need to be reordered, because my response to that paragraph was “that’s obviously completely impractical”. At a minimum, perhaps you should add a forward reference along the lines of “I know this sounds hard, see below for an argument as to why I believe it’s actually feasible”. Anyway, I’m now intrigued, so clearly I should now read the rest of your post carefully, rather than just skimming a bit past that point and then switching to commenting…
…back, I’ve now read the rest of the post. I remain unconvinced that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor” is practicable, and I still think it’s an overstatement of what the rest of your post suggests might be possible — for example: statistical evidence suggesting that X is likely to happen is not a ‘guarantee’ of X, so I think you should rephrase that; I suspect I’m not going to be the only person to bounce off it. LessWrong has a long and storied history of people trying to generate solid mathematical proofs about the safety properties of things whose most compact descriptions are in the gigabytes, and (IMO) no-one has managed it yet. If that’s not in fact what you’re trying to attempt, I’d suggest not sounding like it is.
The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
I think you might get further if you tried approaching this problem from the other direction. If you were a smarter assistant, how could you demonstrate, to the satisfaction of a dumber principal, that they can safely trust you, that you will never give them any advice that could harm them, and that none of this is an elaborate trick that they’re too dumb to spot? I’d like to see at least a sketch of an argument for how that could be done.
I will change that one sentence you bounced off of by adding something like “in expectation.”
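To spell out the distinction (my notation, not the post’s: \(U\) is the principal’s utility, \(\pi_{\text{advice}}\) and \(\pi_{\text{alone}}\) its policies with and without the advisor), the claim weakens from a pointwise guarantee, \(U(\pi_{\text{advice}}, \omega) \ge U(\pi_{\text{alone}}, \omega)\) for every outcome \(\omega\), to a bound on the average:

\[
\mathbb{E}\big[U(\pi_{\text{advice}})\big] \;\ge\; \mathbb{E}\big[U(\pi_{\text{alone}})\big].
\]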
> The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
This doesn’t sound like a description of ARAD at all. I don’t want the smart advisor to convince me to trust it. I want to combine cryptography and sequential decision theory to prove theorems that tell me which types of advice I can safely listen to from an untrusted advisor.
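To gesture at the flavor with a minimal sketch of my own (an illustration, not a protocol from the post): suppose the principal can cheaply evaluate any candidate action with its own utility estimate, even though it can’t search the action space the way a smarter advisor can. Then “verify before you act” advice provably cannot leave it worse off than acting alone:

```python
# Minimal "verify before you act" sketch. Assumption (mine): the principal
# can cheaply EVALUATE any candidate action, even though it cannot SEARCH
# the action space the way a smarter advisor can.
from typing import Callable, TypeVar

Action = TypeVar("Action")

def verified_advice(default: Action,
                    proposal: Action,
                    evaluate: Callable[[Action], float]) -> Action:
    """Adopt the untrusted advisor's proposal only if the principal's OWN
    evaluation prefers it. Trivial theorem: the result always satisfies
    evaluate(result) >= evaluate(default), however adversarial the advisor;
    a malicious proposal can waste one evaluation call, but cannot harm."""
    return proposal if evaluate(proposal) >= evaluate(default) else default

# Toy usage: the advisor searches a space the principal cannot afford to.
evaluate = lambda a: -(a - 3.14) ** 2          # principal's utility estimate
chosen = verified_advice(default=0.0, proposal=3.0, evaluate=evaluate)
assert evaluate(chosen) >= evaluate(0.0)
```

The real work is in the cases where evaluation is noisy, expensive, or sequential; that is where the decision theory (and the cryptography) has to earn its keep.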
Then it appears I have misunderstood your arguments — whether that’s just a failing on my part, or suggests they could be better phrased/explained, I can’t tell you.
One other reaction of mine to your post: you repeatedly mention ‘protocols’ for the communication between principal and agent. To my ear, this sounds a lot like cryptographic protocols, and I immediately want details so that I can do a mathematical analysis of their security properties — but the post doesn’t actually provide details of any protocol that I could find. I think that’s a major part of why I’m getting the sense that the argument contains elements of “now magic happens”.
Perhaps some simple, concrete examples would help here? Even a toy example. Or maybe the word protocol is somehow giving me the wrong expectations?
I seem to be bouncing off this proposal document — I’m wondering if there are unexplained assumptions, background, or parts of the argument that I’m missing?
Right—I wouldn’t describe it as magic, but the vast majority of the math still needs to be done, which includes protocol design. I did give explicit toy examples A) and B).
Clearly I didn’t read your post sufficiently carefully. Fair point: you did address that, and I simply missed it. And yes, you did mean cryptographic protocols, specifically ones of the Merlin-Arthur form.
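For any other readers who, like me, wanted a concrete picture at this point: the simplest degenerate case of a Merlin-Arthur interaction is plain NP-style verification, where Arthur uses no random coins at all. A toy sketch of my own (not a protocol from the post):

```python
# Toy Merlin-Arthur shape. Merlin is computationally unbounded but
# untrusted; Arthur is weak but accepts only what he can check himself.
# Claim: "n is composite"; witness: a nontrivial factor. Finding the
# witness is the hard part, checking it is O(1). (Full Merlin-Arthur also
# gives Arthur random coins; this is the deterministic special case.)

def merlin(n: int) -> int:
    """Untrusted prover: brute-forces a witness (the expensive step)."""
    for d in range(2, n):
        if n % d == 0:
            return d
    raise ValueError(f"{n} is prime, so no witness exists")

def arthur_accepts(n: int, witness: int) -> bool:
    """Weak verifier: a lying Merlin can only cause rejection here,
    never a false acceptance."""
    return 1 < witness < n and n % witness == 0

n = 589                      # 19 * 31
w = merlin(n)                # Merlin does the hard work
assert arthur_accepts(n, w)  # Arthur checks it cheaply
```

The safety property to notice is one-sided: soundness means Merlin can never trick Arthur into accepting a false claim, although a lazy or hostile Merlin can always withhold a proof.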
I suspect that the exposition could be made clearer, or could better motivate readers who are skimming LW posts to engage with it — but that’s a writing suggestion, not a critique of the ideas.