Cole Wyeth comments on Alignment Proposal: Adversarially Robust Augmentation and Distillation

Cole Wyeth 26 May 2025 9:52 UTC
2 points
0
I will change that one sentence you bounced off of by adding something like “in expectation.”
The rest of the post also reads to me rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us, basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
This doesn’t sound like a description of ARAD at all. I don’t want the smart advisor to convince me to trust it. I want to combine cryptography and sequential decision theory to prove theorems that tell me which types of advice I can safely listen to from an untrusted advisor.
- RogerDearnaley 26 May 2025 10:56 UTC
  2 points
  0
  Parent
  Then it appears I have misunderstood your arguments — whether that’s just a failing on my part, or suggests they could be better phrased/explained, I can’t tell you.
  One other reaction of mine to your post: you repeatedly mention ‘protocols’ for the communication between principal and agent. This, to my ear, sounds a lot like cryptographic protocols, and I immediately want details and to do a mathematical analysis of what I believe about their security properties, but this post doesn’t actually provide any details of any protocols that I could find. I think that’s a major part of why I’m getting a sense that the argument contains elements of “now magic happens”.
  Perhaps some simple, concrete examples would help here? Even a toy example. Or maybe the word protocol is somehow giving me the wrong expectations?
  I seem to be bouncing off this proposal document — I’m wondering if there are unexplained assumptions, background, or parts of the argument that I’m missing?
  - Cole Wyeth 26 May 2025 10:59 UTC
    2 points
    0
    Parent
    Right—I wouldn’t describe it as magic, but the vast majority of the math still needs to be done, which includes protocol design. I did give explicit toy examples A) and B).
    - RogerDearnaley 26 May 2025 19:10 UTC
      2 points
      0
      Parent
      Clearly I didn’t read your post sufficiently carefully. Fair point: yes, you did address that, and I simply missed it somehow. Yes, you did mean cryptographic protocols, specifically ones of the Merlin-Arthur form.
      I suspect that your exposition could be made clearer, or could better motivate readers who are skimming LW posts to engage with it, but that’s a writing suggestion, not a critique of the ideas.