The difference between this proposal and IDA is that, in IDA, the intelligence comes from the amplification step, where multiple copies of an existing model collaborate (or one copy thinks at higher speed or whatever) to get new smarts that were not initially present in that model.
…Whereas in this proposal, the intelligence comes from some unspecified general-purpose AI training approach that can by itself make arbitrarily smart AIs. But we choose not to run that process as hard as we can to get superintelligence directly in one step. Instead, we only run that process incrementally, to make a series of AI advisors, each mildly smarter than the last.
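To make the contrast concrete, here is a purely illustrative toy sketch of the "incremental series of advisors" loop. Everything in it (the `Advisor` class, `train_advisor`, the capability numbers) is invented for illustration and stands in for the unspecified general-purpose training process; the point is only that the process is run in small capped increments rather than to superintelligence in one shot.

```python
# Toy sketch only: "capability" is just a number and training is a stub.
from dataclasses import dataclass

@dataclass
class Advisor:
    capability: float  # stand-in for "how smart this advisor is"

def train_advisor(capability_cap: float) -> Advisor:
    """Stub for the unspecified general-purpose training process,
    deliberately run only up to a modest capability cap."""
    return Advisor(capability=capability_cap)

def grow_advisor_series(start: float, step: float, n_steps: int) -> list[Advisor]:
    """Produce a series of advisors, each only mildly smarter than the last,
    instead of training a single maximally capable system in one step."""
    advisors = []
    cap = start
    for _ in range(n_steps):
        cap += step  # small capability increment per round
        advisors.append(train_advisor(capability_cap=cap))
        # In the proposal as described, the principal (advised by the earlier,
        # dumber advisors) would decide here whether to continue incrementing.
    return advisors

series = grow_advisor_series(start=1.0, step=0.1, n_steps=5)
print([a.capability for a in series])
```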
Yes, though training the advisor through interaction with the principal (or with the protocol the principal will use) is also an important component of the process.
If that’s right, then (1) the alignment tax seems like it would be quite high
Pretty high, but not “try to compete with deep learning using a home-brew AI architecture” high. I remain unconcerned about the alignment tax—I do not think this is the proposal’s weak point.
(2) the caveat “Inner advisors may attempt (causal or acausal) trade with outer advisors” is understating the problem—the advisors don’t necessarily even need to trade at all, they could all simply have the same misaligned goal, such that they all share a mutual interest in working together to ensure that the principal’s brain gets hacked and/or that they get let out of the box. Right?
I think they are very unlikely to have the same misaligned goal, since they will be the result of separate training runs. I meant “trade” to capture coordination around mutual escape, which would also require some kind of agreement between dumber and smarter AIs (since the former would likely be at a disadvantage after escape).
Anyway, yes, this is a risk. That is why the protocol must be carefully designed, and the principal should not allow an advisor to autonomously design the next layer of the protocol. Imagine the advisors as locked in a small dark room, passing notes with theorems (in) and automatically verified proofs (out) on them. That should be your anchor. Then adjust upwards towards more freedom and generality (heuristic proofs with statistical guarantees; constrained freedom to choose among some set of proofs to attempt; MAYBE, if it could be demonstrated safe, granular information about e.g. the stock market and discretized autoregressive predictions). I expect that the advisors will NOT be sending messages directly to each other under any reasonable implementation of this proposal—certainly not in plaintext.
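As a toy illustration of that "notes through the slot" anchor (not a real design: the advisor and the checker below are trivial stand-ins, and the string-matching "verifier" would in practice be a formal proof checker), the only channel between principal and advisor might look something like:

```python
# Toy sketch: statement in, candidate proof out, nothing trusted unless
# an independent mechanical checker verifies it.
from typing import Optional

def untrusted_advisor(statement: str) -> str:
    """Stand-in for the boxed advisor: returns a purported proof string.
    Its output is never trusted directly."""
    return f"proof-of({statement})"

def mechanical_checker(statement: str, proof: str) -> bool:
    """Stand-in for an automatic proof verifier (here, trivial string
    matching; in the real protocol, a formal proof checker)."""
    return proof == f"proof-of({statement})"

def query(statement: str) -> Optional[str]:
    """The entire interface: a theorem goes in, and either a verified
    proof comes out or nothing does."""
    candidate = untrusted_advisor(statement)
    return candidate if mechanical_checker(statement, candidate) else None

print(query("2 + 2 = 4"))  # a verified proof string, or None if the check fails
```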