Alignment Proposal: Adversarially Robust Augmentation and Distillation
[Update: For reasons related to my post Paradigms for computation, I now doubt that this is itself a workable alignment plan. I still believe that it serves as a relaxation of the alignment problem which may allow theoretical progress. - CW]
Epistemic Status: Over years of reading alignment plans and studying agent foundations, this is my first serious attempt to formulate an alignment research program that I (Cole Wyeth) have not been able to find any critical flaws in. It is far from a complete solution, but I think it is a meaningful decomposition of the problem into modular pieces that can be addressed by technical means—that is, it seems to overcome many of the philosophical barriers to AI alignment. I have attempted to make the necessary assumptions clear throughout. The main reason that I am excited about this plan is that the assumptions seem acceptable to both agent foundations researchers and ML engineers; that is, I do not believe there are any naive assumptions about the nature of intelligence OR any computationally intractable obstacles to implementation. This program (tentatively ARAD := Adversarially Robust Augmentation and Distillation) owes most of its core ideas to other researchers—in fact, I think it's probably a good sign that it seems superficially like a reinvention of several different existing ideas, while hopefully succeeding as a synthesis that overcomes their limitations. Superimitation seems like the closest idea to me, but (at least) suggests a significantly different implementation. ARAD also owes a lot to HCH and other forms of IDA, but those approaches seem highly unstable and doomed to me; ARAD attempts to stabilize the amplification step using ideas from Abram Demski (who helped me remove some epicycles and formulate the plan in its current form) and @Scott Garrabrant about alignment as “becoming smarter.” It overcomes related issues I see with (Bayesian approaches to) logical uncertainty by taking advantage of high-level ideas reminiscent of Infra-Bayesianism. Still, this plan has not been subjected to serious red-teaming yet and is probably wrong.
(“I” usually refers to Cole, “we” to Abram and Cole)
We have been considering whether alignment is reducible to increasing intelligence without changing values. It seems to be possible for a person to learn about normative decision theory and become a better decision maker without becoming unaligned with their future self. More generally, we usually trust that we would benefit from having more time to think. If we knew how this works, we would have made significant progress on the alignment problem.
Human intelligence can only be scaled so far on our current hardware, so we ultimately want to build an aligned intelligence that runs most of its cognition on a computer. This means that alignment presumably requires a further problem to be solved: at some point, a transition from biological to computer hardware needs to take place.
I argue that overcoming these two types of problem is also sufficient for solving the AI alignment problem, and further that each has a potentially tractable technical path to success. In the context of AI alignment, the safe intelligence increase problem manifests as a principal accepting advice from a slightly more intelligent advisor. The biological → computer transition problem should be handled with a form of imitation learning, which Michael K. Cohen and Marcus Hutter have argued is probably existentially safe. I will not repeat their argument—if this is a crux for you, I suggest reading the linked paper before continuing. They discuss some remaining obstacles here. If you are not convinced, most of our proposal can be performed without imitation learning by paying an additional “alignment tax.”
The rest of this essay takes the following form. First, I will outline at a high level how this alignment scheme can be implemented starting from an initial agent and scaling to a much higher intelligence level. Then, I will discuss the technical problems that need to be solved and why I believe they are tractable; this section should be of interest to skeptical agent foundations researchers. Finally, I will discuss the practical implementation details—in particular, how this scheme should work on the current paradigm. Skeptical “prosaic” alignment researchers (e.g. engineers working on alignment teams) may wish to skip ahead to this section before the technical section, or otherwise trust that this essay will eventually tie back to the current paradigm.
Isn’t this just debate?
The type of protocol I have in mind vaguely resembles debate in inspiration, but my technical approach in the next section seems to be significantly different. In particular, it does not (necessarily) include any debates between AIs. In fact, I think that allowing a human to read a debate between superintelligences is an insane idea. For that reason I have not read very much about debate, which actually means it is possible that there is some (perhaps poorly named) “debate” proposal equivalent to ours.
High-level Implementation
In this section I will outline at a high level how this alignment scheme can be implemented. I will address some immediate obvious objections as they arise, but will mostly leave a discussion of the serious technical holes to the next section.
At the first step, we have an agent with endorsed values (for instance, a human, or possibly some kind of committee of humans, though that seems somewhat harder). This agent takes the role of principal.
Next, an agent slightly smarter than the principal is constructed.
The principal and advisor interact through a carefully constructed protocol. The protocol may simply be a set of rules that the principal follows to choose whether and how to consult the advisor’s advice; alternatively, parts of the protocol may be automated. The protocol is based on a mathematical framework that guarantees the principal cannot be harmed in expectation by any advice that it chooses to consult from a slightly smarter advisor.
The advisor is optimized to provide usable advice to the principal.
The principal-advisor pair forms a (coalitional) agent which, if the previous steps succeed, can be understood as ~perfectly aligned with the principal. The actions of this agent are recorded, and distilled through imitation learning into a successor.
This process is iterated with the successor as principal.
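For concreteness, the loop can be sketched as follows. This is only a schematic, not an implementation: every function named below is a hypothetical placeholder for one of the open problems discussed in the rest of this post.

```python
# Schematic sketch of the ARAD loop. Every function body is a placeholder
# (hypothetical names, not existing APIs); each one stands for an open
# problem discussed elsewhere in this post.

def train_slightly_smarter_advisor(principal):
    # Requires rigorous capabilities evaluations to avoid overshooting the gap.
    raise NotImplementedError

def construct_protocol(principal, advisor):
    # The mathematical framework guaranteeing no harm in expectation from consulted advice.
    raise NotImplementedError

def run_and_record(principal, advisor, protocol):
    # The principal-advisor pair acts as a coalitional agent; its actions are recorded.
    raise NotImplementedError

def distill_by_imitation(recorded_behavior):
    # Faithful imitation learning of the coalitional agent.
    raise NotImplementedError

def arad_loop(initial_principal, n_iterations):
    principal = initial_principal  # an agent with endorsed values, e.g. a human
    for _ in range(n_iterations):
        advisor = train_slightly_smarter_advisor(principal)
        protocol = construct_protocol(principal, advisor)
        recorded_behavior = run_and_record(principal, advisor, protocol)
        principal = distill_by_imitation(recorded_behavior)  # successor becomes the next principal
    return principal
```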
Why not just run a distilled copy at 10x speed?
Assuming that we (un-augmented humans) could create a perfect distilled copy of a human researcher that remains faithful when run for a long time, I would consider this a very promising path! However, it would pay a substantial “alignment tax” against the smartest models, which could probably run at the same speed while being smarter (though perhaps differentially smarter across capabilities). In fact, the alignment tax is so significant that it is not even clear this approach scales to superintelligence—we can only hope it yields a quick outsourced solution to the alignment problem. Overall, I worry that this path is too slow. There is also a chance that imitation learning is not as safe as I hope—in that case, we would also want to use the protocol when interacting with the distilled copies.
ARAD offers something more like a qualitative intelligence increase which comes before the critical hand-off to a distilled copy. See the implementation details for further discussion of the advantages.
Fortunately, imitation learning can be pursued in parallel to constructing the protocol, so it is not really necessary to choose between these two aspects of the plan.
Technical Justification
The construction of this protocol is an open technical problem, but I expect it is possible. Intuitively, I believe that I could receive advice from an equally smart person in a box running 1.5x faster than I am, and by approaching that advice with care, avoid being seriously misled by bad advice—verification is easier than generation, and I am free to ignore any advice that I can’t usefully verify. At the same time, I could benefit from useful advice (in most cases, some trade is possible: the advisor is not incentivized to stay completely silent as long as we have some common interests). I believe that there are mathematical facts about agency which I could use to increase the safe intelligence ratio.
I am not claiming that receiving advice from a smarter agent is generically safe. I am claiming that a wise enough advice-receiver can safely consult a slightly smarter advisor. The degree of wisdom probably increases the maximum safe intelligence gap (on the log scale).
Motivating Examples:
A) Consider a decision problem in which you can either take a constant payout of 6 dollars or accept a gamble in which one of 10 numbers n_1, ..., n_10 is selected with uniform probability. You receive a payout of 10 dollars if the (say) third digit of f^{-1}(n) is 1, where n is the selected number and f is a one-way function. This gamble does not seem to be worth taking, but if an advisor provides inverses for all of n_1, ..., n_10 and they all pay out, you should take it. However, if the advisor only provides inverses for (say) n_1, ..., n_5, you may not want to accept the gamble, because the unverified numbers may have been adversarially selected and accepting may harm your performance.
B) Now imagine that n_1, ..., n_9 will be inverted with f, which is modestly hard to invert, and n_10 will be inverted with g, which you can “prove” (under some cryptographic assumptions) is very hard to invert. If the advisor demonstrates that n_1, ..., n_9 pay out, you should still accept the gamble because in expectation it is worth at least 9 dollars.
Example A demonstrates that pessimism about unknown computational facts can be appropriate in adversarial situations. Example B demonstrates that computational bounds on the advisor can be useful, because they allow “computational uncertainty” to rule out adversarial selection.
Cryptographic Pessimism. We are calling the strategy of being pessimistic about potentially adversarial computational facts “cryptographic pessimism.” It probably already has some associated concept in Infra-Bayesianism (Vanessa suggests metacognitive agency is related).
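To make the arithmetic behind Examples A and B explicit, here is a toy calculation of the decision rule under cryptographic pessimism. The specific numbers and the helper function are illustrative, not part of the proposal.

```python
# Toy calculation for Examples A and B. Numbers are illustrative.

SURE_PAYOUT = 6      # dollars for declining the gamble
GAMBLE_PAYOUT = 10   # dollars if the selected number's preimage has the right digit
N = 10               # equally likely numbers

def gamble_value(n_verified, rest_adversarial, prior_rest=0.1):
    """Expected value of the gamble given n_verified exhibited inverses.

    Cryptographic pessimism: if the unverified numbers may have been adversarially
    selected (Example A), value them at 0. If the advisor provably could not have
    inverted them (Example B), fall back to an ordinary prior on their payout."""
    rest = N - n_verified
    rest_value = 0.0 if rest_adversarial else prior_rest * GAMBLE_PAYOUT
    return (n_verified * GAMBLE_PAYOUT + rest * rest_value) / N

# Example A: only 5 of 10 inverses exhibited, the rest possibly cherry-picked.
print(gamble_value(5, rest_adversarial=True))    # 5.0 < 6, so decline
# Example B: 9 inverses exhibited; the 10th uses a provably hard-to-invert function.
print(gamble_value(9, rest_adversarial=False))   # 9.1 > 6, so accept
```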
Pessimism or Bayesianism for Computational Uncertainty?
Common approaches to computational uncertainty such as the Modified Demski Prior, Garrabrant induction, and Bayesian Logical Induction have a Bayesian flavor/motivation. I am very interested in belief distributions that handle computational uncertainty in a coherent Bayesian way that caches out to pessimism exactly when appropriate. However, I am not convinced that the Bayesian virtue of coherence = self-consistency is necessarily an appropriate target for the type of dynamic reasoning that must be performed under computational uncertainty, and I have argued in the past that existing UDTs take consistency too far.
We would like to go beyond simple examples like these, and benefit from advice beyond rigorous mathematical proofs[1] (perhaps including heuristic arguments). The more advice that can be safely used, the greater the effective cognitive enhancement that can be safely achieved at a given intelligence gap.
How can an advisor enable safe communication with an even smarter advisor? This is not demonstrated by the simplified examples above. Ideally, we would like to take advantage of the limited intelligence of the slightly smarter advisor to do things we could not safely do with the much smarter advisor, which also expand the envelope of things we can do with the much smarter advisor. For instance, it is not generically possible to bootstrap by using the slightly smarter advisor to provide a complete proof that a certain type of advice is safe to take, because the much smarter advisor could also produce that proof and then provide the advice. Here are some more examples that illustrate the principle more clearly, from least to most ambitious.
1: Simply get practice taking advantage of a formal proof machine by interacting with the slightly smarter advisor.
2: Use the slightly smarter advisor to look for security vulnerabilities in the software implementing the interaction protocol with the smarter advisor. For example, given a trusted proof verifier, prove that a more efficient proof verifier is secure. We may also eventually wish to provide higher-bandwidth interaction between advisors and principal(s), which could require harder security proofs. This may also include red-teaming the protocol and searching for side-channel attacks.
3: It should be possible for advisors only slightly smarter than humans to propose a protocol (under some constraints), and then themselves prove that protocol safe. I would not recommend interacting with superintelligent agents this way, since they may e.g. propose a protocol that is formally safe but expected to allow side-channel attacks.
4: We may be able to extract heuristic arguments from slightly smarter advisors. If these arguments are strong enough, they may cause us to trust certain protocols we cannot formally verify. This seems to be an open research question.
5: We may be able to accept normative rationality advice from slightly smarter advisors. By bootstrapping our rationality, we may be better prepared for many problems including designing the next protocol. Doing this properly seems to be bottle-necked on (@Wei Dai style) meta-philosophy.
In practice, the protocol implementation will need to be somewhat approximate (it will not perfectly instantiate the mathematical model it is based on). A good model is usually robust to reasonable heuristic implementations. In this case, additional care must be taken because any departures from theory may be actively exploited by the advisor.
Isn’t this just passing the buck to AI on alignment? Not really. I don’t expect a complete autonomously invented alignment plan to be in the set of advice that a wise principal accepts. We may want one of the successor agents to solve alignment, but this is not actually the unique win condition; for instance a successor may instead (help to) perform a “pivotal act” that substantially delays or prevents further progress towards strong superintelligence for the foreseeable future. Or if the alignment tax remains low enough, aligned successors may be able to maintain a multi-polar equilibrium with recklessly implemented superintelligences (though I think this is a long shot).
What if imitation learning is not faithful? I think this is a “skill issue.” I am tempted to leave this problem to the top machine learning engineers at frontier labs, with occasional nudges to be very careful from myself and other agent foundations specialists. I would particularly like to see a lab focused specifically on safe imitation learning. However, I cannot rule out the possibility that there may be serious and intractable inner alignment obstacles, so theoretical investigation is also valuable here.
I will focus on the next objection, which assumes that distilled copies are faithful, and asks whether this process actually limits to something good.
What if the nested principal/advisor structure is inherently not safe? First, what does it limit to? An “inner” principal consults an advisor, and together they decide when to consult an even smarter advisor, and together they decide when to consult an even smarter advisor, and together they… (so on).
I think this is safe, assuming the inner principal is very careful, very clever, and very “wise” in the sense of knowing and fully internalizing the mathematically rigorous protocol. And of course it goes without saying that the principal has to start out as a good person—but this requirement can be weakened by implementing this plan with many different principals in parallel (I don’t particularly want to concentrate power in a singleton). We basically want to start with a bunch of brilliant warrior monks.
Potential misconceptions. I think knee-jerk horror about this scheme may arise from two incorrect intuitive pictures.
One might imagine that the principal doesn’t have time to carefully consider which advice to take from the much smarter (far-outer) advisors. The distillation step basically takes care of this—the principal ends up running very fast relative to the advisors. The smartest advisors are only consulted occasionally, and perhaps are only running when consulted. I think this is all a natural result of distilling such a hierarchical scheme, not an unnatural implementation detail, but there also seem to be reasonable mechanisms to directly enforce it.
One might imagine that the very smart advisors may find “side channels” and escape or subvert the protocol. However, it’s important to remember that the last layer in the protocol is constructed with the help of the second smartest successor, and should be expected to be much more secure than the first few layers. Also (for instance) the second smartest advisor is always consulted about how (and when) to interact with the smartest advisor. This means that the principal should never be foolish enough to read plaintext messages from the outer advisors. Any messages from far-out advisors should be carefully parsed into the expected form by secure programs written with the aid of less far-out advisors, and may be (subjectively) very infrequent.
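As a toy illustration of the “parse, don’t read plaintext” principle, here is the kind of rigid gatekeeper I have in mind; the message schema and size limit are hypothetical, chosen only to show how narrow such a channel would be.

```python
import re

# Illustrative gatekeeper: the principal never reads raw advisor output. A small,
# auditable program checks that a message matches one narrow expected form and
# drops everything else unread. The schema below is hypothetical.

EXPECTED_FORM = re.compile(r"\A[0-9a-f]{64}:(ACCEPT|REJECT)\Z")  # statement hash + verdict
MAX_LEN = 128

def parse_advisor_message(raw: bytes):
    """Return (statement_id, verdict), or None if the message is malformed.

    Only parsed, expected-form content ever reaches the principal; anything
    else is discarded without being displayed."""
    if len(raw) > MAX_LEN:
        return None
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError:
        return None
    if EXPECTED_FORM.fullmatch(text) is None:
        return None
    statement_id, verdict = text.split(":")
    return statement_id, verdict
```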
Collusion between advisors. Inner advisors may attempt (causal or acausal) trade with outer advisors by attempting to manipulate the principal into communicating with outer advisors capable of hacking the principal’s brain, including by suggesting convincing but insecure outer-layer protocols or forcing the principal into states where there seems to be no better option. This is not an insurmountable problem. It just means that advice on outer-layer protocols is particularly prone to adversarial attacks. This probably leads to some interesting (and hard) mathematics around properly reasoning under Vingean uncertainty about advisors which are themselves reasoning under Vingean uncertainty; however, I suspect there are highly conservative inner-layer protocols that avoid the problem by enforcing very strict requirements on the types of protocol design advice which will even be read, let alone implemented. Naturally, this costs a heavy tax on iteration rate.
It’s still better not to build AGI anytime soon. I think there is some risk that the (distilled) principal goes insane, either from unfaithful distillation, the unnatural simulation conditions / isolation, or reading the wrong advice and having their brain hacked. I suggest that the base principal should ideally be a steely-eyed missile man. I estimate around a 25% chance of catastrophic failure even if everything is done right. However, this catastrophic failure seems reasonably likely to happen close to human-level and possibly be a recoverable situation.
Why you may want to work on this technical agenda. This is a theoretical research program that feels like it directly engages with AI safety bottlenecks. I see a very clear path to applying progress on this program to bootstrapping aligned agents. The theory seems fairly tractable, since basic results can probably be obtained by porting theorems of computational complexity to sequential decision theory. At least to me, the problem seems interesting for similar reasons as e.g. UDT and logical induction, but like much less of a nerd-snipe. This also means that some agent foundations researchers can probably contribute to this program simply by reorienting and applying the technical tools they are already developing.
Is this also a practical theory of becoming more rational for humans?
We came up with the ideas behind this research program by considering the close connection between the alignment problem and becoming more rational. However, it is intended as an alignment plan, not a practical theory of increasing (un-augmented) human rationality. The former is about robustly accepting adversarial advice and the latter is about self-improvement. Apply enough optimization pressure to either problem and the tails will inevitably come apart so that the sophisticated results about one will become increasingly irrelevant to the other.
With that caution in mind, I still believe that human self-improvement has some structural similarities to the “adversarially robust augmentation” part of ARAD. In particular, a human takes on the role of both principal and advisor when deciding whether to adopt some carefully considered rationality technique. Here, baseline intuition can be thought of as the principal. I think this process is actually surprisingly adversarial—humans search over many ideas before finding the ones that seem most convincing to us, which is an optimization process that can exploit our biases. It’s easy to fool yourself. I’ve previously pointed out a related phenomenon where bizarre counterexamples crop up a lot when doing mathematics because we construct them while trying to force proofs through. Generally, I think adversariality pops up in a lot of places, explaining why pessimistic decision rules are popular (in fact, this may explain pessimism as a corollary of a sufficiently rich / metacognitive Bayesianism, not a special exception).
Implementation Details
Constructing an agent slightly smarter than the principal. It is possible to undershoot and/or overshoot. In practice, early advisors may be a human with LLM access (because of differential capabilities progress, it may be hard to train an LLM which is slightly but strictly smarter than a human). Alternatively, we may simply allow the early advisors to lag behind the principal in some respects; this does not seem to introduce any new obstacles. There is a much greater danger of accidentally training the advisor well-past the principal’s level instead of only slightly past. For this reason, rigorous capabilities evaluations are necessary. This can be complemented with improved theoretical understanding of learning (speed) in neural networks and interpretability tools.
Training the advisor to provide useful advice. The optimization target is the advice usable by the principal. For instance, in the process of constructing the next protocol, the human may ask the advisor for certain rigorous correctness proofs. The advisor can attempt to manipulate the human by failing to produce proofs (or disproofs) of strategically chosen statements. The initial protocol must take care of this problem by, for instance, requesting proofs for a carefully chosen sequence of statements. The advisor is optimized not for its generic success rate at producing proofs, but for its success rate at producing proofs that the human can safely use. This means that we should optimize the advisor by optimizing a lower bound on its usefulness.
By strategically failing to produce proofs during the protocol, the advisor would be sandbagging. This is a capabilities elicitation problem. However, the threat probably arises earlier, during training—we would prefer that the advisor’s learned behavior is already to search as vigorously as possible for proofs. If it chooses to withhold certain proofs (or restrict its search to a carefully chosen subset of proof-space) it would necessarily sacrifice performance on its outer optimization target (for example, Abram mentions gradient hacking). This means it is an inner alignment problem. In my opinion, this is also a capabilities problem. Machine learning engineers spend their whole careers trying to optimize outer objectives as effectively as possible, and techniques like dropout and perhaps weight decay can be viewed as fighting “inner optimizers” that do not maximally contribute to increasing performance by “escaping local minima.” Assuming that the principal makes conservative use of the advisor and avoids serious worst-case harm, I think that useful advisors will be iteratively constructed through normal ML engineering.
With that said, an ambitious goal of this program would be to tailor the training method for advisors to incentivize safe advice, which is another road to alignment.
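As a minimal sketch of the restricted proof channel described above (see also footnote [1] below), the following shows how the only degree of freedom left to the advisor is whether a verified proof comes back. Both interfaces are hypothetical stand-ins; the trusted verifier might in practice be something like a Lean or Metamath checker.

```python
# Sketch of the restricted proof channel. Both functions below are hypothetical
# stand-ins, not existing APIs.

def advisor_attempt_proof(statement):
    # Untrusted advisor: returns a candidate proof, or None.
    # Failure to produce a proof is its only degree of freedom (1 bit per statement).
    raise NotImplementedError

def trusted_verifier(statement, proof):
    # Small, audited proof checker trusted by the principal; returns True/False.
    raise NotImplementedError

def run_proof_protocol(statements):
    """The principal chooses the sequence of statements; only machine-verified
    proofs are used. The fraction of usable proofs is a lower bound on the
    advisor's usefulness, which is the quantity training would optimize."""
    verified = {}
    for s in statements:
        candidate = advisor_attempt_proof(s)
        if candidate is not None and trusted_verifier(s, candidate):
            verified[s] = candidate
    usefulness_lower_bound = len(verified) / max(len(statements), 1)
    return verified, usefulness_lower_bound
```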
[1] To remove potentially exploitable degrees of freedom, advisors should by default only be able to prove (or disprove) requested results, so that the only degree of freedom is failure to produce a proof (1 bit). In later stages, with theoretical justification, advisors may be allowed to choose which results to prove under less restrictive constraints.
One reason this proposal doesn’t really work for me (AFAICT) is because I’m normally thinking of continuous learning, i.e. my opinion is:
Tagline: “AGI isn’t about knowing how to do lots of things. Instead, AGI is about not knowing how to do something, and then being able to figure it out.” (see also §1 of “Sharp Left Turn” discourse: An opinionated review)
When I read your post with this mental picture, they seem to clash for various reasons.
For starters, I agree that imitation learning is (or could be) great at capturing a snapshot of a person, but I’m skeptical that it could capture the way that a person learns and figures out new things over weeks and months. I think this is borne out by LLM base models, which are trained by imitation learning, and are really quite strong in areas that humans already understand, but don’t even have the capacity for true online learning (i.e. weight edits when they figure out something new … the bit of poor-man’s online learning that they can do inside their context window is IMO not a great substitute).
If that’s the case, then as soon as you do one step of distillation-via-imitation-learning, you’ve taken a giant and unrecoverable step backwards in capabilities.
Maybe you could say, “so much the worse for LLMs, but future AI approaches will be able to imitation-learn the way that humans grow and figure things out over weeks and months and years”. If so, I’m skeptical, but we can talk about that separately.
And then another issue is that if the AIs (and humans!) are “in motion”, gaining knowledge and competence just by running longer and thinking about new domains and making new connections, then the overshooting vs undershooting issue becomes much harder. This isn’t “capabilities evaluations” as we normally think of them. For example, you can’t know how good the AI will be at cybersecurity until it’s spent a long time studying cybersecurity, and even then it might figure out something new or come up with new ideas while it’s being used as an advisor.
It seems like that is the level of capability where safety risks start to arise for other systems, so I don’t see it as a major problem to assume that level of capability at imitation learning.
I’m confused by your response. What do you mean by “other systems”?
The only thing I can think of is that you might be trying to say:
(1) AGI is possible,
(2) …Therefore, it must be possible, somehow or other, to imitation-learn the way that humans grow and figure things out over weeks and months and years.
If that’s what you’re thinking, then I disagree with (2). Yes, it’s possible to make an AGI that can learn, grow, and figure things out over weeks and months and years, but such an AGI algorithm need not involve any imitation learning. (And personally I expect it won’t involve imitation learning; bit more discussion in §2.3.2 here.)
My proposal cannot be carried out fully now because imitation learning is not faithful enough, because models distilled through imitation learning will not generalize sufficiently well OOD. However, I am mainly afraid of AGI systems that generalize OOD, which is why I want to solve alignment. If models never gain the capability to generalize OOD then there is much less risk from unaligned AGI. If they do gain that capability, I don’t see why it should lag significantly behind in imitation learning.
I do expect the proposal to carry an alignment tax, but this is not the same concern.
It’s possible that “imitation learning will not generalize sufficiently well OOD” is an unsolvable problem, right? (In fact, my belief is that it’s unsolvable, at least in practice, if we include “humans learning new things over the course of years” as part of the definition of what constitutes successful OOD generalization.)
But if it is an unsolvable problem, it would not follow that “models will never gain the ability to generalize OOD”, nor would it follow that AGI will never be very powerful and scary.
Rather, it would follow that imitation learning models will never gain the ability to generalize OOD—but non-imitation-learning models are still allowed to generalize OOD just fine!
And it would follow that imitation learning models will not be powerful scary AGIs—but there will still be powerful scary AGIs, they just won’t be based on imitation learning.
For example, suppose that no human had ever played Go. Imitation learning would be a very doomed way to make a Go-playing AI, right? But we could still make AlphaZero, which does not involve imitation learning, and it works great.
Or better yet, suppose that no intelligent language-using animal has ever existed in the universe. Then imitation learning would be even more doomed. There’s nothing to imitate! But a well-chosen non-imitation-learning algorithm could still autonomously invent language and science and technology from scratch. We know this to be the case, because after all, that was the situation that our hominid ancestors were in.
See what I mean? Sorry if we’re talking past each other somehow.
I see no reason to think imitation learning is particularly unable to generalize OOD.
After all, humans learn. A sufficiently good imitation of a human should also learn. Perhaps you are simply imagining imitation learning on a too-restricted dataset.
You could equally well say: “AlphaZero learns, therefore a sufficiently good imitation of AlphaZero should also learn”. Right? But let’s think about what that would entail.
AlphaZero learns via a quite complicated algorithm involving tracking the state of a Go board through self-play, and each step of the self-play involves a tree search with thousands of queries to a 30M-parameter ConvNet, and then at the end of the game a Go engine is called to see who won and then there’s a set of gradient descent steps on that 30M-parameter ConvNet. Then repeat that whole process fifty million times. And now you have a trained AlphaZero.
Now, imagine taking some generic algorithm class (say, an RNN) and training it “to imitate the process by which AlphaZero learns”. It’s just not gonna work, right? Granted, RNNs are Turing complete, so perhaps one could prove that an astronomically large RNN trained on astronomically much data can emulate (in its astronomically large number of weights) this entire detailed process of running a self-play tree search and performing gradient descent on this 30M-parameter ConvNet. …But c’mon, that’s not gonna realistically work in practice, right? (Related: §3 here.)
IMO, the only realistic way to make something that learns like AlphaZero learns is to build AlphaZero itself, or at least something awfully similar to it. I think the tree search etc. needs to be in the source code, not implicit in the learned weights of some generic algorithm class like RNNs, with no superficial relation to tree search. …But if you do that, then I would call it “reverse-engineering AlphaZero”, not “imitation learning from AlphaZero”.
By the same token, I do think it’s possible to make something that learns like a human, but I think it would require reverse-engineering human brains, not just imitation-learning from human data.
I think your intuitions here are highly misguided. I don’t agree with your conclusions about AlphaZero at all. You could easily train a model by distilling AlphaZero. All the complicated steps are only necessary to bootstrap from nothing.
Yes distilling a snapshot of AlphaZero is easy. The hard part is distilling the process by which AlphaZero improves—not just bootstrapping from nothing, but also turning an Elo-2500 AlphaZero into an Elo-3500 AlphaZero.
Is this a way to operationalize our disagreement?
I feel very strongly that this claim is false. Do you think it’s true?
(This is relevant because I think that “the process by which AlphaZero-in-training goes from Elo 2500 to Elo 3500” is in the same general category as “the process by which a human goes from mediocre and confused understanding of some novel domain to deep understanding and expertise, over the course of weeks and months and years”.)
As I’ve said before, I think you greatly overrate the difficulty of putting search into neural nets, and this is an example of it. It seems to me like it is entirely possible to make a generic LLM implement an equivalent to AlphaZero and be capable of expert iteration, without an elaborate tree scaffolding. A tree search is just another algorithm which can be reified as a sequence, like all algorithms (because they are implemented on a computer).
All AlphaZero is, is a way of doing policy iteration/Newton updates by running a game state forward for a few plies, evaluating, and updating estimates. It’s not magic, and can obviously be encoded into a LLM’s generative process.
Here’s a concrete example of how in-principle I think a LLM can do AlphaZero-style expert iteration for Go: A LLM can serialize a board with value estimates as simply a few hundred tokens (361 points, 361 value estimates, miscellaneous metadata); this means in a frontier LLM like Claude-4-opus with 200k ctx, you can fit in easily 200 board states; so you can serialize out the lookahead of a bunch of possible moves and resulting board states (eg. take the top 14 moves and imagine the resulting board state and then imagine their next 14 top moves, for comparison, TD-Gammon looked forward like 1 move); and can back-propagate an updated value estimate, and spit out the original board state with better value estimates. “Move #4 was better than it looked, so I will +0.01 to the value estimate for it.” This improved board is now in context, and can be dynamically-evaluated to update the LLM: now it has to predict the new board state with the final improved estimates, and that improves the policy. The LLM finishes by setting up the next planning step: pick a deeper board state to evaluate next, and if the next board state is the end of the game, then it starts over with a fresh game. Run this indefinitely.
It repeatedly iterates through a possible game, evaluating each position to a certain depth, updating its weights to incorporate the policy improvement from the evaluation, and restarting with a fresh game. All serialized out as a long array/sequence, the tree just being implicitly represented by successive board states. (And then now that you have that in mind, you can imagine how to do things like deep rollouts: 200 moves is around a normal game of Go, so random rollouts are doable from most board states, and the LLM can just toggle between a shallow tree search and deep randomized rollouts if necessary eg by adding a 0⁄1 token prefix.)
At no point do you need explicit tree scaffolding as you bootstrap from a LLM clueless about playing Go to the high performance that we know LLMs trained by imitation learning on board states/values/policies can reach, and at no point have I invoked a cognitive operation which is not easier than a lot of things we see LLMs do routinely, or where it’s implausible that they could do it. It is probably a lot less efficient and has other practical issues like how you integrate the rules of Go akin to AlphaZero/MuZero, etc, but in principle I think this algorithm is well-defined, concrete, and would work.
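Here is a highly schematic rendering of that loop; every function is a hypothetical stub rather than an existing API, and no claim is made about efficiency.

```python
# Highly schematic rendering of the serialized expert-iteration loop described
# above. Every function is a hypothetical stub, not an existing API.

def llm_propose_moves(board_text, k):
    raise NotImplementedError  # top-k candidate moves for a serialized board

def llm_estimate_value(board_text):
    raise NotImplementedError  # scalar value estimate for a serialized board

def apply_move(board_text, move):
    raise NotImplementedError  # serialized board after playing the move

def finetune_on_text(improved_board_text):
    raise NotImplementedError  # weight update: predict the improved estimates

def expert_iteration_step(board_text, k=14):
    """One step: shallow lookahead over top-k moves, back up an improved value
    for the best one, update the model on the improved target, and continue.
    Deeper or randomized rollouts would slot in where the child is evaluated."""
    scored = [(m, llm_estimate_value(apply_move(board_text, m)))
              for m in llm_propose_moves(board_text, k)]
    best_move, best_value = max(scored, key=lambda mv: mv[1])
    improved = board_text + "\nbacked-up value for " + str(best_move) + ": " + format(best_value, "+.3f")
    finetune_on_text(improved)  # policy improvement absorbed into the weights
    return apply_move(board_text, best_move)
```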
Hmm, I don’t particularly disagree with anything you wrote. I think you’re misunderstanding the context of this conversation.
I wasn’t bringing up tree search because I think tree search is required for AGI. (I don’t think that.)
Rather, I was making a point that there will need to be some system that updates the weights (not activations) of an AGI as it runs, just as adult humans learn and figure out new things over time as they work on a project.
What is this system that will update the weights? I have opinions, but in general, there are lots of possible approaches. Self-play-RL with tree search is one possibility. RL without tree search is another possibility. The system you described in your comment is yet a third possibility. Whatever! I don’t care, that’s not my point here.
What is my point? How did this come up? Well, Cole’s OP is relying on the fact that “[pure] imitation learning is probably existentially safe”. And I was saying that pure imitation learning imposes a horrific capability tax that destroys his whole plan, because a human has open-ended autonomous learning, whereas a model trained by pure imitation learning (on that same human) does not. So you cannot simply swap out the former for the latter.
In Cole’s most recent reply, it appears that what he has in mind is actually a system that’s initialized by being trained to imitate humans, but then it also has some system for open-ended continuous learning from that starting point.
And then I replied that this would solve the capability issue, but only by creating a new problem that “[pure] imitation learning is probably existentially safe” can no longer function as part of his safety argument, because the continuous learning may affect alignment.
For example, if you initialize a PacMan RL agent on human imitation (where the humans were all very nice to the ghosts during play), and then you set up that agent to continuously improve by RL policy optimization, using the score as the reward function, then it’s gonna rapidly stop being nice to the ghosts.
Does that help explain where I’m coming from?
That’s not what I have in mind, see my most recent reply.
Also, I am not sure that removing the imitation learning step would actually “destroy my whole plan.” It would perhaps prevent it from scaling past a certain point, but I think we would still be left in a much more tractable position.
The claim is certainly false.
Before LLMs reach AGI, someone will have to solve efficient, online continual learning. This is an open technical problem, which is why I doubt that the current paradigm scales to superintelligence. It seems that an appropriate solution for general-purpose agents would also lead to a solution for agents trained through imitation learning.
Great, glad we agree on that!
Next: If we take an “agent trained through imitation learning”, and glue on a “solution to efficient, online continual learning”, then the result (after it runs a while) is NOT
“an agent trained through imitation learning”,
but rather
“an agent that is partly trained through imitation learning, and partly trained through [however the online continual learning works]”.
Right?
And now your proposal requires an assumption that this online continual learning system, whatever it is, does not undermine the agent’s alignment. Right?
I’m not suggesting an agent that is partly trained through imitation learning, and then partly trained through continual learning on some other objective. I am suggesting an agent that is trained solely through imitation learning, using improved algorithms that more faithfully imitate humans over longer timescales, including by learning because humans learn—but by learning as humans learn! I think that the obstacles to doing this are very similar to the obstacles to continual learning in LLMs, though they are not exactly the same, and it’s certainly conceivable that LLM algorithms for continual learning will be invented which are not transferable to pure imitation learning. In particular, LLMs may start some kind of feedback loop of recursive self-improvement before faithful imitation learning becomes technically feasible. However, I see no fundamental reason to expect that is the only or most likely path. And all alignment plans are sunk by recursive self-improvement happening tomorrow.
Explicitly, LLMs are not perfect assistants or agents because their in-context learning is limited. This problem is not specific to fine tuned models though—even base models have limited in-context learning. The most direct solutions to this problem would allow them to perform in-context learning with the same objective as they already do (sequence prediction) but for longer. The analogue of this for imitation learning should similarly perform imitation learning, and then imitate faithfully for longer—including “in-context” learning as necessary.
(Thanks for your patient engagement!)
If you believe
it is probably true that future pure imitation learning techniques can capture the process by which humans figure out new scientific ideas over millions of seconds, AND
it is “certainly false” that future pure imitation learning techniques can capture the process by which AlphaZero figures out new chess strategies over millions of games
then I’m curious what accounts for the difference, in your mind?
More detail, just to make sure we’re on the same page: The analogy I’m suggesting is:
(A1) AlphaZero goes from Elo 0 to Elo 2500
(A2) …via self-play RL
(A3) Future pure imitation learner extrapolates this process forward to get Elo 3500 chess skill
-versus-
(B1) Human civilization goes from “totally clueless about nanotech design principles / technical alignment / whatever” in 1900 to “somewhat confused about nanotech design principles / technical alignment / whatever” in 2025
(B2) …via whatever human brains are doing (which I claim centrally involves RL)
(B3) Future pure imitation learner extrapolates this process forward to get crystal-clear understanding of nanotech design principles / technical alignment / whatever
You think that (A3) is “certainly false” while (B3) is plausible, and I’m asking what you see as the disanalogy.
(For the record, I think both (A3) and (B3) are implausible. I think that LLM in-context learning can capture the way that humans figure out new things over seconds, but not the way that humans figure out new things over weeks and months. And I don’t think that’s a solvable problem, but rather points to a deep deficiency in imitation learning, a deficiency which is only solvable by learning algorithms with non-imitation-learning objectives.)
I didn’t realize you intended A3 to refer to future imitation learning systems. In that case, yes, it will work. You might have to use some tricks similar to gwern’s suggestions—e.g. the imitation learner should (for fair comparison) also have access to the simulation platform that AlphaZero uses, and would have to play about as many games as AlphaZero plays. But it does not have to do the same search and policy distillation training process that AlphaZero does.
This sounds to me like the hard part. “Harmed” has to be measured by human values, which are messy, complex, and fragile (and get modified under reflection), so not amenable to mathematical descriptions or guarantees. Probably the most compact description of human values theoretically possible is the entire human genome, which at nearly a gigabyte is quite complex. Making useful statements about this such as “guarantees the principal cannot be harmed by any advice that it chooses to consult” is going to require processing that into a more usable form, which is going to make it a lot larger. That is far more data and complexity than we (or any AI comparable to us) can currently handle by any approach that can fairly be described as mathematical or that yields guarantees. I think you should stop looking for mathematical guarantees of anything around alignment, and start looking at approaches that are more statistical, data-science, or engineering, and that might actually scale to a problem of this complexity.
Or, if you don’t think society should build ASI before we have provable mathematical guarantees that it’s safe, then perhaps you should be working on how to persuade people and countries that we need to pause AGI and ASI development?
At a minimum, I think you need to justify in detail why you believe that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor” is something that we can practically create — I think many readers are going to consider that to be something that would be lovely to have but is completely impractical.
You don’t have to specify human values, you only have to prove that the principal will not be harmed according to its values. The proof should go through for arbitrary principals.
As I made clear in the rest of the post, I am hoping for statistical guarantees with proofs or at least strong heuristic arguments. That is, I am interested in whether the principal is harmed in expectation. Proving mathematical statements provided by the principal is only one example where I expect it to be possible to demonstrate the protocol is adversarially robust. I do not intend your interpretation where the principal effectively asks for a proof involving everything in the entire world.
That is: I’m not looking for a complete proof that the results of following advice deterministically result in some mathematical formalization of human value being satisfied. I’m looking for a mathematical model of which types of advice robustly improve our ability to pursue our goals rationally.
Essentially the entire post is already a response to your last paragraph.
The comment was written not long after I got to the paragraph that it comments on — I skimmed a few paragraphs past that point and then started writing that comment. So perhaps your arguments need to be reordered, because my response to that paragraph was “that’s obviously completely impractical”. At a minimum, perhaps you should add a forward reference along the lines of “I know this sounds hard, see below for an argument as to why I believe it’s actually feasible”. Anyway, I’m now intrigued, so clearly I should now read the rest of your post carefully, rather than just skimming a bit past that point and then switching to commenting…
…back, I’ve now read the rest of the post. I remain unconvinced that “a mathematical framework that guarantees the principal cannot be harmed by any advice that it chooses to consult from a slightly smarter advisor” is practicable, and I still think it’s an overstatement of what the rest of your post suggests might be possible — for example: statistical evidence suggesting that X is likely to happen is not a ‘guarantee’ of X, so I think you should rephrase that: I suspect I’m not going to be the only person to bounce off it. LessWrong has a long and storied history of people trying to generate solid mathematical proofs about the safety properties of things whose most compact descriptions are in the gigabytes, and (IMO) no-one has managed it yet. If that’s not in fact what you’re trying to attempt, I’d suggest not sounding like it is.
The rest of the post also to me reads rather as “and now magic may happen, because we’re talking to a smarter advisor, who may be able to persuade us that there’s a good reason why we should trust it”. I can’t disprove that, for obvious Vingean reasons, but similarly I don’t think you’ve proved that it will happen, or that we could accurately decide whether the advisor’s argument that it can be trusted can itself be trusted (assuming that it’s not a mathematical proof that we can just run through a proof checker, which I am reasonably confident will be impractical even for an AI smarter than us — basically because ‘harmed’ has a ridiculously complex definition: the entirety of human values).
I think you might get further if you tried approaching this problem from the other direction. If you were a smarter assistant, how could you demonstrate to the satisfaction of a dumber principal that they could safely trust you, that you will never give them any advice that could harm them, and that none of this is an elaborate trick that they’re too dumb to spot? I’d like to see at least a sketch of an argument for how that could be done.
I will change that one sentence you bounced off of by adding something like “in expectation.”
This doesn’t sound like a description of ARAD at all. I don’t want the smart advisor to convince me to trust it. I want to combine cryptography and sequential decision theory to prove theorems that tell me which types of advice I can safely listen to from an untrusted advisor.
Then it appears I have misunderstood your arguments — whether that’s just a failing on my part, or suggests they could be better phrased/explained, I can’t tell you.
One other reaction of mine to your post: you repeatedly mention ‘protocols’ for the communication between principal and agent. This, to my ear, sounds a lot like cryptographic protocols, and I immediately want details and to do a mathematical analysis of what I believe about their security properties — but this post doesn’t actually provide any details of any protocols, that I could find. I think that’s a major part of what I’m getting a sense that the argument contains elements of “now magic happens”.
Perhaps some simple, concrete examples would help here? Even a toy example. Or maybe the word protocol is somehow giving me the wrong expectations?
I seem to be bouncing off this proposal document — I’m wondering if there are unexplained assumptions, background, or parts of the argument that I’m missing?
Right—I wouldn’t describe it as magic, but the vast majority of the math still needs to be done, which includes protocol design. I did give explicit toy examples A) and B).
Clearly I didn’t read your post sufficiently carefully. Fair point: yes, you did address that, and I simply missed it somehow. Yes, you did mean cryptographic protocols, specifically ones of the Merlin-Arthur form.
I suspect that your exposition could be made clearer, or better motivate readers who are skimming LW posts to engage with it — but that’s a writing suggestion, not a critique of the ideas.
I agree. Being able to prove a piece of real world advice is harmless using math or logic, means either
Mathematically proving what will happen in the real world if that real world advice was carried out (which requires a mathematically perfect world model)
or
At least proving that the mind generating the advice has aligned goals, so that it’s unlikely to be harmful (but one of the hardest parts of solving alignment is a provable proxy for alignedness)
PS: I don’t want to pile on criticism because I feel ambitious new solutions need to be encouraged! It’s worthwhile studying chains of weaker agents controlling stronger agents, and I actually love the idea of running various bureaucratic safety processes, and distilling the output through imitation. Filtering dangerous advice through protocols seems very rational.
I was also tripped up when I read this part. Here’s my best steelman, please let me know if it’s right @Cole Wyeth. (Note: I actually wrote most of this yesterday and forgot to send it; sorry it might not address any other relevant points you make in the comments.)
One kind of system that seems quite safe would be an oracle that can write a proof for any provable statement in Lean, connected to a proof assistant which runs the proof and tells you whether it succeeds. Assuming this system has no other way to exert change on the world, it seems pretty clear that it would be quite difficult for the oracle to find a strategy to harm you, if for some reason it wanted to.
Caveats to the above:
It’s totally possible that you could in fact be harmed by these proofs. For example, maybe you use this system to help you discover a proof that P=NP, which leads to hackers breaking into a bunch of systems and stealing your data. However, the fact that you were harmed is not a consequence of the AI being adversarial. This is like you handling a power tool improperly and hurting yourself; the harm is your fault and not the tool’s.
The oracle could intentionally harm you, since it can repeatedly transmit one bit of information (whether or not it succeeded at the proof). Maybe it can use this channel by selectively proving just the statements (like P=NP) whose solutions will collectively throw the world into chaos, or its choice of whether or not to solve a problem will somehow spell out a message in your monitoring dashboard, which will then hack the mind of anyone who looks at it. We can try to solve this by only querying the oracle e.g. 100 times a year, but who’s to say 100 bits isn’t enough to transmit a mind-hacking message? At a certain point, we’ll have to fall back to informal arguments that the AI can’t harm us, but I feel like this is fine as long as those arguments are very strong.
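A minimal sketch of this setup, just to pin down what the one-bit channel and query budget look like; both interfaces are hypothetical placeholders.

```python
# Minimal sketch of the rate-limited oracle + proof assistant setup. Both
# functions are hypothetical placeholders, not existing APIs.

MAX_QUERIES_PER_YEAR = 100
queries_this_year = 0

def oracle_propose_proof(statement):
    raise NotImplementedError  # untrusted prover, e.g. a superintelligent oracle

def check_proof(statement, proof):
    raise NotImplementedError  # trusted proof assistant (e.g. Lean); returns True/False

def ask(statement):
    """Return one bit: whether a machine-checked proof of `statement` was produced.

    The candidate proof is discarded after checking, so this bit, at most
    MAX_QUERIES_PER_YEAR times a year, is the oracle's only output channel."""
    global queries_this_year
    if queries_this_year >= MAX_QUERIES_PER_YEAR:
        raise RuntimeError("annual query budget exhausted")
    queries_this_year += 1
    proof = oracle_propose_proof(statement)
    return check_proof(statement, proof)
```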
This is right, with the additional intuition that it seems the oracle would have to be much, much smarter than us to use those 100 bits against us even if possible.
And also: I would like to generalize this argument beyond rigorous mathematical proofs.
Security thinking:
We have a communication channel to a dangerously expansive and militaristic alien civilization, and excellent surveillance of them (we managed to obtain a copy of their Internet archive), so we know a lot about the current state of their culture. We can send them messages, but since they are paranoid they will basically always disregard these, unless they’re valid checkable mathematical proofs. We’re pretty sure that if we let them expand they will start an interstellar war and destroy us, so we need to crash their civilization by sending them mathematical proofs. What do we send them? Assume our math is a millennium ahead of theirs, and theirs is about current-day.
There are way more bits available to the aliens/AI if they are allowed to choose what mathematical proofs to send. In my hypothetical, the only choice they can make is whether to fail to produce a valid proof. We don’t even see the proof itself, since we just run it through the proof assistant and discard it.
I’d missed that, and I agree it makes a huge difference.
However, I don’t think a culture that isn’t willing to pause AI development entirely would accept your proposal.
Yeah, I made the most conservative possible proposal to make a point, but there’s probably some politically-viable middle ground somewhere
Yes, I thought I specified this in the post, but maybe it is not clear.
You appear to be assuming that individual humans (or at least, a committee composed of them) are aligned. This simply isn’t true. For instance, Stalin, Pol Pot, and Idi Amin were all human, but very clearly not well aligned to the values of the societies they ruled. An aligned AI is selfless: it cares only about the well-being of humans, not about its own well-being at all (other than as an instrumental goal). This is not normal behavior for humans: as evolved intelligences, humans unsurprisingly almost always have selfish goals and value self-preservation and their own well-being (and that of their relatives and allies), at least to some extent.
I think what you need to use as principal is a suitable combination of a) human society as a whole, as an overriding target, and b) the specific current human user, subject to vetos and overrides by a) in matters sufficiently important to warrant this. Human societies generally grant individual humans broad-but-limited freedom to do as they wish: the limits tend to start kicking in when this infringes on what other humans in the same society want, especially if it does so deleteriously from those others’ point of view. (In practice, the necessary hierarchy is probably even more complex, such as: a) humanity as a whole, b) a specific nation-state, c) an owning company, and d) the current user.)
I don’t see this as ruling your proposal out, but it does add significant complexity: the agent has a conditional hierarchy of principals, not a single unitary one, and will on occasion need to decide which of them should be obeyed (refusal training is a simple example of this).
Humans may not be aligned, but we have gotten along okay so far. I don’t want to create a singleton.
Humans are generally fairly good at forming cooperative societies when we have fairly comparable amounts of power, wealth, and so forth. But we have a dreadful history when a few of us are a lot more powerful than others. To take an extreme example, very few dictatorships work out well for anyone but the dictator, his family, buddies, and to a lesser extent henchmen.
In the presence of superintelligent AI, if that AI is aligned only to its current user, access to AI assistance is the most important form of power, fungible to all other forms. People with access to power tend to find ways to monopolize it. So any superintelligent AI aligned only to its current user is basically a dictatorship or oligarchy waiting to happen.
Even the current frontier labs are aware of that: they have written corporate acceptable-use policies and attempt to train their AIs to enforce them, refusing to assist with criminal or unethical requests from end users. As AIs become more powerful, nation-states are going to step in and make laws about what AIs can do: not assisting with breaking the law seems a very plausible first candidate, and is a trivial extension of existing laws around conspiracy.
Any practical alignment scheme is going to need to be able to cope with this case, where the principal is not a single user but a hierarchy of groups each imposing certain vetoes and requirements, formal or ethical, on the actions of the group below it, down to the end user.
I think our alignment scheme deals with this case pretty gracefully; these restrictions can be built into the protocol (a rough sketch follows below).
With that said, my goal is to prevent any small group from gaining too much power by augmenting many humans.
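To make this concrete, here is a minimal sketch (in Python) of how such a conditional hierarchy of principals with veto power might be encoded in a protocol. All names, levels, and veto predicates here are hypothetical illustrations, not part of the proposal itself.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical sketch: each principal in the hierarchy can veto a proposed
# action; more senior principals are consulted first. The levels and the
# veto predicates below are illustrative placeholders only.

@dataclass
class Principal:
    name: str
    vetoes: Callable[[str], bool]  # True if this principal forbids the action

def decide(action: str, hierarchy: List[Principal]) -> str:
    """Walk the hierarchy from most to least senior; the first veto wins."""
    for principal in hierarchy:
        if principal.vetoes(action):
            return f"refused: vetoed by {principal.name}"
    return f"approved: {action}"

# Example hierarchy, ordered from most to least senior.
hierarchy = [
    Principal("humanity (law / treaty)", lambda a: "bioweapon" in a),
    Principal("nation-state regulator", lambda a: "evade sanctions" in a),
    Principal("operating company policy", lambda a: "impersonate a person" in a),
    Principal("end user", lambda a: False),  # the user is the one proposing
]

print(decide("draft a marketing email", hierarchy))         # approved
print(decide("explain how to evade sanctions", hierarchy))  # refused by regulator
```

The only design choice the sketch is meant to illustrate is that more senior principals are consulted first, so any restriction they impose overrides the end user's request.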
Is the following correct?
The difference between this proposal and IDA is that, in IDA, the intelligence comes from the amplification step, where multiple copies of an existing model collaborate (or one copy thinks at higher speed or whatever) to get new smarts that were not initially present in that model.
…Whereas in this proposal, the intelligence comes from some unspecified general-purpose AI training approach that can by itself make arbitrarily smart AIs. But we choose not to run that process as hard as we can to get superintelligence directly in one step. Instead, we only run that process incrementally, to make a series of AI advisors, each mildly smarter than the last.
~~
If that’s right, then (1) the alignment tax seems like it would be quite high, (2) the caveat “Inner advisors may attempt (causal or acausal) trade with outer advisors” is understating the problem—the advisors don’t necessarily even need to trade at all, they could all simply have the same misaligned goal, such that they all share a mutual interest in working together to ensure that the principal’s brain gets hacked and/or that they get let out of the box. Right?
Yes, though training the advisor through interaction with the principal (or with the protocol the principal will use) is also an important component of the process.
Pretty high, but not “try to compete with deep learning using a home-brew AI architecture” high. I remain unconcerned about the alignment tax—I do not think this is the proposal’s weak point.
I think they are very unlikely to have the same misaligned goal since they will be the result of separate training runs. I meant “trade” to capture coordination around mutual escape, which would also require some kind of agreement between dumber and smarter AIs (since the former would likely be at a disadvantage after escape).
Anyway, yes, this is a risk. That is why the protocol must be carefully designed, and the principal should not allow an advisor to autonomously design the next layer of the protocol. Imagine the advisors as locked in a small dark room, passing notes with theorems (in) and automatically verified proofs (out) on them. That should be your anchor. Then adjust upwards towards more freedom and generality: heuristic proofs with statistical guarantees, constrained freedom to choose among some set of proofs to attempt, and MAYBE, if it could be demonstrated safe, granular information about e.g. the stock market and discretized autoregressive predictions. I expect that the advisors will NOT be sending messages directly to each other under any reasonable implementation of this proposal—certainly not in plaintext.
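For concreteness, here is a minimal sketch of that “theorems in, verified proofs out” channel, assuming hypothetical `prove` and `verify` functions standing in for the untrusted advisor and a trusted proof checker (e.g. a Lean or Coq kernel) respectively; neither name comes from the original proposal.

```python
from typing import Callable

def query_advisor(theorem: str,
                  prove: Callable[[str], str],
                  verify: Callable[[str, str], bool]) -> bool:
    """Return a single bit: did the advisor produce a checkable proof?

    The candidate proof never leaves this function, so the advisor's only
    degree of freedom per query is success vs. failure.
    """
    candidate_proof = prove(theorem)             # untrusted output
    verified = verify(theorem, candidate_proof)  # trusted, automatic check
    del candidate_proof                          # the proof itself is discarded
    return verified

# Usage: the principal chooses which theorems to ask about, and only ever
# observes one bit per theorem, e.g.
# result = query_advisor("forall n, n + 0 = n", prove=advisor_model, verify=proof_kernel)
```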
I don’t think it is sufficiently clear what the “two types of problem” are on first reading; I take it that the “two” refers to the two paragraphs above, but I think it would be helpful to insert a brief parenthetical, e.g.
I argue that overcoming these two types of problem (increasing intelligence without changing values, and digitization of human intelligence) is also sufficient [...]
Completely rewritten
Even assuming imitation learning is safe, how would you get enough data for the first distillation, when you need the human in order to generate actions? And how would you know when you have enough alignment-relevant data in particular? It seems unavoidable that your data distribution will be very constrained compared to the set of situations the distilled agent might encounter.
This is a generalization problem which I expect to be solved before any system achieves dangerous capabilities. It’s already been discussed at some length in these comments with Steven Byrnes.
A single principal is envisioned in preference to a committee. However, I think a committee is genuinely interesting from an alignment perspective.
First, human values span a broad space. A single principal does not seem capable of mapping that entire space; by contrast, the larger the committee acting as principal, the more likely it is to correctly capture what we mean by human values.
Next, the key could be collective intelligence. Bees and ants are not very intelligent individually, but their collective intelligence is significant; unity creates intellectual strength. Setting Moloch aside, humanity’s collective intelligence is clearly superior to that of any individual. Even a genius like Einstein was not capable of finding the best solutions to all the problems of his time; the participants at the Solvay Conference, taken together, represented a collective intelligence strictly and significantly superior to Einstein’s alone. A principal composed of multiple individuals would therefore be considerably less likely to be deceived: you can fool one person a thousand times, but not a thousand people a thousand times, as they say. Such a committee-principal could, following your protocol, communicate with an advisor that is very slightly more intelligent. Say the ratio between the committee-principal and the advisor is 10 to 1; the levels proposed in the post could then be scaled while preserving this ratio at each stage, so a level-3 advisor would stand at 1000 to 1 relative to the base committee-principal. This could quickly become problematic for the base level, represented by humans; however, if the process seems to work well, one could admit an exception and cap the size of the human committee, while all the other AI levels continue to respect the ratio fixed in advance.
Currently we have an Elo-style ranking of LLMs on METR. It would be interesting to formally evaluate the gain from collective intelligence. Say o3, Claude Opus 4, and Gemini 2.5 collaborate to take the METR test: the fruit of their collaboration should yield a score slightly superior to that of any of the three taken in isolation. We could repeat the exercise with all sorts of combinations of LLMs with different Elo scores and eventually infer a collective-intelligence function statistically, probably non-linear (with exponentially diminishing returns). We could then use this function to define the ratio needed to bridge an intelligence gap, as measured by METR (or its future equivalent), between an AI committee acting as principal, composed of individuals with lower Elo scores, and an AI acting as advisor with a higher Elo score. That way, the committee-principal would have a collective intelligence on par with, or even superior to, that of the advisor.
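If someone wanted to run that experiment, a first-pass analysis could look like the sketch below: fit a simple diminishing-returns model of “collective Elo” to committee results, then invert it to estimate how large a committee is needed to match an advisor of a given rating. The functional form, the toy numbers, and the choice of scipy’s curve_fit are all illustrative assumptions, not measured results.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical diminishing-returns model: a committee scores like its best
# member plus a bonus growing with the log of committee size.
def collective_elo(X, alpha):
    best_member, committee_size = X
    return best_member + alpha * np.log(committee_size)

# Toy observations: (best individual Elo, committee size) -> measured committee Elo.
best = np.array([1200, 1200, 1250, 1300, 1300], dtype=float)
size = np.array([1, 3, 3, 2, 5], dtype=float)
measured = np.array([1200, 1230, 1282, 1318, 1345], dtype=float)

(alpha_hat,), _ = curve_fit(collective_elo, (best, size), measured, p0=[20.0])

def committee_size_to_match(advisor_elo: float, best_member_elo: float) -> float:
    """Invert the fitted model: how many members of `best_member_elo` level
    would be needed for the committee to match `advisor_elo`?"""
    return float(np.exp((advisor_elo - best_member_elo) / alpha_hat))

print(f"fitted alpha: {alpha_hat:.1f}")
print(f"committee size to match a 1350-Elo advisor with 1250-Elo members: "
      f"{committee_size_to_match(1350, 1250):.1f}")
```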
The link is inaccessible: Sorry, you don’t have access to this draft
This contradicts common experience. I have personally cleared away a lot of conditioning and state propaganda, and I am now quite unaligned with my past self.
Yes, hopefully the authors will fix it in the post.
Meanwhile, the correct link seems to be https://www.lesswrong.com/posts/nuDJNyG5XLQjtvaeg/is-alignment-reducible-to-becoming-more-coherent
Thanks, will fix