Among people I’ve had significant online discussions with, your writings on alignment tend to be the hardest to understand and easiest to misunderstand.
Additionally, I think there are ways to misunderstand the IDA approach that leave out significant parts of its complexity (e.g. IDA based on humans thinking for a day with unrestricted input, without first doing the hard work of trying to understand corrigibility and meta-philosophy), but that can seem like plausible candidates for “solving the AI alignment problem” if one hasn’t understood the more subtle problems that would occur. It’s then easy to miss those problems and feel optimistic about IDA working while underestimating the amount of hard philosophical work that needs to be done, or to incorrectly attack the approach for missing the problems completely.
(I think that these simpler versions of IDA might be worth thinking about as a plausible fallback plan if no other alignment approach is ready in time, but only if they are restricted to accomplishing specific tasks that stabilise the world, restricted in how far the amplification is taken, replaced with something better as soon as possible, etc. I also think that working on simple versions of IDA can help make progress on issues that would be required for fully scalable IDA, e.g. the experiments that Ought is running.)
Paul, to what degree do you think your approach will scale indefinitely while maintaining corrigibility vs. just thinking that it will scale while maintaining corrigibility to the point where we “get our house in order”? (I feel like this would help me in understanding the importance of particular objections, though objections relevant to both scenarios are probably still relevant).
So I also don’t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.
My model of Paul’s approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (e.g. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully in advance about how the subagents are supposed to act (in a way that could potentially be run just by using a lookup table).
From my current understanding of Paul’s IDA approach, I think there are two different senses in which corrigibility can be thought about in regards to IDA, both with different levels of guarantee.
From An unaligned benchmark
1. On average, the reward function incentivizes behaviour which competes effectively and gives the user effective control.
2. There do not exist inputs on which the policy chooses an action because it is bad, or the value function outputs a high reward because the prior behaviour was bad. (Or else the policy on its own will generate bad consequences.)
3. The reward function never gives a behaviour a higher reward because it is bad. (Or else the test-time optimisation by MCTS can generate bad behaviour.) For example, if the AI deludes the human operator so that the operator can’t interfere with the AI’s behaviour, that behaviour can’t receive a higher reward even if it ultimately allows the AI to make more money.
Property 1 is dealing with “consequence corrigibility” (competence at producing actions that will produce outcomes in the world we would describe as corrigible).
Properties 2&3 are dealing with corrigibility in terms of “intent corrigibility” (guaranteeing that the system does not optimise for bad outcomes). This does not cover the agent incompetently causing bad actions in the world, only the agent deliberately trying to produce bad outcomes.
I think IDA doesn’t require or claim worst-case guarantees on the task of “consequence corrigibility” (and that this is an impossible goal for bounded reasoners).
I think that average-case good performance on “consequence corrigibility” is claimed by IDA, but only as a subset of general competence.
I think that providing worst-case guarantees on “intent corrigibility” is required and claimed by IDA.
Roughly, I think that:
Versions of IDA that allow the overseer nodes more information could be generally competent (including predicting what behaviour could be corrigible), but could fail to be “intent corrigible”
Versions of IDA that allow the overseer nodes only a highly restricted set of queries could be “intent corrigible” but fail to be generally competent, and hence not be “consequence corrigible”
Standard ML approaches will, at some level of optimisation power, fail to behave “intent corrigibly” (even if you train them to be “consequence corrigible”)
The question I’m uncertain about is whether there’s a middle point in tradeoff space where both properties are sufficiently satisfied to produce good outcomes.
Do you agree or disagree with how I’ve broken down corrigibility claims for IDA, and which claims do you think your argument bears on?
I would see the benefits of humans vs. algorithms being that giving a human a bunch of natural language instructions would be much easier (but harder to verify) than writing down a formal algorithm. Also, the training could just cover how to avoid taking incorrigible actions, and the Overseer could still use their judgement of how to perform competently within the space of corrigible outputs.
Paul, it might be helpful to clarify what your approach relies upon in terms of bounds on the amount of overhead (training time, human sample complexity), or what amount of overhead would doom your approach. If I recall correctly, you’ve wanted the approach to have some reasonable constant overhead relative to an unaligned system, though I can’t find the post at the moment. It might also be helpful to have bounds, or at least your guesses, on the magnitude of numbers related to individual components (e.g. the rough numbers in the Universality and Security amplification post).
Open Question: Working with concepts that the human can’t understand
Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H’s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?
Paul: I don’t have any general answer to this, seems like we should probably choose some example cases. I’m probably going to be advocating something like “Search over a bunch of possible concepts and find one that does what you want / has the desired properties.”
E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like “What would I infer from learning that a proof is `elegant` other than that it will work” and make sure that you are OK with that.
Andreas: Suppose you don’t have the concepts of “proof” and “inquiry”, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I’m trying to see in more detail that you can do a good job at “making sure you’re OK with reasoning in ways X” in cases where X is far removed from H’s concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)
This may be related to the more general question of what sorts of instructions you’d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.
I think the way to do exponential search in amplification without being exponentially slow is to not try to do the search in one amplification step, but start with smaller problems, learn how to solve those efficiently, then use that knowledge to speed up the search in later iteration-amplification rounds.
Suppose we have some problem with branching factor 2 (e.g. searching for binary strings that fit some criterion).
Start with agent A0.
Amplify agent A0 to solve problems which require searching a tree of depth d0 at cost 2^d0.
Distill agent A1, which uses the output of the amplification process to learn how to solve problems of depth d0 faster than the amplified A0, ideally as fast as any other ML approach. One way would be to learn heuristics for which parts of the tree don’t contain useful information, and can be pruned.
Amplify agent A1, which can use the heuristics it has learned to prune the tree much earlier and solve problems of depth d1 > d0 at cost < 2^d1.
Distill agent A2, which can now efficiently solve problems of depth d1
If this process is efficient enough, the training cost can be less than 2^d1 to get an agent that solves problems of depth d1 (and the runtime cost is as good as the runtime cost of the ML algorithm that implements the distilled agent).
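A toy sketch of the scheme above, under stated assumptions: the search criterion (“no two adjacent 1s, ending in 1”) is invented for illustration, and the “distillation” step is just extracting pruning statistics from small solved instances, standing in for training an ML model.

```python
import itertools

def predicate(bits):
    # Toy search criterion: no two adjacent 1s, and the string ends in a 1.
    s = "".join(map(str, bits))
    return "11" not in s and s.endswith("1")

def brute_force(d):
    # Amplified A0: search the full depth-d tree, worst-case cost 2^d.
    nodes = 0
    for bits in itertools.product((0, 1), repeat=d):
        nodes += 1
        if predicate(bits):
            return list(bits), nodes
    return None, nodes

def learn_bad_windows(d0):
    # "Distillation" stand-in: from exhaustively solved depth-d0 problems,
    # record which adjacent-bit windows never occur in any solution.
    solutions = [b for b in itertools.product((0, 1), repeat=d0) if predicate(b)]
    seen = {sol[i:i + 2] for sol in solutions for i in range(d0 - 1)}
    return set(itertools.product((0, 1), repeat=2)) - seen

def pruned_search(d, bad_windows):
    # Amplified A1: depth-first search at depth d1 > d0 that prunes any
    # branch containing a window known (from smaller problems) to be useless.
    nodes = 0
    def dfs(prefix):
        nonlocal nodes
        nodes += 1
        if len(prefix) >= 2 and tuple(prefix[-2:]) in bad_windows:
            return None  # prune using the heuristic learned at smaller depth
        if len(prefix) == d:
            return list(prefix) if predicate(prefix) else None
        for b in (0, 1):
            result = dfs(prefix + [b])
            if result is not None:
                return result
        return None
    return dfs([]), nodes
```

Here `pruned_search(12, learn_bad_windows(4))` visits far fewer than 2^12 nodes, because the windows learned at depth 4 prune almost the whole tree at depth 12.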
It seems brittle. If there’s miscommunication at any level of the hierarchy, you run the risk of breakage. Fatal miscommunications could happen as information travels either up or down the hierarchy.
It seems to me that the amplification scheme could include redundant processing/error correction: e.g. ask subordinates to solve a problem in several different ways, then check whether they disagree, and either take a majority vote or flag disagreements as a sign that something dangerous is going on. This could deal with this sort of problem.
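A minimal sketch of this redundancy idea, assuming subagents are plain callables and using a hypothetical agreement threshold; real disagreement handling in amplification would be considerably more involved.

```python
from collections import Counter

def redundant_query(subagents, question, threshold=1.0):
    # Ask several subagents to solve the same problem independently.
    answers = [agent(question) for agent in subagents]
    best, votes = Counter(answers).most_common(1)[0]
    if votes / len(answers) < threshold:
        # Disagreement may indicate miscommunication (or sabotage)
        # somewhere in the hierarchy; flag it rather than silently
        # trusting the majority.
        return best, "flag_disagreement"
    return best, "ok"
```

With `threshold=1.0` any disagreement is flagged; lowering it to 0.5 recovers plain majority voting.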
The framework does not appear to have a clear provision for adapting its value learning to the presence/absence of decisive strategic advantage. The ideal FAI will slow down and spend a lot of time asking us what we want once decisive strategic advantage has been achieved. With your thing, it appears as though this would require an awkward retraining process.
It seems to me that balancing the risks of acting vs. taking time to ask questions, depending on the current situation, falls under Paul’s notion of corrigibility, so it would happen appropriately (as long as you maintain the possibility of asking questions as an output of the system, and the input appropriately describes the state of the world relevant to evaluating whether you have a decisive strategic advantage).
I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we’d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I’m calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won’t, or enough understanding of its internals that you can see why it won’t. This is discussed in “Techniques for Optimizing Worst-Case Performance.”
Paul, I’m curious whether you’d see it as necessary for these techniques to work that the optimization target is pretty good/safe (though not perfect): i.e. some safety comes from the fact that agents optimized for approval or imitation only have a limited class of Y’s that they might also end up being optimized for.
Open Question: Severity of “Honest Mistakes”
In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator deliberately searches for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.
Paul: I also want to make the further claim that such failures are much less concerning than what-I’m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?
I think you’d need to form the decomposition in such a way that you could fix any problem by perturbing something in the world representation (an extreme version: the method for performing every operation is contained in the world representation and looked up, so you can adjust it in the future).
When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time?
One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward), which I wouldn’t call exponential, as I think the relevant baseline is the number of nodes in the meta-execution forward tree. You only need to process each node once (when the backprop signal for its output is ready), and you need to do a constant amount of work at each node (figure out all the ways to perturb the node’s input).
The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.
And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first-order gradient). If this causes problems, you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this it might be a bit more difficult: if you start trying out different combinations to see whether they work, that’s where you’d get exponential complexity. But we could counter this by having cases where, based on understanding the operation, we can intelligently avoid some branches; I think this could wash out to linear complexity in the number of forward nodes if it all works well.
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this.
So do I :)
Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.
A human debugging a translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the functions it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
I think instead of asking “is this return value sensible”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery “how can I modify the answer to make the query answer more correct”, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).
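As a toy illustration of that recursion (not Paul’s actual proposal), consider a meta-execution tree whose internal nodes simply sum their subqueries’ answers. For such nodes, “how can I modify the subquery’s answer to make the query answer more correct” has a closed-form answer (split the error evenly); a real overseer would apply judgement at each node instead.

```python
class Node:
    # A toy meta-execution node: leaves hold answers, internal
    # nodes combine their subqueries' answers by summing them.
    def __init__(self, children=None, value=0.0):
        self.children = children or []
        self.value = value

def forward(node):
    # Forward pass: compute and cache every node's answer.
    if not node.children:
        node.out = node.value
    else:
        node.out = sum(forward(c) for c in node.children)
    return node.out

def assign_credit(node, target):
    # Backward pass: given what this node's answer should have been,
    # derive a corrected target for each subquery and recurse; one
    # constant-cost visit per node, as in backprop.
    if not node.children:
        node.value = target
        return
    error = target - node.out
    # For a sum node, "adjust all the inputs" has an exact answer:
    # distribute the error evenly among the subqueries.
    for c in node.children:
        assign_credit(c, c.out + error / len(node.children))
```

Starting from the known correct final answer at the root, one forward pass plus one backward pass fixes every leaf, analogous to a single gradient step.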
I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
Brainstorming approaches to working with causal Goodhart
Low-impact measures that include the change in the causal structure of the world. It might be possible to form a measure like this which doesn’t depend on recovering the true causal structure at any point (e.g. minimizing the difference between predictions of causal structure in states A and B, even if both of those predictions are wrong).
Figure out how to elicit human models of causal structure, and provide the human model of causal structure along with the metric, and the AI uses this information to figure out whether it’s violating the assumptions that the human made
Causal transparency: have the AI explain the causal structure of how its plans will influence the proxy. This might allow a human to figure out whether the plan will cause the proxy to diverge from the goal. E.g. the true goal is happiness, the proxy is a happiness score as measured by an online psychological questionnaire, and the AI’s plan says that it will influence the proxy by hacking into the online questionnaire. You don’t need to understand how the AI plans to hack into the server to understand that the plan is diverging the proxy from the goal.