The better we can solve the key questions (“what are these ‘wiser’ versions?”, “how is the whole setup designed?”, “what questions exactly is it trying to answer?”), the better the wiser ourselves will be at their tasks.
I feel like this statement suggests that we might not be doomed if we make a bunch of progress, but not full progress, on these questions. I agree with that assessment, but on reading the post it felt like the post was making the claim “Unless we fully specify a correct theory of human values, we are doomed”.
I think that I’d view something like Paul’s indirect normativity approach as requiring that we do enough thinking in advance to get some critical set of considerations known by the participating humans, but once that’s in place we should be able to go from this core set to get the rest of the considerations. And it seems possible that we can do this without a fully-solved theory of human value (but any theoretical progress in advance we can make on defining human value is quite useful).
My interpretation of what you’re saying here is that the overseer in step #1 can do a lot of things to bake in having the AI interpret “help the user get what they really want” in ways that get the AI to try to eliminate human safety problems for the step #2 user (possibly entirely), but problems might still occur in the short term before the AI is able to think/act to remove those safety problems.
It seems to me that this implies that IDA essentially solves the AI alignment portion of points 1 and 2 in the original post (modulo things happening before the AI is in control).
Correcting all problems in the subsequent amplification stage would be a nice property to have, but I think IDA can still work even if it corrects errors with multiple A/D steps in between (as long as all catastrophic errors are caught before deployment). For example, I could think of the agent initially using some rules for how to solve math problems where distillation introduces some mistake, but later in the IDA process the agent learns how to rederive those rules and realizes the mistake.
Shorter name candidates:
Inductively Aligned AI Development
Inductively Aligned AI Assistants
It’s a nice property of this model that it prompts consideration of the interaction between humans and AIs at every step (to highlight things like risks of the humans having access to some set of AI systems for manipulation or moral hazard reasons).
In the higher dimensional belief/reward space, do you think that it would be possible to significantly narrow down the space of possibilities (so this argument is saying “be Bayesian with respect to reward/beliefs, picking policies that work over a distribution”), or are you more pessimistic than that, thinking that the uncertainty would be so great in higher dimensional spaces that it would not be possible to pick a good policy?
Open Question: Working with concepts that the human can’t understand
Question: when we need to assemble complex concepts by learning/interacting with the environment, rather than using H’s concepts directly, and when those concepts influence reasoning in subtle/abstract ways, how do we retain corrigibility/alignment?
Paul: I don’t have any general answer to this, seems like we should probably choose some example cases. I’m probably going to be advocating something like “Search over a bunch of possible concepts and find one that does what you want / has the desired properties.”
E.g. for elegant proofs, you want a heuristic that gives successful lines of inquiry higher scores. You can explore a bunch of concepts that do that, evaluate each one according to how well it discriminates good from bad lines of inquiry, and also evaluate other stuff like “What would I infer from learning that a proof is `elegant` other than that it will work” and make sure that you are OK with that.
Andreas: Suppose you don’t have the concepts of “proof” and “inquiry”, but learned them (or some more sophisticated analogs) using the sort of procedure you outlined below. I guess I’m trying to see in more detail that you can do a good job at “making sure you’re OK with reasoning in ways X” in cases where X is far removed from H’s concepts. (Unfortunately, it seems to be difficult to make progress on this by discussing particular examples, since examples are necessarily about concepts we know pretty well.)
This may be related to the more general question of what sorts of instructions you’d give H to ensure that if they follow the instructions, the overall process remains corrigible/aligned.
Open Question: Severity of “Honest Mistakes”
In the discussion about creative problem solving, Paul said that he was concerned about problems arising when the solution generator was deliberately searching for a solution with harmful side effects. Other failures could occur where the solution generator finds a solution with harmful side effects without “deliberately searching” for it. The question is how bad these “honest mistakes” would end up being.
Paul: I also want to make the further claim that such failures are much less concerning than what-I’m-calling-alignment failures, which is a possible disagreement we could dig into (I think Wei Dai disagrees or is very unsure).
I would solve X-and-only-X in two steps:
First, given an agent and an action which has been optimized for undesirable consequence Y, we’d like to be able to tell that the action has this undesirable side effect. I think we can do this by having a smarter agent act as an overseer, and giving the smarter agent suitable insight into the cognition of the weaker agent (e.g. by sharing weights between the weak agent and an explanation-generating agent). This is what I’m calling informed oversight.
Second, given an agent, identify situations in which it is especially likely to produce bad outcomes, or proofs that it won’t, or enough understanding of its internals that you can see why it won’t. This is discussed in “Techniques for Optimizing Worst-Case Performance.”
Paul, I’m curious whether you’d see it as necessary, for these techniques to work, that the optimization target is pretty good/safe (but not perfect): i.e. some safety comes from the fact that agents optimized for approval or imitation only have a limited class of Y’s that they might also end up being optimized for.
So I also don’t see how Paul expects the putative alignment of the little agents to pass through this mysterious aggregation form of understanding, into alignment of the system that understands Hessian-free optimization.
My model of Paul’s approach sees the alignment of the subagents as just telling you that no subagent is trying to actively sabotage your system (i.e. by optimizing to find the worst possible answer to give you), and that the alignment comes from having thought carefully in advance about how the subagents are supposed to act (in a way that could potentially be run just by using a lookup table).
Glad to see this work on possible structure for representing human values which can include disagreement between values and structured biases.
I had some half-formed ideas vaguely related to this, which I think map onto an alternative way to resolve self reference.
Rather than just having one level of values that can refer to other values on the same level (which potentially leads to a self-reference cycle), you could instead explicitly represent each level of value, with level 0 values referring to concrete reward functions, level 1 values endorsing or negatively endorsing level 0 values, and generally level n values only endorsing or negatively endorsing level n-1 values. This might mean that you have some kinds of values that end up being duplicated between multiple levels. For any n, there’s a unique solution to the level of endorsement for every concrete value. We can then consider the limit as n -> infinity as the true level of endorsement. This allows for situations where the limit fails to converge (i.e. it alternates between different values at odd and even levels), which seems like a way to handle self-reference contradictions (possibly also the all-or-nothing problem if it results from a conflict between meta-levels).
I think this maps into the case where we don’t distinguish between value levels if we define a function that just adjusts the endorsement of each value by the values that directly refer to it. Then iterating this function n times gives the equivalent of having an n-level meta-hierarchy.
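A minimal sketch of the iterated version, to make the odd/even alternation concrete. The update rule here (base weight plus the weighted endorsement of the values that refer to it) is my own assumption for illustration, as are the names `adjust`, `iterate`, `base`, and `refs`; other adjustment functions would fit the description equally well.

```python
def adjust(endorsement, base, refs):
    """Apply the adjustment once.

    refs[v] is a list of (src, weight) pairs: value `src` refers to value `v`
    and endorses it (weight > 0) or negatively endorses it (weight < 0).
    The additive rule below is an assumed, illustrative choice.
    """
    return {
        v: base[v] + sum(w * endorsement[src] for src, w in refs.get(v, []))
        for v in base
    }


def iterate(base, refs, n):
    """Return the endorsement levels after 0, 1, ..., n applications."""
    endorsement = dict(base)
    history = [endorsement]
    for _ in range(n):
        endorsement = adjust(endorsement, base, refs)
        history.append(endorsement)
    return history


base = {"a": 1.0, "b": 1.0}

# Two values that fully negatively endorse each other: no limit exists,
# and the endorsement alternates between odd and even levels.
strong = {"a": [("b", -1.0)], "b": [("a", -1.0)]}
oscillating = iterate(base, strong, 4)

# Weaker mutual negative endorsement: the iteration settles to a limit.
weak = {"a": [("b", -0.5)], "b": [("a", -0.5)]}
converging = iterate(base, weak, 20)
```

In the first case the endorsement of `a` flips between 1 and 0 at even and odd levels, while in the second it settles near 2/3, so in this toy version whether the limit exists depends on how strongly the values refer to each other.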
I think there might be interesting work in mapping this strategy into some simple value problem, and then trying to perform bayesian value learning in that setting with some reasonable prior over values/value endorsements.
Ah, right. I guess I was balking at moving from exorbitant to exp(exorbitant). Maybe it’s better to think of this as reducing the size of fully worked initial overseer example problems that can be produced for training/increasing the number of amplification rounds that are needed.
So my argument is more an example of what a distilled overseer could learn as an efficient approximation.
I guess what we’re trying to unpack is “the mechanism that makes decisions from that database”, and whether it can be efficient. If you store all experience with no editing, you’re left with the same problem of “what decisions in the tree do you change based on experience (X,y) and how do you change them?” And it seems like the most naive way of doing this is too inefficient (looking at all data at all decision nodes).
If you summarize experience (what I’m interpreting “decide how to update some small sketch” as meaning), then the question is can you produce that summary in an efficient way, and efficiently use it at runtime?
The gradient-descent style solution is meant to gesture at how this could be done in a way analogous to how we use parametric models rather than non-parametric models in ML for decently sized datasets. In advance, bake in a set of parameters/information that could influence each decision, then do updates by having the overseer look for ways to change the decision parameters. But when we do this in an oversight model, we can make the decisions/parameters interpretable sentences.
Rough example of how I see the overall process going:
1. Amplified overseer acts in an environment with a breakable vase, but the overseer does not know that the vase is breakable
2. First time it acts, some overseer node N looks into the database for any information related to the vase, finds nothing, and proceeds with its decision
3. The vase breaks, and the negative feedback is given to the amplified overseer
4. Amplified overseer runs some process to figure out that it could have made a better decision if the database contained the knowledge that vases are breakable, and so updates the database. (The thing that Wei Dai and I are trying to unpack is whether this step can be run in a reasonable amount of time, given that each node of the overseer has only a limited understanding of the original decision process.)
5. Second time it acts, the amplified overseer process is the same, up until the node N that does the database lookup is run. This time, the lookup returns that the vase is breakable, so the overseer runs a different decision path and treats the vase with more care.
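The five steps above can be sketched as a toy program. Everything here is hypothetical and only meant to pin down the control flow: `Database`, `node_n`, `environment`, and `update_from_feedback` are names I made up, and step 4 (the hard part) is simply hard-coded rather than solved.

```python
class Database:
    """Shared store of facts that overseer nodes can look up."""

    def __init__(self):
        self.facts = {}

    def lookup(self, topic):
        return self.facts.get(topic, [])

    def add(self, topic, fact):
        self.facts.setdefault(topic, []).append(fact)


def node_n(db, obj):
    """Hypothetical node N: choose how to handle obj given what the database says."""
    if "breakable" in db.lookup(obj):
        return "handle carefully"
    return "handle normally"


def environment(obj, action):
    """Toy environment: the vase breaks (reward -1) unless handled carefully."""
    if obj == "vase" and action != "handle carefully":
        return -1
    return 0


def update_from_feedback(db, obj, reward):
    """Stand-in for step 4. The real question is how the amplified overseer
    works out which fact to add; here that answer is assumed, not derived."""
    if reward < 0:
        db.add(obj, "breakable")


db = Database()
first_action = node_n(db, "vase")                 # step 2: lookup returns nothing
first_reward = environment("vase", first_action)  # step 3: the vase breaks
update_from_feedback(db, "vase", first_reward)    # step 4: database is updated
second_action = node_n(db, "vase")                # step 5: lookup now changes the path
second_reward = environment("vase", second_action)
```

Note that node N itself never changes between the two runs; only the database contents do, which is the sense in which the decision parameters stay inspectable.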
What if the current node is responsible for the error instead of one of the subqueries, how do you figure that out?
I think you’d need to form the decomposition in such a way that you could fix any problem through perturbing something in the world representation (an extreme version is you have the method for performing every operation contained in the world representation and looked up, so you can adjust it in the future).
When you do backprop, you propagate the error signal through all the nodes, not just through a single path that is “most responsible” for the error, right? If you did this with meta-execution, wouldn’t it take an exponential amount of time?
One step of this method, as in backprop, has the same time complexity as the forward pass (running meta-execution forward, which I wouldn’t call exponential complexity, as I think the relevant baseline is the number of nodes in the meta-execution forward tree). You only need to process each node once (when the backprop signal for its output is ready), and need to do a constant amount of work at each node (figure out all the ways to perturb the node’s input).
The catch is that, as with backprop, maybe you need to run multiple steps to get it to actually work.
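To illustrate the cost claim, here is a rough sketch of a single backward pass over a forward tree. The `Node` class and the `perturb` callback are assumptions of mine (the callback stands in for whatever the overseer does at a node to decide how each input should change); the only point is that each node is handled once, so one step costs about as much as the forward pass.

```python
class Node:
    """One node of a (hypothetical) meta-execution forward tree."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)


def forward_count(root):
    """Number of nodes a forward pass over the tree would run."""
    return 1 + sum(forward_count(child) for child in root.children)


def backward(root, signal, perturb):
    """Push a correction signal from the root down to every node.

    perturb(node, signal) must return one signal per child of `node`.
    Each node is popped exactly once, so the number of visits equals
    the number of nodes in the forward tree.
    """
    visits = 0
    stack = [(root, signal)]
    while stack:
        node, sig = stack.pop()
        visits += 1
        child_signals = perturb(node, sig)
        stack.extend(zip(node.children, child_signals))
    return visits


def adjust_all_inputs(node, sig):
    """Simplest possible rule: pass the same signal on to every input."""
    return [sig for _ in node.children]


leaves = [Node(f"leaf{i}") for i in range(4)]
left = Node("left", leaves[:2])
right = Node("right", leaves[2:])
root = Node("root", [left, right])

forward_nodes = forward_count(root)
backward_visits = backward(root, "answer was wrong", adjust_all_inputs)
```

Here both passes touch the same seven nodes. Any exponential blow-up would have to come from what happens inside `perturb`, for example trying out combinations of input changes, not from the traversal itself.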
And what about nodes that are purely symbolic, where there are multiple ways the subnodes (or the current node) could have caused the error, so you couldn’t use the right answer for the current node to figure out what the right answer is from each subnode? (Can you in general structure the task tree to avoid this?)
The default backprop answer to this is to shrug and adjust all of the inputs (which is what you get from taking the first order gradient). If this causes problems, then you can fix them in the next gradient step. That seems to work in practice for backprop in continuous models. For discrete models like this, it might be a bit more difficult: if you start to try out different combinations to see if they work, that’s where you’d get exponential complexity. But we’d get to counter this by potentially having cases where, based on understanding the operation, we could intelligently avoid some branches. I think this could potentially wash out to linear complexity in the number of forward nodes if it all works well.
I wonder if we’re on the right track at all, or if Paul has an entirely different idea about this.
So do I :)
Huh, I hadn’t thought of this as trying to be a direct analogue of gradient descent, but now that I think about your comment that seems like an interesting way to approach it.
A human debugging a piece of translation software could look at the return value of some high-level function and ask “is this return value sensible” using their own linguistic intuition, and then if the answer is “no”, trace the execution of that function and ask the same question about each of the functions it calls. This kind of debugging does not seem available to meta-execution trying to debug itself, so I just don’t see any way this kind of learning / error correction could work.
I think instead of asking “is this return value sensible”, the debugging overseer process could start with some computation node where it knows what the return value should be (the final answer), and look at each of the subqueries of that node and ask for each subquery “how can I modify the answer to make the query answer more correct”, then recurse into the subquery. This seems pretty analogous to gradient descent, with the potential advantage that the overseer’s understanding of the function at each node could be better than naively taking the gradient (understanding the operation could yield something that takes into account higher-order terms in the operation).
I’m curious now whether you could run a more efficient version of gradient descent if you replace the gradient at each step with an overseer human who can harness some intuition to try to do better than the gradient.
What if the field of linguistics as a whole is wrong about some concept or technique, and as a result all of the humans are wrong about that? It doesn’t seem like using different random seeds would help, and there may not be another approach that can be taken that avoids that concept/technique.
Yeah, I don’t think simple randomness would recover from this level of failure (only that it would help with some kinds of errors, where we can sample from a distribution that doesn’t make that error sometimes). I don’t know if anything could recover from this error in the middle of a computation without reinventing the entire field of linguistics from scratch, which might be too much to ask. However, I think it could be possible to recover from this error if you get feedback about the final output being wrong.
But in IDA, H is fixed and there’s no obvious way to figure out which parts of a large task decomposition tree were responsible for the badly translated sentence and therefore need to be changed for next time.
I think that the IDA task decomposition tree could be created in such a way that you can reasonably trace back which part was responsible for the misunderstanding and needs to be changed. The structure you’d need for this is that, given a query, you can figure out which of its children would need to be corrected to get the correct result. So if you have a specific word to correct, you can find the subagent that generated that word, then look at its inputs, see which input is incorrect, trace where that came from, etc. This might need to be deliberately engineered into the task decomposition (in the same way that differently written programs accomplishing the same task could be easier or harder to debug).
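One way to picture the kind of structure I mean: each step records, for every piece of its output, which subquery and which piece of that subquery’s output it came from. This is only a sketch under that assumption; `Step`, `sources`, and `trace` are invented names, and a real decomposition would need the provenance links to be produced by the subagents themselves.

```python
class Step:
    """One query in a (hypothetical) task decomposition tree."""

    def __init__(self, name, output, sources=None):
        self.name = name
        self.output = output          # key -> value produced by this step
        self.sources = sources or {}  # key -> (child Step, key in the child's output)


def trace(step, key):
    """Follow the provenance of output[key] down to the step that originated it."""
    path = [step.name]
    while key in step.sources:
        step, key = step.sources[key]
        path.append(step.name)
    return path


# Toy translation run in which the third word came out wrong.
sense = Step("pick word sense", {"sense": "river bank"})
word = Step(
    "translate word 3",
    {"translation": "Ufer"},
    sources={"translation": (sense, "sense")},
)
sentence = Step(
    "translate sentence",
    {"word3": "Ufer"},
    sources={"word3": (word, "translation")},
)

responsible_path = trace(sentence, "word3")
```

Given feedback that the third word is wrong, the path leads from the sentence-level query through the word-level query to the sense-selection step, which is the place to look for the input that needs correcting.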
Ah, misunderstood that, thanks.
Say w2a is the world where the agent starts in w2 and w2b is the world that results after the agent moves from w1 to w2.
Without considering the agent’s memory part of the world, it seems like the problem is worse: the only way to distinguish between w2a and w2b is the agent’s memory of past events, so it seems that leaving the agent’s memory of the past out of the utility function requires U(w2a) = U(w2b).