2) Calling the issues between the agents that stem from model differences “terminology issues” could also work well; it’s a little like people talking past each other.
I really like this point. I think it’s parallel to the human issue where different models of the world can lead to misinterpretation of the “same” goal. So “terminology issues” would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature using a wall thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets “temperature” as referring to a different fact than the base optimizer does. On the other hand, when both parties use the same metric, this class of mistakes is ruled out; what remains is not what we’re now calling terminology issues.
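To make the thermometer case concrete, here is a toy sketch (all names and numbers are mine, purely illustrative): both optimizers pursue “temperature = 20,” but ground the word in different sensors, so the mesa-optimizer can declare success while the base objective fails.

```python
def wall_thermometer(state):   # what the base optimizer means by "temperature"
    return state["wall_temp"]

def floor_thermometer(state):  # what the mesa-optimizer means by the same word
    return state["floor_temp"]

# An illustrative state in which the two sensors disagree.
state = {"wall_temp": 17.0, "floor_temp": 20.0}
target = 20.0

print(abs(floor_thermometer(state) - target) < 0.5)  # True: mesa-optimizer says "done"
print(abs(wall_thermometer(state) - target) < 0.5)   # False: the base goal is unmet
```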
I think this also points to a fundamental epistemological issue, one even broader than goal representation. It’s possible that two models disagree on representation but agree on all object-level claims (think of using different coordinate systems). Because terminology issues can cause mistakes, I’d suggest that agents with non-shared world models can only reliably communicate via object-level claims.
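As a small illustration of the coordinate-system point (a sketch with points I chose arbitrarily): the two representations share no coordinates, yet agree on an object-level claim like the distance between two points, which is the kind of claim such agents can still communicate.

```python
import math

def cartesian_to_polar(x, y):
    """Re-describe the same point in a different 'terminology'."""
    return (math.hypot(x, y), math.atan2(y, x))

p, origin = (3.0, 4.0), (0.0, 0.0)

print(p)                       # (3.0, 4.0): Cartesian representation
print(cartesian_to_polar(*p))  # (5.0, 0.927...): polar; no shared coordinates
print(math.dist(p, origin))    # 5.0: the object-level claim both models agree on
```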
The implication for AI alignment might be that we either need AI to fundamentally model the world the same way humans do, or need to communicate with it only via object-level goals and constraints.
It seems like there’s a difference between the two cases. If I write a program to take the CRT (Cognitive Reflection Test), and then we both take it and both get the same imperfect score because the program solves the questions the same way I do, that doesn’t sound like misalignment.
The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That’s why I’m calling it a principal alignment failure—even though it’s the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.
Separate from my other comment, I want to question your assumption that we must worry about an AI takeoff that produces systems exponentially better than humans at everything, so that even a very slight misalignment would be disastrous. That seems possible, per Eliezer’s Rocket Example, but it is far from certain.
It seems likely that instead there are fundamental limits on intelligence (for a given architecture, at least), and while it is unlikely that those limits coincidentally sit at or near human intelligence, it seems plausible that the first superhuman AI system still plateaus somewhere far short of infinite optimization power. If so, we only need to mitigate well, instead of perfectly aligning the AI to our goals.
I’ll start by noting that I am in the strange (for me) position of arguing that someone is too concerned about over-optimization failures, rather than trying to convince someone who is dismissive. But that said, I do think that the concern here, while real, is mitigable in a variety of ways.
First, there is the possibility of reducing optimization pressure. One key contribution here is Jessica Taylor’s Quantilizers paper, which you note; it shows a way to build systems that optimize but are far less subject to Goodhart’s curse, and I think you are too dismissive of it. Similarly, you are dismissive of optimizing the target directly. I think the epistemological issues you point to can be mitigated to the point that they won’t cause misalignment between reality and an AI’s representation of that reality. Once that is done, the remaining issue is aligning “true” goals with the measured goals, which is still hard, but certainly not fundamentally impossible in the same way. (A toy quantilizer is sketched below.)
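A minimal sketch of the quantilizer idea, assuming a finite action set and a uniform base distribution (the function names and numbers are mine):

```python
import random

def quantilize(actions, utility_estimate, q=0.1):
    """A q-quantilizer over a finite action set with a uniform base
    distribution: rank actions by estimated utility, then sample uniformly
    from the top q fraction instead of taking the argmax. This caps how
    hard the proxy is optimized, which is how it resists Goodhart's curse."""
    ranked = sorted(actions, key=utility_estimate, reverse=True)
    cutoff = max(1, int(q * len(ranked)))  # size of the top-q slice
    return random.choice(ranked[:cutoff])

# Example: a proxy utility that over-values extreme actions.
actions = list(range(100))
proxy = lambda a: a  # argmax would always pick the most extreme action, 99
print(quantilize(actions, proxy, q=0.1))  # any of the top 10, not always 99
```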
Second, you note that you don’t think we will solve alignment. I agree, because I think that “alignment” presupposes a single coherent ideal. If human preferences are diverse, as it seems they are, we may find that alignment in that sense is impossible. This, however, allows a very different approach: a system that optimizes only when it finds Pareto improvements across a set of sub-alignment metrics or goals, which constrains the possibility of runaway optimization (a sketch follows below). Even if alignment is possible, it seems likely that we can specify a set of diverse goals / metrics that are all aligned with some human goals, so that the system will be limited in its ability to be misaligned.
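A minimal sketch of that Pareto-constrained optimizer, with metric functions and names that are my own illustrative assumptions:

```python
def is_pareto_improvement(action, status_quo, metrics):
    """An action is allowed only if it is at least as good as the status quo
    on every metric and strictly better on at least one."""
    scores = [m(action) for m in metrics]
    baseline = [m(status_quo) for m in metrics]
    no_worse = all(s >= b for s, b in zip(scores, baseline))
    strictly_better = any(s > b for s, b in zip(scores, baseline))
    return no_worse and strictly_better

def constrained_step(candidates, status_quo, metrics):
    """Act only on Pareto-improving candidates; otherwise stay put.
    The tiebreak among allowed actions (sum of metrics) is arbitrary."""
    allowed = [a for a in candidates
               if is_pareto_improvement(a, status_quo, metrics)]
    if not allowed:
        return status_quo
    return max(allowed, key=lambda a: sum(m(a) for m in metrics))
```

The point of the design is the refusal branch: when no candidate improves every metric at once, the system does nothing, so runaway optimization of any single proxy is blocked.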
Lastly, there is optimization for a safe and very limited goal. If the goal is limited and specific, and we find a way to minimize side-effects, this seems like it could be fairly safe. For example, Oracle AIs are an attempt to severely limit the goal. More broadly, however, we might be able to build constraints that work, so that a system can reliably perform limited tasks (“put a strawberry on a plate without producing any catastrophic side-effects”).
This is very interesting—I hadn’t thought about utility aggregation for a single agent before, but it seems clearly important now that it has been pointed out.
I’m thinking about this in the context of both the human brain as an amalgamation of sub-agents, and organizations as an amalgamation of individuals. Note that we can treat organizations as rationally maximizing some utility function in the same way we can treat individuals as doing so. But I think that for many or most voting or decision structures, we should be able to rule out the claim that they are following any weighted combination of the agents’ normalized utilities, under any intertheoretic comparison (a brute-force version of this test is sketched below). This seems like a useful result if we can prove it. (Alternatively, it may be that certain decision rules map to specific intertheoretic comparison rules, which would be even more interesting.)
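Here is the kind of test I have in mind, as a brute-force sketch that is only practical for a handful of agents and options (the setup and all names are mine): search weight vectors on the simplex and ask whether any fixed weighting of normalized member utilities rationalizes every observed group choice.

```python
import itertools
import numpy as np

def rationalizable_by_weights(member_utils, group_choices, grid=21):
    """member_utils[i] maps option -> agent i's normalized (0-1) utility;
    group_choices is a list of (option_list, chosen_option) observations.
    Returns True iff some fixed weight vector makes every observed choice
    a weighted-sum argmax."""
    n = len(member_utils)
    steps = np.linspace(0, 1, grid)
    for w in itertools.product(steps, repeat=n):
        if not np.isclose(sum(w), 1.0):
            continue  # only weights on the simplex
        score = lambda o: sum(wi * u[o] for wi, u in zip(w, member_utils))
        if all(max(opts, key=score) == chosen for opts, chosen in group_choices):
            return True
    return False

# A committee that flip-flops between two options cannot be rationalized
# by any fixed weighting of its two members' utilities.
member_utils = [{"A": 1.0, "B": 0.0}, {"A": 0.0, "B": 1.0}]
choices = [(["A", "B"], "A"), (["A", "B"], "B")]
print(rationalizable_by_weights(member_utils, choices))  # False
```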
1) It’s neither noise nor rapid increase; it’s delayed feedback. Control theorists in engineering have a really clear, basic result here: delayed feedback is really, really bad in various ways. There are entire books on how to do it well (https://books.google.ch/books?id=Cy_wCAAAQBAJ&pg=PR9&lpg=PR9), but doing it without these more complex techniques is bad. (A toy simulation follows after point 2.)
2) You either hire a control theorist, or (more practically) you avoid the current feedback mechanism, and instead get people on the phone to talk about and understand what everyone needs, as opposed to relying on their delayed feedback in the form of numeric orders.
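To illustrate point 1, here is a toy simulation (the gain, delay, and starting error are arbitrary numbers of mine): the same proportional controller that smoothly removes an error with immediate feedback oscillates and diverges once its observations are a few steps stale.

```python
def simulate(gain=0.8, delay=0, steps=30):
    """Drive an error toward zero using feedback that is `delay` steps old."""
    history = [10.0]  # initial error
    for t in range(steps):
        observed = history[max(0, t - delay)]       # the controller sees stale data
        history.append(history[-1] - gain * observed)
    return history[-1]

print(simulate(delay=0))  # ~0: immediate feedback converges smoothly
print(simulate(delay=4))  # large and oscillating: delayed feedback destabilizes
```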
I think I agree that Eliezer’s definition is theoretically correct, but I also agree that I need to understand this better.
Thanks for the feedback. I agree that in a control system, any divergence between intent and outcome is an alignment issue, and that this makes overoptimization different in control versus selection. Despite the conceptual confusion, I definitely think the connections are worth noting: not only “wireheading,” but also the issues with mesa-optimizers. And causal failures seem particularly important in this context.
But I strongly endorse your point about how weak and fuzzy this is, which is a large part of why I wanted to try to de-confuse myself. That’s the goal of this mini-sequence, and I hope that doing so publicly in this way at least highlights where the confusion is, even if I can’t successfully de-confuse myself, much less others. And if there are places where others are materially less confused than me and/or you, I’d love for them to write responses or their own explainers on this.
A couple points.
First, the reason I wasn’t happy with entropy as a metric is that it doesn’t allow (straightforward) comparison of different types of optimization, as I discussed. The entropy of an output probability distribution isn’t comparable to the entropy over states that Eliezer defines, for example. (A toy version of the mismatch is below.)
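A toy illustration of that incomparability, using numbers I made up (the state-based measure follows, as I understand it, Eliezer’s “Measuring Optimization Power” idea of counting how rare it is to do at least this well):

```python
import math

# State-based measure: how rare is landing in a state at least this good?
outcomes = sorted([1, 3, 4, 7, 8, 9, 12, 20])   # utilities of reachable states
achieved = 12
frac_at_least = sum(o >= achieved for o in outcomes) / len(outcomes)
op_bits = -math.log2(frac_at_least)             # 2.0: "hit the top quarter"

# Output-distribution measure: Shannon entropy of a stochastic policy's output.
dist = [0.5, 0.25, 0.125, 0.125]
shannon_bits = -sum(p * math.log2(p) for p in dist)  # 1.75

# Both are denominated in "bits," but they measure different objects,
# so the numbers can't be straightforwardly compared.
print(op_bits, shannon_bits)
```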
Second, I’m not sure false positive and false negative are the right conceptual tools here. I can easily show examples of each: gradient descent can fail horribly in many ways, and luck with specific starting parameters on specific distributions can lead to unreasonably rapid convergence. But in both cases, the success or failure is a relationship between the algorithm and the space being optimized (see the toy example below).
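A toy example of that relationship (the function and step size are arbitrary choices of mine): the identical algorithm succeeds or gets stuck depending only on where it starts in the landscape.

```python
def grad_descent(x, lr=0.05, steps=200):
    """Minimize f(x) = x**4 - 3*x**2 + x, which has two basins."""
    for _ in range(steps):
        x -= lr * (4 * x**3 - 6 * x + 1)  # gradient of f
    return x

print(grad_descent(-1.0))  # ~-1.30: lucky start, finds the global minimum
print(grad_descent(1.0))   # ~1.13: same algorithm, stuck in the poor local minimum
```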
As always, constructive criticism on whether I’m still confused, or whether the points I’m making are clear, is welcome!
Yes, and this is a step in the right direction, but as you noted in the writeup, it only applies in a case where we’ve assumed away a number of key problems—among the most critical of which seem to be:
We have an assumed notion of optimality, and I think an implicit assumption that the optimal point is unique, which seems to be needed to define reward—Abram Demski has noted in another post that this is very problematic.
We also need to know a significant amount about both/all agents, and compute expectations in order to design any of their reward functions. That means future agents joining the system could break our agent’s indifference. (As an aside, I’m unclear how we can be sure it is possible to compute rewards in a stable way if their optimal policy can change based on the reward we’re computing.) If we can compute another agent’s reward function when designing our agent, however, we can plausibly hijack that agent.
We also need a reward depending on an expectation of actions, which means we need counterfactuals not only over scenarios, but over the way the other agent reasons. That’s a critical issue I’m still trying to wrap my head around, because it’s unclear to me how a system can reason in those cases.
The way the agents interact across interruptions seems to exactly parallel the case where we design each agent for correct behavior separately, and the agents can nevertheless corrupt the overall design by hijacking one another. You say we need to design for mutual indifference, but if we have a solution that fixes the way they exploit interruption, it should also go quite a ways toward solving the generalized issues with Goodhart-like exploitation between agents.
This seems like an important issue, but given the example, I’m also very interested in how we can detect interactions like this. These are effectively examples of multi-party Goodhart effects, and the example you use is assumed to be “obvious,” so that a patch would obviously be needed. In general this seems unclear: we need to understand the strategic motives to diagnose what is happening, and given that we don’t have good ideas for explainability, I’m unsure how to notice these effects in general so they can be patched. (I have been working on this and thinking about it a bit, and don’t currently have good ideas.)
Note: as I mentioned in the post, I’d love feedback of all types—for better clarity, criticism of my understanding, suggestions for how to continue, and places the discussion still seems confused.
I like that intuition overall, and there is a sense in which adaptive search does give far more resolution than grid search, but the analysis seems wrong; if I use gradient descent, my eventual accuracy is set by the step size near the end, which gives far more precise answers than “only” checking a grid of M^N equally spaced points (see the sketch below).
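A minimal sketch of that comparison in one dimension (the function, grid size, and learning rate are illustrative): the grid’s best answer is limited to roughly the grid spacing, while gradient descent’s final error is driven far below that by its late steps.

```python
import numpy as np

f = lambda x: (x - 0.4567) ** 2   # true minimum at x = 0.4567
df = lambda x: 2 * (x - 0.4567)

M = 10
grid = np.linspace(0, 1, M)
grid_best = grid[np.argmin(f(grid))]  # can't beat the grid spacing

x = 0.0
for _ in range(200):
    x -= 0.3 * df(x)                  # error shrinks geometrically per step

print(abs(grid_best - 0.4567))        # ~0.012: limited by the grid
print(abs(x - 0.4567))                # ~0: far finer than the grid
```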
Pretty sure we’re agreeing here. I was originally just supporting cousin_it’s claim, not claiming that Nash equilibria are a useful-enough solution concept. I was simply noting that, while they are weaker than a useful-enough concept would be, they can show the issue with non-uniqueness clearly (see the example below).
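For concreteness, a standard example of that non-uniqueness (I’m using the textbook Battle of the Sexes payoffs): enumerating pure-strategy equilibria yields two, so “play the Nash equilibrium” underdetermines what the players should do.

```python
import itertools

payoffs = {  # (row action, col action) -> (row payoff, col payoff)
    ("ballet", "ballet"): (2, 1),
    ("ballet", "boxing"): (0, 0),
    ("boxing", "ballet"): (0, 0),
    ("boxing", "boxing"): (1, 2),
}
actions = ["ballet", "boxing"]

def is_nash(r, c):
    """Neither player can gain by unilaterally deviating."""
    row_ok = all(payoffs[(r, c)][0] >= payoffs[(r2, c)][0] for r2 in actions)
    col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, c2)][1] for c2 in actions)
    return row_ok and col_ok

print([rc for rc in itertools.product(actions, actions) if is_nash(*rc)])
# [('ballet', 'ballet'), ('boxing', 'boxing')] -- two distinct pure equilibria
```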