Co-founder @ Gladstone AI.
Contact: edouard@gladstone.ai
Website: eharr.is
Thanks for the comment!
Not sure if I agree with your interpretation of the “real objective”—might be better served by looking for stable equilibria and just calling them as such.
I think this is a reasonable objection. I don’t make this very clear in the post, but the “true objective” I’ve written down in the example indeed isn’t unique: like any measure of utility or loss, it’s only unique up to affine transformations with positive coefficients. And that could definitely damage the usefulness of these definitions, since it means that alignment factors, for example, aren’t uniquely defined either. (I’ll be doing a few experiments soon to investigate this, and a few other questions, in a couple of real systems.)
Don’t we already have weak alignment to arbitrary functions using annealing (basically, jump at random, but jump around more/further on average when the loss is higher, and lower the jumping rate over time)? The reason we don’t add small annealing terms to gradient descent is entirely that we expect them to be worse in the short term (a “strong alignment” question).
Interesting question! To try to interpret it in light of the definitions I’m proposing: adding annealing changes the true objective (or mesa-objective) of the optimizer, which is no longer just descending the gradient of its loss — it now has this new annealing term that it’s also trying to optimize for. Whether this improves alignment or not depends on the effect annealing has on 1) the long-term performance of the mesa-optimizer on its new (gradient + annealing) objective; and 2) the long-term performance this induces on the base objective.
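For concreteness, here’s a minimal sketch of a gradient step plus the kind of annealing term you describe (my own toy construction; the loss, learning rate, and decay schedule are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(x):
    return float((x ** 2).sum())   # toy loss with its minimum at x = 0

def grad(x):
    return 2 * x                   # gradient of the toy loss

x = rng.normal(size=2)
lr = 0.1

for t in range(1, 1001):
    # Annealing term: random jumps, larger on average when the loss is higher,
    # with an overall jump rate that decays over time.
    temperature = 1.0 / t
    jump = temperature * loss(x) * rng.normal(size=x.shape)
    x = x - lr * grad(x) + jump    # the combined (gradient + annealing) update

print(loss(x))   # near 0 once the annealing noise has died away
```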
Hope that’s somewhat helpful, but please let me know if it’s unclear and I can try to unpack things a bit more!
Thanks for the kind words, Adam! I’ll follow up over DM about early drafts — I’m interested in getting feedback that’s as broad as possible and really appreciate the kind offer here.
Typo is fixed — thanks for pointing it out!
At first I wondered why you were taking the sum instead of just $\mathcal{L}(T) - \mathcal{L}(T+1)$, but after thinking about it, the latter would probably converge to 0 almost all the time, because even with amazing optimization, the loss will stop being improved by a factor linear in $T$ at some point. That might be interesting to put in the post itself.
Yes, the problem with that definition would indeed be that if your optimizer converges to some limiting loss function value like $\mathcal{L}_\infty$, then you’d get $\mathcal{L}(T) - \mathcal{L}(T+1) \to 0$ for any $\mathcal{L}_\infty$.
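To spell out the limit (using the notation above):

$$\lim_{T \to \infty} \mathcal{L}(T) = \mathcal{L}_\infty \;\;\Longrightarrow\;\; \mathcal{L}(T) - \mathcal{L}(T+1) \;\longrightarrow\; \mathcal{L}_\infty - \mathcal{L}_\infty = 0,$$

whereas the sum of improvements telescopes, $\sum_{t=0}^{T-1} \big[\mathcal{L}(t) - \mathcal{L}(t+1)\big] = \mathcal{L}(0) - \mathcal{L}(T)$, and so converges to the nontrivial quantity $\mathcal{L}(0) - \mathcal{L}_\infty$.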
Thanks again!
Really interesting!
I think there might be a minor typo in Section 2.2:
For transitivity, assume that for
I think the indices here should be different, based on the indexing in the rest of the paragraph.
Great post. Thanks for writing this — it feels quite clarifying. I’m finding the diagram especially helpful in resolving the sources of my confusion.
I believe everything here is consistent with the definitions I proposed recently in this post (though please do point out any inconsistencies if you see them!), with the exception of one point.
This may be a fundamental confusion on my part — but I don’t see objective robustness, as defined here, as being a separate concept at all from inner alignment. The crucial point, I would argue, is that we ought to be treating the human who designed our agent as the base optimizer for the entire system.
Zooming in on the part of the diagram that relates inner alignment to objective robustness, I think what’s actually going on is something like:
A human AI researcher wishes to optimize for some base objective, $U$.
It would take too much work for our researcher to optimize for $U$ manually. So our researcher builds an agent to do the work instead, and sets $U$ to be the agent’s loss function.
Depending on how it’s built, the agent could end up optimizing for $U$, or it could end up optimizing for something different. The thing the agent ends up truly optimizing for is the agent’s behavioral objective — let’s call it $U'$. If $U'$ is aligned with $U$, then the agent satisfies objective robustness by the above definition: its behavioral objective is aligned with the base. So far, so good.
But here’s the key point: from the point of view of the human researcher who built the agent, the agent is actually a mesa-optimizer, and the agent’s “behavioral objective” is really just the mesa-objective of that mesa-optimizer.
And now, we’ve got an agent that wishes to optimize for some mesa-objective $U'$. (Its “behavioral objective” by the above definition.)
And then our agent builds a sub-agent to do the work instead, and sets $U'$ to be the sub-agent’s loss function.
I’m sure you can see where I’m going with this by now, but the sub-agent the agent builds will have its own objective $U''$, which may or may not be aligned with $U'$, which may or may not in turn be aligned with $U$. From the point of view of the agent, that sub-agent is a mesa-optimizer. But from the point of view of the researcher, it’s actually a “mesa-mesa-optimizer”.
That is to say, I think there are three levels of optimizers being invoked implicitly here, not just two. Through that lens, “intent alignment”, as defined here, is what I’d call “inner alignment between the researcher and the agent”; and “inner alignment”, as defined here, is what I’d call “inner alignment between the agent and the mesa-optimizer it may give rise to”.
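To make the three levels concrete, here’s a minimal sketch (entirely my own toy construction; the drift functions stand in for whatever training dynamics produce a misaligned objective):

```python
from typing import Callable

Objective = Callable[[float], float]   # scores a candidate solution x

def delegate(loss_handed_down: Objective,
             drift: Callable[[Objective], Objective]) -> Objective:
    # A level of the hierarchy hands `loss_handed_down` to the level below it;
    # what that level *truly* ends up optimizing may differ, modeled here as `drift`.
    return drift(loss_handed_down)

U = lambda x: -(x - 1.0) ** 2   # the researcher's base objective U

# The agent trained on U ends up truly pursuing U' (its behavioral/mesa-objective):
U_prime = delegate(U, lambda f: (lambda x: f(x) + 0.3 * x))

# The sub-agent built by the agent ends up pursuing U'' in turn: a
# mesa-mesa-optimizer, from the researcher's point of view.
U_double_prime = delegate(U_prime, lambda f: (lambda x: f(x) - 0.2 * abs(x)))

# Alignment of each link (U vs U', U' vs U'') is a separate question from
# alignment across the whole chain (U vs U'').
print(U(1.0), U_prime(1.0), U_double_prime(1.0))
```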
In other words, humans live in this hierarchy too, and we should analyze ourselves in the same terms — and using the same language — as we’d use to analyze any other optimizer. (I do, for what it’s worth, make this point in my earlier post — though perhaps not clearly enough.)
Incidentally, this is one of the reasons I consider the concepts of inner alignment and mesa-optimization to be so compelling. When a conceptual tool we use to look inside our machines can be turned outward and aimed back at ourselves, that’s a promising sign that it may be pointing to something fundamental.
A final caveat: there may well be a big conceptual piece that I’m missing here, or a deep confusion that I have around one or more of these concepts that I’m still unaware of. But I wanted to lay out my thinking as clearly as I could, to make it as easy as possible for folks to point out any mistakes — would enormously appreciate any corrections!
Sure, makes sense! Though to be clear, I believe what I’m describing should apply to optimizers other than just gradient descent — including optimizers one might think of as reward-maximizing agents.
Thanks, Rohin!
Please note that I’m currently working on a correction for part of this post — the form I claim for the mesa-objective is in fact wrong, as Charlie correctly alludes to in a sibling comment.
Great post.
I responded that for me, the whole point of the inner alignment problem was the conspicuous absence of a formal connection between the outer objective and the mesa-objective, such that we could make little to no guarantees based on any such connection.
Strong agree. In fact I believe developing the tools to make this connection could be one of the most productive focus areas of inner alignment research.
What I’d like to have would be several specific formal definitions, together with several specific informal concepts, and strong stories connecting all of those things together.
In connection with this, it may be worth checking out my old post where I try to untangle capability from alignment in the context of a particular optimization problem. I now disagree with around 20% of what I wrote there, but I still think it was a decent first stab at formalizing some of the relevant definitions, at least from a particular viewpoint.
Love this idea. From the linked post on the BAIR website, the idea of “prompting” a Minecraft task with e.g. a brief sequence of video frames seems especially interesting.
Would you anticipate the benchmark version of this would ask participants to disclose metrics such as “amount of task-specific feedback or data used in training”? Or does this end up being too hard to quantify because you’re explicitly expecting folks to use a variety of feedback modalities to train their agents?
That makes sense, though I’d also expect that LfHF benchmarks like BASALT could turn out to be a better fit for superscale models in general. (E.g. a BASALT analogue might do a better job of capturing the flexibility of GPT-N or DALL-E type models than current benchmarks do, though you’d probably need to define a few hundred tasks for that to be useful. It’s also possible this has already been done and I’m unaware of it.)
Late comment here, but I really liked this post and want to make sure I’ve fully understood it. In particular there’s a claim near the end which says: if $X$ is not fixed, then we can build equivalent models $M_1', M_2'$, for which it is fixed. I’d like to formalize this claim to make sure I’m 100% clear on what it means. Here’s my attempt at doing that:
For any pair of models $(M_1, M_2)$ under which $X$ is not fixed, there exists a variable $X'$ (of which $X$ is a subset) and a pair of models $(M_1', M_2')$, such that 1) $X'$ is fixed for any value $x'$ in its domain; and 2) the behavior of the system is the same under $(M_1', M_2')$ as it was under $(M_1, M_2)$.
To satisfy this claim, we construct our $X'$ as the conjunction of $X$ and some “extra” component $Z$. E.g., take $X$ to be a coin flip and $Z$ to be a die roll; then $X'$ is the conjunction of the coin flip and the die roll, and the domain of $X'$ is the outer product of the coin-flip domain and the die-roll domain.
Then we construct our $M_1'$ by imposing 1) $P(x, z \mid M_1') = P(x \mid M_1')\,P(z \mid M_1')$ (i.e., $X$ and $Z$ are logically independent given $M_1'$, for every $x$ and $z$); and 2) $P(x \mid M_1') = P(x \mid M_1)$ (i.e., the marginal probability of $X$ given $M_1'$ equals the original probability under $M_1$).
Finally we construct $M_2'$ by imposing the analogous two conditions that we did for $M_1'$: 1) $P(x, z \mid M_2') = P(x \mid M_2')\,P(z \mid M_2')$ and 2) $P(x \mid M_2') = P(x \mid M_2)$. But we also impose the extra condition 3) needed to make $X'$ fixed under $M_2'$ (assuming finite sets, etc.).
We can always find a $Z$, an $M_1'$, and an $M_2'$ that satisfy the above conditions, and with these choices we end up with $X'$ fixed (i.e., condition 1 of the claim holds for all $x'$) and with the system behaving exactly as before (i.e., it retains the same dynamics).
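To make the construction concrete, here’s a quick numeric sketch of conditions 1 and 2 for $M_1'$ in the coin/die example (my own illustration; the particular probabilities are made up):

```python
import numpy as np

p_X_M1 = np.array([0.5, 0.5])   # model M1's distribution over the coin flip X (made up)
p_Z_M1p = np.ones(6) / 6        # distribution over the "extra" die roll Z under M1' (made up)

# Condition 1: X and Z are independent given M1', so the joint over X' = (X, Z)
# is the outer product. Condition 2: the X-marginal matches M1.
p_Xp_M1p = np.outer(p_X_M1, p_Z_M1p)   # shape (2, 6): the domain of X'

assert np.allclose(p_Xp_M1p.sum(axis=1), p_X_M1)        # condition 2 holds
marg_X, marg_Z = p_Xp_M1p.sum(axis=1), p_Xp_M1p.sum(axis=0)
assert np.allclose(p_Xp_M1p, np.outer(marg_X, marg_Z))  # condition 1 holds
```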
Is this basically right? Or is there something I’ve misunderstood?
Ah yes, that’s right. Yeah, I just wanted to make this part fully explicit to confirm my understanding. But I agree it’s equivalent to just let $M_2'$ ignore the extra $Z$ (or whatever) component.
Thanks very much!
If we wish, we could replace or re-define “capability robustness” with “inner robustness”, the robustness of pursuit of the mesa-objective under distributional shift.
I strongly agree with this suggestion. IMO, tying capability robustness to the behavioral objective confuses a lot of things, because the set of plausible behavioral objectives is itself not robust to distributional shift.
One way to think about this from the standpoint of the “Objective-focused approach” might be: the mesa-objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts. To be precise: suppose we take the world and split it into an “agent” part and an “environment” part. Then we expose the agent to every possible environment (or data distribution) allowed by our laws of physics, and we note down what the agent does in each of them. Any objective function that’s consistent with our agent’s actions across all of those environments must then count as a valid mesa-objective. (This is pretty much Amin & Singh’s omnipotent experimenter setting.)
The behavioral objective, meanwhile, would be more like the thing the agent appears to be pursuing under some subset of possible distributional shifts. This is the more realistic case where we can’t afford to expose our agent to every environment (or data distribution) that could possibly exist, so we make do and expose it to only a subset of them. Then we look at what objectives could be consistent with the agent’s behavior under that subset of environments, and those count as valid behavioral objectives.
The key here is that the set of allowed mesa-objectives is a reliable invariant of the agent, while the set of allowed behavioral objectives is contingent on our observations of the agent’s behavior under a limited set of environments. In principle, the two sets of objectives won’t converge perfectly until we’ve run our agent in every possible environment that could exist.
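As a toy illustration of the difference (my own construction, not from Amin & Singh; environments are modeled as menus of states the experimenter offers the agent):

```python
from itertools import combinations

STATES = [0, 1, 2]

def agent_action(menu):
    # Stand-in for the black-box agent: it secretly prefers states near 1.
    return max(menu, key=lambda s: -(s - 1) ** 2)

# A small pool of candidate objective functions (purely illustrative).
candidates = {
    "prefers-0": lambda s: -s,
    "prefers-1": lambda s: -(s - 1) ** 2,
    "prefers-2": lambda s: s,
}

def consistent(u, menus):
    # u is consistent if the agent's choice maximizes u on every menu observed.
    return all(u(agent_action(m)) == max(u(s) for s in m) for m in menus)

# "Environments" = every nonempty menu of states we could offer the agent.
all_menus = [list(c) for r in range(1, len(STATES) + 1)
             for c in combinations(STATES, r)]
some_menus = [m for m in all_menus if 2 not in m]   # a limited set of experiments

print({name for name, u in candidates.items() if consistent(u, all_menus)})   # consistent under *every* environment
print({name for name, u in candidates.items() if consistent(u, some_menus)})  # consistent under only a subset (a superset)
```

With every experiment allowed, only one candidate survives; with the limited set, several do, which is exactly the gap between the two definitions above.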
So if we do an experiment whose results are consistent with behavioral objectives $O_1, O_2, \ldots$, and we want to measure the agent’s capability robustness with respect to $O_1$, we’d apply a distributional shift and see how well the agent performs at $O_1$. But what if $O_1$ isn’t actually the mesa-objective? Then the fact that the agent appeared to be pursuing $O_1$ at all was just an artifact of the limited set of experiments we were running. So if our agent does badly at $O_1$ under the shift, maybe the problem isn’t a capability shortfall — maybe the problem is that the agent doesn’t care about $O_1$ and never did.
Whereas with your definition of inner robustness, we’d at least be within our rights to say that the true mesa-objective was $O_{\text{mesa}}$, and therefore that doing badly at $O_{\text{mesa}}$ really does say something about the capability of our agent.
which stems from the assumption that you are able to carve an environment up into an agent and an environment and place the “same agent” in arbitrary environments. No such thing is possible in reality, as an agent cannot exist without its environment
I might be misunderstanding what you mean here, but carving up a world into agent vs environment is absolutely possible in reality, as is placing that agent in arbitrary environments to see what it does. You can think of the traditional RL setting as a concrete example of this: on one side we have an agent that is executing some policy $\pi(a \mid s)$; and on the other side we have an environment that consists of state transition dynamics given by some distribution $T(s' \mid s, a)$. One can in fact show (see the unidentifiability in IRL paper) that if an experimenter has the power to vary the environment arbitrarily and look at the policies the agent pursues in each of those environments, then that experimenter can recover a reward function that is unique up to the usual affine transformations.
That recovered reward function is a fortiori a reliable invariant of the agent, since it is consistent with the agent’s actions under every possible environment the agent could be exposed to. (To be clear, this claim is also proved in the paper.) It also seems reasonable to identify that reward function with the mesa-objective of the agent, because any mesa-objective that is not identical with that reward function has to be inconsistent with the agent’s actions on at least one environment.
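(To spell out the standard fact behind “unique up to the usual affine transformations”: for any constants $a > 0$ and $b$, replacing a reward function $R$ with $aR + b$ rescales and shifts every policy’s expected return identically,

$$\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t \big(a\,R(s_t) + b\big)\Big] = a\,\mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t R(s_t)\Big] + \frac{b}{1-\gamma},$$

so the ordering over policies is unchanged, and no behavioral experiment can distinguish $R$ from $aR + b$.)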
Admittedly there are some technical caveats to this particular result: off the top of my head, 1) the set of states & actions is fixed across environments; 2) the result was proved only for finite sets of states & actions; and 3) the agent’s policy is assumed to be optimal. I could definitely imagine taking issue with some of these caveats — is this the sort of thing you mean? Or perhaps you’re skeptical that a proof like this in the RL setting could generalize to the train/test framing we generally use for NNs?
in the OOD robustness literature you try to optimize worst-case performance over a perturbation set of possible environments.
Yeah that’s sensible because this is often all you can do in practice. Having an omnipotent experimenter is rarely realistic, but imo it’s still useful as a way to bootstrap a definition of the mesa-objective.
Btw, if you’re aware of any counterpoints to this — in particular anything like a clearly worked-out counterexample showing that one can’t carve up a world, or recover a consistent utility function through this sort of process — please let me know. I’m directly working on a generalization of this problem at the moment, and anything like that could significantly speed up my work.
Thanks!!
Ah I see! Thanks for clarifying.
Yes, the point about the Cartesian boundary is important. And it’s completely true that any agent / environment boundary we draw will always be arbitrary. But that doesn’t mean one can’t usefully draw such a boundary in the real world — and unless one does, it’s hard to imagine how one could ever generate a working definition of something like a mesa-objective. (Because you’d always be unable to answer the legitimate question: “the mesa-objective of what?”)
Of course the right question will always be: “what is the whole universe optimizing for?” But it’s hard to answer that! So in practice, we look at bits of the whole universe that we pretend are isolated. All I’m saying is that, to the extent you can meaningfully ask the question, “what is this bit of the universe optimizing for?”, you should be able to clearly demarcate which bit you’re asking about.
(i.e. I agree with you that duality is a useful fiction, just saying that we can still use it to construct useful definitions.)
I would further add that looking for difficulties created by the simplification seems very intellectually productive.
Yep, strongly agree. And a good first step to doing this is to actually build as robust a simplification as you can, and then see where it breaks. (Working on it.)
I’m not sure what would constitute a clearly-worked counterexample. To me, a high reliance on an agent/world boundary constitutes a “non-naturalistic” assumption, which simply makes me think a framework is more artificial/fragile.
Oh for sure. I wouldn’t recommend having a Cartesian boundary assumption as the fulcrum of your alignment strategy, for example. But what could be interesting would be to look at an isolated dynamical system, draw one boundary, investigate possible objective functions in the context of that boundary; then erase that first boundary, draw a second boundary, investigate that; etc. And then see whether any patterns emerge that might fit an intuitive notion of agency. But the only fundamentally real object here is always going to be the whole system, absolutely.
As I understand, something like AIXI forces you to draw one particular boundary because of the way the setting is constructed (infinite on one side, finite on the other). So I’d agree that sort of thing is more fragile.
The multiagent setting is interesting though, because it gets you into the game of carving up your universe into more than 2 pieces. Again it would be neat to investigate a setting like this with different choices of boundaries and see if some choices have more interesting properties than others.
Yeah I agree this is a legitimate concern, though it seems like it is definitely possible to make such a demarcation in toy universes (like in the example I gave above). And therefore it ought to be possible in principle to do so in our universe.
To try to understand a bit better: does your pessimism about this come from the hardness of the technical challenge of querying a zillion-particle entity for its objective function? Or does it come from the hardness of the definitional challenge of exhaustively labeling every one of those zillion particles to make sure the demarcation is fully specified? Or is there a reason you think constructing any such demarcation is impossible even in principle? Or something else?
I’m with you on this, and I suspect we’d agree on most questions of fact around this topic. Of course demarcation is an operation on maps and not on territories.
But as a practical matter, the moment one starts talking about the definition of something such as a mesa-objective, one has already unfolded one’s map and started pointing to features on it. And frankly, that seems fine! Because historically, a great way to make forward progress on a conceptual question has been to work out a sequence of maps that give you successive degrees of approximation to the territory.
I’m not suggesting actually trying to imbue an AI with such concepts — that would be dangerous (for the reasons you alluded to) even if it wasn’t pointless (because prosaic systems will just learn the representations they need anyway). All I’m saying is that the moment we started playing the game of definitions, we’d already started playing the game of maps. So using an arbitrary demarcation to construct our definitions might be bad for any number of legitimate reasons, but it can’t be bad just because it caused us to start using maps: our earlier decision to talk about definitions already did that.
(I’m not 100% sure if I’ve interpreted your objection correctly, so please let me know if I haven’t.)
Update: having now thought more deeply about this, I no longer endorse my above comment.
While I think the reasoning was right, I got the definitions exactly backwards. To be clear, what I would now claim is:
The behavioral objective is the thing the agent is revealed to be pursuing under arbitrary distributional shifts.
The mesa-objective is something the agent is revealed to be pursuing under some subset of possible distributional shifts.
Everything in the above comment then still goes through, except with these definitions reversed.
On the one hand, the “perfect IRL” definition of the behavioral objective seems more naturally consistent with the omnipotent experimenter setting in the IRL unidentifiability paper cited downthread. As far as I know, perfect IRL isn’t defined anywhere other than by reference to this reward modelling paper, which introduces the term but doesn’t define it either. But the omnipotent experimenter setting seems to capture all the properties implied by perfect IRL, and does so precisely enough that one can use it to make rigorous statements about the behavioral objective of a system in various contexts.
On the other hand, it’s actually perfectly possible for a mesa-optimizer to have a mesa-objective that is inconsistent with its own actions under some subset of conditions (the key conceptual error I was making was in thinking this was not possible). For example, a human being is a mesa-optimizer from the point of view of evolution. A human being may have something like “maximize happiness” as their mesa-objective. And a human being may, and frequently does, do things that do not maximize their happiness.
A few consequences of the above:
Under an “omnipotent experimenter” definition, the behavioral objective (and not the mesa-objective) is a reliable invariant of the agent.
It’s entirely possible for the behavioral objective to be overdetermined in certain situations. I.e., if we run every possible experiment on an agent, we may find that the only reward / utility function consistent with its behavior across all those experiments is the trivial utility function that’s constant across all states.
If the behavioral objective of a system is overdetermined, that might mean the system never pursues anything coherently. But it might also mean that there exist subsets of distributions on which the system pursues an objective very coherently, but that different distributions induce different coherent objectives.
The natural way to use the mesa-objective concept is to attach it to one of these subsets of distributions on which we hypothesize our system is pursuing a goal coherently. If we apply a restricted version of the omnipotent experimenter definition — that is, run every experiment on our agent that’s consistent with the subset of distributions we’re conditioning on — then we will in general recover a set of mesa-objective candidates consistent with the system’s actions on that subset.
It is strictly incorrect to refer to “the” mesa-objective of any agent or optimizer. Any reference to a mesa-objective has to be conditioned on the subset of distributions it applies on, otherwise it’s underdetermined. (I believe Jack refers to this as a “perturbation set” downthread.)
This seems like it puts these definitions on a more rigorous footing. It also starts to clarify in my mind the connection with the “generalization-focused approach” to inner alignment, since it suggests a procedure one might use in principle to find out whether a system is pursuing coherent utilities on some subset of distributions. (“When we do every experiment allowed by this subset of distributions, do we recover a nontrivial utility function or not?”)
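As a toy illustration of that check (entirely my own construction; the agent, the menus-as-environments, and the candidate utilities are all made up):

```python
from itertools import combinations

STATES = [0, 1, 2, 3]

def agent_action(menu):
    # Black-box agent: coherent (picks the smallest state) on menus of size <= 2,
    # erratic on larger menus.
    return min(menu) if len(menu) <= 2 else menu[-1]

candidates = {
    "constant":     lambda s: 0.0,   # the trivial utility, constant across states
    "prefers-low":  lambda s: -s,
    "prefers-high": lambda s: s,
}

def consistent(u, menus):
    # u is consistent if the agent's choice maximizes u on every menu observed.
    return all(u(agent_action(m)) == max(u(s) for s in m) for m in menus)

all_menus   = [list(c) for r in range(1, len(STATES) + 1)
               for c in combinations(STATES, r)]
small_menus = [m for m in all_menus if len(m) <= 2]   # one "perturbation set"

print({n for n, u in candidates.items() if consistent(u, all_menus)})    # only the constant utility
print({n for n, u in candidates.items() if consistent(u, small_menus)})  # a nontrivial utility survives
```

Here the full experiment set overdetermines the behavioral objective (only the constant utility survives), while the restricted perturbation set recovers a coherent, nontrivial mesa-objective candidate.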
Would definitely be interested in getting feedback on these thoughts!
Great framework—feels like this is touching on something fundamental.
I’m curious: is the controllable / observable terminology intentionally borrowed from control theory? Or is that a coincidence?