I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek’s team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.
Jeremy Gillen
Explaining inner alignment to myself
I like all of this post except for the conclusion. I think this comment shows that the definition of inner alignment requires an explicit optimizer. Your broader definition of inner misalignment is equivalent to generalization error or robustness, which we already have a name for.
I currently see inner alignment problems as a superset of generalisation error and robustness.
What would you include as an inner alignment problem that isn’t a generalization problem or robustness problem?
Hmm, yeah, I like your edit; it breaks down the two definitions well. I definitely prefer the second one: I’d rather confusing terms like this have very specific definitions than broad vague ones, because that helps me think about whether a proposed solution actually solves the problem being pointed to. Like you, I have seen people use inner (mis)alignment to refer to things outside the original strict definition, but as far as I know the comment I linked to is the most recent clarification of the definition (I haven’t checked this). If there are more recent discussions involving the people who coined the term, I would defer to them.
Regarding the crux that you mention in the edit:
whether or not deciding whether an AI is an optimizer, and finding its objective, is a well-defined procedure for powerful AIs
If you mean precisely mathematically well-defined, then I think this is too high a standard. I think it is sufficient that we be able to point toward archetypal examples of optimizing algorithms and say “stuff like that”.
I think the main reason I care about this distinction is that generalization error without learned optimizers doesn’t seem to be a huge problem, whereas “sufficiently powerful optimizing algorithms with imperfectly aligned goals” seems like a world-ending level of problem. Do you agree with this?
Recently I’ve seen a bunch of high status people using “inner alignment” in the more general sense, so I’m starting to think it might be too late to stick to the narrow definition. E.g. this post.
Any AI that does a task well is in some sense optimizing for good performance on that task.
I disagree with this. To me there are two distinct approaches, one is to memorize which actions did well in similar training situations, and the other is to predict the consequences of each action and somehow rank each consequence.
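To illustrate the distinction I mean, here’s a toy sketch (everything in it is hypothetical, just to make the contrast concrete): a lookup-table policy that memorizes which action did well in training, versus a planner that predicts the consequence of each action and ranks the outcomes. On the training states they behave nearly identically, even though only the second is doing anything optimizer-like.

```python
# Toy environment: states 0..10, actions -1/+1, reward peaks at state 7.
def step(state, action):
    return max(0, min(10, state + action))

def reward(state):
    return -abs(state - 7)

# Approach 1: memorize which action did well in each training situation.
memorized_actions = {s: (1 if s < 7 else -1) for s in range(11)}

def memorizing_policy(state):
    return memorized_actions[state]

# Approach 2: predict the consequences of each action and rank them.
def planning_policy(state):
    return max([1, -1], key=lambda a: reward(step(state, a)))
```

In-distribution the two are behaviorally indistinguishable (except at the goal state, where both actions are equally good), which is part of why the distinction is easy to blur when only looking at behavior.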
for a sufficiently complex training objective, even a very powerful agent-y AI will have a “fuzzy” goal that isn’t an exact specification of what it should do (for example, humans don’t have clearly defined objectives that they consistently pursue).
I disagree with this, but I can’t put it into clear words. I’ll think more about it. It doesn’t seem true for model-based RL, unless we explicitly build in uncertainty over goals. I think it’s only true for humans for value-loading-from-culture reasons.
Thanks for reading!
Are you referring to this post? I hadn’t read that, thanks for pointing me in that direction. I think technically my subtitle is still correct, because the way I defined priors in the footnotes covers any part of the training procedure that biases it toward some hypotheses over others. So if the training procedure is likely to be hijacked by “greedy genes” then it wouldn’t count as having an “accurate prior”.
I like the learning theory perspective because it allows us to mostly ignore optimization procedures, making it easier to think about things. This perspective works nicely until the outer optimization process can be manipulated by the hypothesis. After reading John’s post I think I did lean too hard on the learning theory perspective.
I didn’t have much to say about deception because I considered it a straightforward extension of inner misalignment, but I think I was wrong: the “optimization demon” perspective is a good way to think about it.
Broad Basins and Data Compression
I think it was smooth-ish, although we didn’t plot it. It was decelerating really slowly, but we also didn’t quantify that. We could send you the notebook (bit of a mess)?
I really wanted to extend the experiment to more complicated data, and bigger networks with more layers, but we didn’t have time. I plan to do this after the MATS program ends in late September if someone else hasn’t done it first.
Project proposal: Testing the IBP definition of agent
Yeah, that would be interesting, but how would we tell the difference between trivial params (I’m assuming this means the function doesn’t change anywhere) and equal-loss models? Estimate this by sampling points out of distribution?
I kind of assumed that all changes in the parameters change the function, but that some areas of the loss landscape change the function faster than others. That would be my prediction.
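To make the sampling idea above concrete, here’s a minimal sketch (a toy network of my own construction, not anything from the project) of distinguishing a trivial parameter direction from a function-changing one by sampling inputs, including far out-of-distribution ones, and measuring the largest output difference:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network with a "dead" unit: its outgoing weight is zero, so perturbing
# its incoming weights is a trivial parameter change (function unchanged).
W1 = rng.normal(size=(4, 2))
W2 = np.array([[1.0, 1.0, 1.0, 0.0]])  # unit 3 is disconnected from the output

def f(x, W1):
    return W2 @ np.tanh(W1 @ x)

def max_functional_change(W1a, W1b, n_samples=1000, scale=10.0):
    # Sample inputs broadly (well beyond any training distribution) and take
    # the largest output difference as an estimate of functional change.
    xs = rng.normal(size=(n_samples, 2)) * scale
    return max(float(abs(f(x, W1a) - f(x, W1b))) for x in xs)

# Perturbing the dead unit's incoming weights: trivial direction.
trivial = W1.copy(); trivial[3] += 1.0
# Perturbing a live unit's incoming weights: function actually changes.
live = W1.copy(); live[0] += 1.0
```

This is only an estimate, of course; sampling can’t certify that the function is unchanged everywhere, just bound how much it changes on the sampled region.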
Finding Goals in the World Model
Yeah I agree in retrospect about utility functions not being a good formulation of corrigibility, we phrased it like that because we spent some time thinking about the MIRI corrigibility paper, which uses a framing like this to make it concrete.
On outer alignment: I think if we have a utility function over universe histories/destinies, then this framing is general enough that any outer alignment solution should be expressible in it, although it might not end up being the most natural framing.
On cruxes: Good point, we started off using it pretty much correctly but ended up abusing those sections. Oops.
On soft optimization: We talked a lot about quantilizers; the quantilizers paper is among my favorite papers. I’m not really convinced yet that the problems we would want an AGI to solve (in the near term) are in the “requires super high optimization pressure” category. But we did discuss how to improve the capabilities of quantilizers, by adjusting the level of quantilization based on some upper bound on the uncertainty about the goal on the local task.
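For readers unfamiliar with quantilization, here’s a minimal sketch of the core idea (sample from a base distribution, then choose uniformly among the top q fraction by estimated utility; the base distribution, utility, and parameters here are toy stand-ins):

```python
import random

def quantilize(base_sample, utility, q, n=1000):
    """Draw n actions from the base distribution, then pick uniformly at
    random from the top q fraction ranked by estimated utility."""
    actions = [base_sample() for _ in range(n)]
    actions.sort(key=utility, reverse=True)
    top = actions[:max(1, int(q * n))]
    return random.choice(top)

# Toy example: base distribution N(0, 1) over actions, utility peaks at 3.
pick = quantilize(lambda: random.gauss(0, 1), lambda a: -(a - 3) ** 2, q=0.1)
```

The adjustment mentioned above would amount to choosing q per task: lower q (harder optimization) when goal uncertainty is low, higher q when it is high.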
On pointers problem: Yeah we kind of mushed together extra stuff into the pointers problem section, because this was how our discussion went. Someone did also argue that it would be more of a problem if NAH was weak, but overall I thought it would probably be a bad way to frame the problem if the NAH was weak.
That’s correct that it simultaneously infers the policy and utility function. To avoid the underspecification problem, it uses a prior that favors higher intelligence agents. This is similar to taking assumptions 1 and 2a from http://proceedings.mlr.press/v97/shah19a/shah19a.pdf
I see this proposal as reducing the level of deep understanding of neural networks that would be required to have an “agentometer”.
If we had a way of iterating over every “computation” in the world model, then in principle, we could use the definition of intelligence above to measure the intelligence of each computation, and filter out all the low intelligence ones. I think this covers most of the work required to identify the operator.
Working out how to iterate over every computation in the world model is the difficult part. We could try iterating over subnetworks of the world model, but it’s not clear this would work. Maybe iterate over pairs of regions in activation space? Of course these are not practical spaces to search over, but once we know the right type signature to search for, we can probably speed up the search by developing heuristic guided methods.
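The outer filtering loop described above is simple once you grant the hard parts; here `g` is just a black-box callable standing in for the competence-based intelligence measure from the post, and the candidate enumeration is assumed given:

```python
def find_agent_like_computations(candidates, g, threshold):
    """Filter candidate sub-computations of the world model by measured
    intelligence. `g` stands in for the post's intelligence measure;
    `candidates` stands in for whatever enumeration of computations
    (subnetworks, pairs of activation-space regions, ...) we settle on."""
    return [c for c in candidates if g(c) >= threshold]
```

All the difficulty is hidden in producing `candidates` and computing `g`, which is the point of the comment: the type signature of the search space is the open problem, not the filter.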
An agentometer is presumably a thing that finds stuff that looks like (some mathematically precise operationalisation of) bundles-of-preferences-that-are-being-optimised-on. If you have that, can’t you just look through the set of things like this in the AI’s world model that’s active when it’s, say, talking to the operator, or looking at footage of the operator on camera, or anything else that’d probably require thinking about the operator in some fashion, and point at the bundle of preferences that gets lit up by that?
Yeah this is approximately how I think the “operator identification” would work.
Is the fear here that the AI may eventually stop thinking of the operator as a bundle-of-preferences-that-are-optimised-on, i.e. an “agent” at all, in favour of some galaxy brained superior representation only a superintelligence would come up with?
Yeah this is one of the fears. The point of the intelligence measuring equation for g is that it is supposed to work, even on galaxy brained world model ontologies. It only works by measuring competence of a computation at achieving goals, not by looking at the structure of the computation for “agentyness”.
it can even spit out complete agent policies for you to run and modify
These can be computations that aren’t very agenty, or don’t match an agent at all, or only match part of an agent, so the part that spits out potential policies doesn’t have to be very good. The g computation is then used to find, among these, the ones that best match an agent.
I love the idea of modeling humans how they want to be modeled. I think of this as like a fuzzy pointer to human values, that sharpens itself? But I’m confused about how to implement this, or formalize this process.
I hadn’t seen your sequence, I’m a couple of posts in, it’s great so far. Does it go into formalizing the process you describe?
The reason this method might be useful is that it allows the agent to “fantasize” about actions it would take if it could. We don’t want it to take these actions in the real world. For example, it could explore bizarre hypotheticals like: “turn the whole world into gold/computronium/hedonium”.
If we had a sub-human physical robot, and we were confident that it wouldn’t do any damage during unconstrained real-world training, then I can’t see any additional benefit to using our method: you can just do normal RL training.
And if it’s super-human, we wouldn’t want to use our technique in the real world, because it would turn us into gold/computronium/hedonium during training.
I think it might help a bit with non-model-based approaches, because it will be a bit like training on adversarial environments. But with model-based RL, this technique should reduce the distribution shift much more: the shift from “being powerful in sim” → “being powerful in reality” should be much larger than the shift from “being powerful in my world model” → “being powerful in reality (as observed using a slightly better version of my world model)”.
PPO isn’t model-based, so we’ll test it out with that.
One challenge here is that it may learn to use certain superpowers or not to use them rather than learning to seek or avoid a particular state of the world. Do you think this is likely to be an issue?
Yeah this is a good point, it’s not clear whether it will generalise to “use superpowers when available and seek aligned states, but act unaligned otherwise”, or “use superpowers when available and seek aligned states, and act aligned otherwise”. My hope is that the second one is more likely, because it should be a simpler function to represent. This does assume that the model has a simplicity bias, but I think that’s reasonable.
Yeah might depend on the details, but it shouldn’t learn to always use or not use superpowers, because it should still be trained on some normal non-superpower rollouts. So the strategy it learns should always involve using superpowers when available but not using them when not, otherwise it’d get high training loss.
I agree in the case of a model-free agent (although we think having the agent find its own adversarial examples should scale up better).
In the model-based case, I think the situation is better. Because you can implement the superpowers on the agent’s own world model (i.e. a mu-zero variant with the additional action of overwriting some or all of its world-model latent state during rollouts), the distribution shift that happens as capabilities get higher is much smaller, and depends mainly on how much the world model has changed its representation of the world state. This is a strictly smaller distribution shift than what you would have otherwise, because it has ~eliminated the shift that comes from not being able to access most states during the lower-capabilities regime.
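A minimal sketch of what the extra action might look like during world-model rollouts (the latent representation, dynamics, and “superpower” effect here are all toy assumptions, not the actual mu-zero machinery):

```python
def rollout(latent, policy, dynamics, horizon=10, superpowers_enabled=True):
    """Roll out the policy inside the world model. The extra "overwrite"
    action directly sets part of the latent state, simulating having a
    superpower, instead of stepping the learned dynamics."""
    trajectory = [latent]
    for _ in range(horizon):
        action = policy(latent)
        if action == "overwrite" and superpowers_enabled:
            latent = {"resources": 100}  # directly overwrite a latent variable
        else:
            latent = dynamics(latent, action)
        trajectory.append(latent)
    return trajectory

# Toy dynamics: only the "work" action slowly accumulates resources.
def toy_dynamics(latent, action):
    return {"resources": latent["resources"] + (1 if action == "work" else 0)}
```

During training, some rollouts would have `superpowers_enabled=False`, which is what keeps the learned policy grounded in normal non-superpower behavior as well.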
I’m glad you wrote a post about this topic. When I was first reading the sequences, I didn’t find the posts by Eliezer on Induction very satisfying, and it was only after reading Jaynes and a bunch of papers on Solomonoff induction that I felt I had a better understanding of the situation. This post might have sped up that process for me by a day or two, if I had read it a year ago.
There was a little while where I thought Solomonoff Induction was a satisfying solution to the problem of induction. But there doesn’t seem to be any justification for the order over hypotheses in the Solomonoff Prior. Is there discussion/reading about this that I’m missing?
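For reference, the ordering being questioned: the Solomonoff prior, as usually defined for a prefix-free universal machine $U$, weights each hypothesis (program $p$) by its length,

```latex
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}
```

The Kraft inequality guarantees this sum is at most 1, so the prior is well-defined, but that only shows the $2^{-|p|}$ weighting is *permissible*, not that it is the uniquely justified ordering over hypotheses.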
There are several related concepts (mostly from ML) that have caused me a lot of confusion, because of the way they overlap with each other and are often presented separately. These include Occam’s Razor and the problem of induction, and also “inductive bias”, “simplicity”, “generalisation”, overfitting, model bias and variance, and the general problem of assigning priors. I’d like there to be a post somewhere explaining the relationships between these words. I might try to write it, but I’m not confident I can make it clear.