I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek’s team at MIRI. Most of my writing before mid-2023 is not representative of my current views about alignment difficulty.
Jeremy Gillen
Without fundamental advances, misalignment and catastrophe are the default outcomes of training powerful AI
Soft optimization makes the value target bigger
Finding Goals in the World Model
Broad Basins and Data Compression
“they should clearly communicate their non-respectful/-kind alternative communication protocols beforehand, and they should help the other person maintain their boundaries;”
Nate did this.
By my somewhat idiosyncratic views on respectful communication, Nate was roughly as respectful as Thomas Kwa.
I do seem to be unusually emotionally compatible with Nate’s style of communication though.
Project proposal: Testing the IBP definition of agent
Thanks for reading it, it’s good to know exactly where you think the argument is weakest and I appreciate the effort of going through and noting differences.
On section 4:
“I definitely don’t feel confident that any of the mentioned problems will arise in practice. I don’t see why I should believe in an inner/outer shell breakdown of constraints — this section seemed quite speculative.”
This surprises me actually; I thought this section was solid conditional on the previous assumptions. I think you shouldn’t think of them as problems that might arise in practice; instead, they should be thought of as reasons why behavioral training under-specifies some parts of the AI algorithm.
I’ll give some more quick examples of “outer-shell constraints”, because examples are very common and it seems odd for this to be called speculative: unendorsed habits or addictions in humans; the instinct to breathe in when you’re trying to hold your breath; appending a checker algorithm C to any algorithm A, to double-check the output of A; any biases in a search ordering (as in the AlphaGo example).
On section 5:
It’s on my todo list to write a comment responding to some of the specifics of Redwood’s control post. This post wasn’t intending to engage with their detailed arguments, just to give the top-level reasons why we believe control is difficult.
“Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?”
Yes, approximately, as I believe you and I are capable of doing. The crux here might be: “If we had 20 PhD students, and incentivized them both to look like they are doing checkable research and also not to make any real useful progress on solving a particular difficult problem, would this be very useful without costing a lot of expert supervision?” If I remember correctly, this was a crux between myself and Ryan during an in-person discussion. It seems likely to me that the PhD students, trying not to be useful while looking good in checkable ways, will succeed at both. More intelligent agents are even more likely to succeed at both. It does depend on the type of research; I agree that formal theorem-proving would be much easier to incentivize properly. I’m talking about the least-easily-checkable 20% of useful research.
“Predicting where that line is seems pretty difficult, especially given the AI cannot trust sensory inputs, does not know what time it is.”
Seems easy enough to predict given a roughly human-scientist-team level of capabilities. And it seems hard to get useful work out of a human-scientist-team without them also working out ways to (maybe imperfectly) distinguish “faked sensory inputs” from real ones, and without leaking any information about what time it is. E.g. we need the AI to build on previous work.
“‘There needs to be sufficient human-generated research in the domain where we are attempting to elicit the AI’s capabilities.’ Well, no. You either need sufficient labeled data in the domain where we’re trying to elicit capabilities, or good generalization from other domains.”
Nice, this is my other crux with Ryan, which we phrased roughly as “how data-efficient can we make training?”. My argument against very high data efficiency is that current long-horizon RL isn’t very data-efficient, and intentional exploration hacking seems plausible in this situation. (A bad analogy, but for the sake of describing exploration hacking: someone is training you to solve Rubik’s cubes; you can avoid succeeding at solving a Rubik’s cube in the first place, and use other strategies to mess with their reward signal. This works as long as the trainers don’t know the specific steps required to solve the cube, but even then I think there are some strategies that could work to reduce generalization.)
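To make the exploration hacking mechanism concrete, here’s a minimal sketch (my own toy construction, not something from the discussion with Ryan): a REINFORCE-style learner on a two-armed bandit, where the agent itself controls which actions get tried. If the rewarded action is never sampled, the policy gradient never reinforces it, and training extracts nothing.

```python
import numpy as np

# Two-armed bandit: arm 1 ("solve the cube") pays 1, arm 0 pays 0.
rewards = np.array([0.0, 1.0])

# Softmax policy over the two arms, trained with REINFORCE.
logits = np.zeros(2)
lr = 0.1

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(1000):
    p = softmax(logits)
    # Exploration hacking: the agent picks which actions to try, and it
    # always avoids the rewarded arm (sandbagging).
    a = 0
    r = rewards[a]
    # REINFORCE update: r * grad log pi(a). With r = 0 on every sampled
    # action, the gradient is zero and the reward signal never bites.
    grad_log_pi = -p
    grad_log_pi[a] += 1.0
    logits += lr * r * grad_log_pi

print(softmax(logits))  # still [0.5, 0.5]: the capability was never elicited
```

The trainers can only break this by already knowing (and forcing) the behavior they want to reinforce, which is the sense in which checkability matters.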
“The data need not be human-generated, it could be generated by other AIs, as long as we trust it.”
?? This seems to be assuming a solution to the problem.
“I think there are still numerous directions of hope though, e.g., leaning more on generalization, latent adversarial training.”
I agree there are numerous directions of hope, but disagree that either of your examples count, if I understand them correctly. “leaning more on generalization” ≈ “leaning more on agency”, which doesn’t seem good to me. Latent adversarial training seems like it could make an agent robust to a few extra categories of internal perturbation, but doesn’t seem like it would help with any problem that we were focused on in this post.
“I agree that we don’t have strong reasons to think one thing or another here, but I think goal modification is reasonably likely: humans can iterate a bunch and maybe have okay interpretability tools (assuming decent but not fundamental advances). Also, as discussed, goals might be changing a bunch during training — that’s not totally asymmetric, it also gives us hope about our ability to modify AI goals.”
If we are using okay interpretability tools to understand whether the AI has the goal we intended, and to guide training, then I would consider that a fundamental advance over current standard training techniques.
I agree that goals would very likely be hit by some modifications during training, in combination with changes to other parts of the algorithm. The problem is shaping the goal to be a specific thing, not changing it in unpredictable or not-fully-understood ways.
“Many of the issues in this section are things that, if we’re not being total idiots, it seems we’ll get substantial warning about, e.g., AIs colluding with their AI monitors. That’s definitely a positive, though far from conclusive.”
I think that there is a lot of room for the evidence to be ambiguous and controversial, and for the obvious problems to look patchable. For this reason I’ve only got a little hope that people will panic at the last minute due to finally seeing the problems and start trying to solve exactly the right problems. On top of this, there’s the pressure of needing to “extract useful work to solve alignment” before someone less cautious builds an unaligned super-intelligence, which could easily lead to people seeing substantial warnings and pressing onward anyway.
On section 6:
“I think a couple of the arguments here continue to be legitimate, such as ‘Unclear that many goals realistically incentivise taking over the universe’, but I’m overall fine accepting this section.”
That argument isn’t really what it says on the tin, it’s saying something closer to “maybe taking over the universe is hard/unlikely and other strategies are better for achieving most goals under realistic conditions”. I buy this for many environments and levels of power, but it’s obviously wrong for AIs that have taking-over-the-universe in their easily-within-reach options. And that’s the sort of AI we get if it can undergo self-improvement.
Overall I think your comment is somewhat representative of what I see as the dominant cluster of views currently in the alignment community. (Which seems like a very reasonable set of beliefs and I don’t think you’re unreasonable for having them).
I read about half of this post when it came out. I didn’t want to comment without reading the whole thing, and reading the whole thing didn’t seem worth it at the time. I’ve come back and read it because Dan seemed to reference it in a presentation the other day.
The core interesting claim is this:
My conclusion will be that most of the items on Bostrom’s laundry list are not ‘convergent’ instrumental means, even in this weak sense. If Sia’s desires are randomly selected, we should not give better than even odds to her making choices which promote her own survival, her own cognitive enhancement, technological innovation, or resource acquisition.
This conclusion doesn’t follow from your arguments. None of your models even include actions that are analogous to the convergent actions on that list.
The non-sequential theoretical model is irrelevant to instrumental convergence, because instrumental convergence is about putting yourself in a better position to pursue your goals later on. The main conclusion seems to come from proposition 3, but the model there is so simple it doesn’t include any possibility of Sia putting itself in a better position for later.
Section 4 deals with sequential decisions, but for some reason mainly gets distracted by a Newcomb-like problem, which seems irrelevant to instrumental convergence. I don’t see why you didn’t just remove Newcomb-like situations from the model? Instrumental convergence will show up regardless of the exact decision theory used by the agent.
Here’s my suggestion for a more realistic model that would exhibit instrumental convergence, while still being fairly simple and having “random” goals across trajectories. Make an environment with 1,000,000 timesteps. Have the world state described by a vector of 1000 real numbers. Have a utility function that is randomly sampled from some Gaussian process (or any other high-entropy distribution over functions) on ℝ^1000. Assume there exist standard actions which directly make small edits to the world-state vector. Assume that there exist actions analogous to cognitive enhancement, making technology, and gaining resources. Intelligence can be used in the future to more precisely predict the consequences of actions on the future world state (you’d need to model a bounded agent for this). Technology can be used to increase the amount or change the type of effect your actions have on the world state. Resources can be spent in the future for more control over the world state. It seems clear to me that for the vast majority of the random utility functions, it’s very valuable to have more control over the future world state. So most sampled agents will take the instrumentally convergent actions early in the game and use the additional power later on.
The assumptions I made about the environment are inspired by the real world environment, and the assumptions I’ve made about the desires are similar to yours, maximally uninformative over trajectories.
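For what it’s worth, here’s a drastically scaled-down sketch of the resource-acquisition part of that model (all numbers and simplifications are mine: linear utilities instead of GP samples, and a single “invest” action standing in for enhancement/technology/resources):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 30, 10   # timesteps and state dimension, tiny stand-ins

def value(w, k):
    """Payoff from investing for k steps, then spending the accumulated
    'power' moving the state straight up the utility gradient.
    Utility is linear, u(x) = w . x, starting from x = 0."""
    power = 1.5 ** k          # each invest step multiplies future influence
    budget = power * (T - k)  # total movement available in remaining steps
    return budget * np.linalg.norm(w)

# Sample many random goals and ask: does the optimal policy invest first?
ks = [max(range(T), key=lambda k: value(w, k))
      for w in rng.normal(size=(1000, D))]
print(np.mean(np.array(ks) > 0))  # 1.0: every sampled goal invests before acting
```

The point of the toy version is just that “more control over the future state” pays off under (nearly) any draw of the utility function, which is all that instrumental convergence claims.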
Here’s a mistake some people might be making with mechanistic interpretability theories of impact (and some other things, e.g. how much neuroscience is useful for understanding AI or humans).
When there are multiple layers of abstraction that build up to a computation, understanding the low level doesn’t help much with understanding the high level.
Examples:
1. Understanding semiconductors and transistors doesn’t tell you much about programs running on the computer. The transistors can be reconfigured into a completely different computer, and you’ll still be able to run the same programs. To understand a program, you don’t need to be thinking about transistors or logic gates. Often you don’t even need to be thinking about the bit-level representation of data.
2. The computation happening in single neurons in an artificial neural network doesn’t have much relation to the computation happening at a high level. What I mean is that you can switch out activation functions, randomly connect neurons to other neurons, randomly share weights, or replace small chunks of network with some other differentiable parameterized function. And assuming the thing is still trainable, the overall system will still learn to execute a function that is on a high level pretty similar to whatever high-level function you started with.[1] (A quick sanity check of this is sketched just after this list.)
3. Understanding how neurons work doesn’t tell you much about how the brain works. Neuroscientists understand a lot about how neurons work. There are models that make good predictions about the behavior of individual neurons or synapses. I bet that the high level algorithms that are running in the brain are most naturally understood without any details about neurons at all. Neurons probably aren’t even a useful abstraction for that purpose.
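Here’s the sanity check promised in point 2, using off-the-shelf sklearn (my example; the activation function is the only “neuron-level” detail being swapped):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# A fixed "high-level function" for the networks to learn.
X = rng.uniform(-2, 2, size=(2000, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2
X_test = rng.uniform(-2, 2, size=(500, 2))
y_test = np.sin(X_test[:, 0]) + X_test[:, 1] ** 2

# Same task, same data, different unit-level implementations.
for act in ["relu", "tanh", "logistic"]:
    net = MLPRegressor(hidden_layer_sizes=(64, 64), activation=act,
                       max_iter=5000, random_state=0).fit(X, y)
    mse = np.mean((net.predict(X_test) - y_test) ** 2)
    print(f"{act:>8}: test MSE = {mse:.4f}")
# All three nets end up computing (approximately) the same high-level
# function, despite different neuron-level computations.
```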
Probably directions in activation space are also usually a bad abstraction for understanding how humans work, kinda analogous to how bit-vectors of memory are a bad abstraction for understanding how a program works.
Of course John has said this better.
[1] You can mess with the inductive biases of the training process this way, which might change the function that gets learned, but (my impression is) usually not that much if you’re just messing with activation functions.
You leave money on the table in all the problems where the most efficient-in-money solution involves violating your constraint. So there’s some selection pressure against you if selection is based on money.
We can (kinda) turn this into a money-pump by charging the agent a fee to violate the constraint on its behalf. Whenever it encounters such a situation, it pays you the fee and you do the killing.
Whether or not this counts as a money pump, I think it satisfies the reasons I actually care about money pumps, which are something like “adversarial agents can cheaply construct situations where I pay them money, but the world isn’t actually different”.
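A toy version of that selection pressure, to make the “leaving money on the table” point quantitative (setup and numbers are mine, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

best = rng.uniform(1, 2, size=n)             # payoff of the best solution
second = best - rng.uniform(0, 0.5, size=n)  # best constraint-respecting payoff
needs_violation = rng.random(n) < 0.3        # 30% of problems: the efficient
                                             # solution violates the constraint
fee = 0.1                                    # our price for doing the killing

unconstrained_total = best.sum()
# The principled agent takes the second-best solution, or pays us the fee
# whenever outsourcing the violation is cheaper than the shortfall.
constrained_total = np.where(needs_violation,
                             np.maximum(second, best - fee),
                             best).sum()

print((unconstrained_total - constrained_total) / n)  # avg money left on the table
```

Either way the constrained agent ends up strictly poorer in expectation, which is the selection pressure; the fee stream is the money-pump-flavored part.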
Explaining inner alignment to myself
I agree in the case of a model-free agent (although we think it scales up better to have the agent find its own adversarial examples).
For a model-based agent, I think the case is better, because you can implement the superpowers on its own world model (i.e. MuZero with the additional action of overwriting some or all of its world-model latent state during rollouts). Then the distribution shift that happens when capabilities get higher is much smaller, and depends mainly on how much the world model has changed its representation of the world state. This is a strictly smaller distribution shift than you would have otherwise, because it has ~eliminated the shift that comes from not being able to access most states during the lower-capability regime.
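To gesture at the mechanics (a deliberately dumb stand-in, nothing like a real MuZero implementation; the dynamics, reward head, and the "set" action are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for a learned world model: latent dynamics plus a reward head.
A = 0.3 * rng.normal(size=(8, 8))  # transition applied by an ordinary action
w = rng.normal(size=8)             # linear reward head on the latent state

def imagined_return(z, plan):
    """Roll a plan forward inside the world model. Besides ordinary
    actions, the plan may use the extra 'superpower' action ("set", z*)
    that directly overwrites the latent state mid-rollout."""
    total = 0.0
    for step in plan:
        if step[0] == "set":
            z = step[1]          # the added action: rewrite the latent
        else:
            z = np.tanh(A @ z)   # an ordinary world-model transition
        total += float(w @ z)    # reward predicted by the world model
    return total

z0 = rng.normal(size=8)
ordinary = [("act",)] * 5
superpowered = [("set", np.sign(w))] + [("act",)] * 4
print(imagined_return(z0, ordinary), imagined_return(z0, superpowered))
```

Training against rollouts like the second one is what lets you probe high-capability states without waiting for the agent to actually be able to reach them.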
I think the term is very reasonable and basically accurate, even more so with regard to most RL methods. It’s a good way of describing a training process without implying that the evolving system will head toward optimality deliberately. I don’t know a better way to communicate this succinctly, especially while not being specific about what local search algorithm is being used.
Also, evolutionary algorithms can be used to approximate gradient descent (with noisier gradient estimates), so it’s not unreasonable to use similar language about both.
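Concretely, the standard evolution-strategies estimator recovers a (smoothed, noisy) gradient from fitness evaluations alone; here’s a quick check on a toy objective (my example):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(thetas):                      # black-box "fitness", batched
    return -np.sum((thetas - 1.0) ** 2, axis=-1)

theta = np.zeros(5)
sigma, n = 0.1, 100_000

# ES gradient estimate: (1/sigma) * E[ f(theta + sigma*eps) * eps ]
eps = rng.normal(size=(n, 5))
es_grad = (eps * f(theta + sigma * eps)[:, None]).mean(axis=0) / sigma

print(es_grad)              # noisy estimate of the gradient...
print(-2 * (theta - 1.0))   # ...vs the exact gradient: [2 2 2 2 2]
```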
I’m not a huge fan of the way you imply that it was chosen for rhetorical purposes.
We don’t fully understand this comment.
Our current understanding is this:
The kernel matrix K (of shape n×n, for n training points) takes in two label vectors and outputs a real number: (y_1, y_2) ↦ y_1^T K^{-1} y_2. The real number is roughly the negative log prior probability of that label set.
We can make some orthogonal matrix R that transforms the labels y ↦ Ry, such that the real number output doesn’t change.
This is a transformation that keeps the label prior probability the same, for any label vector.
(Ry_1)^T K^{-1} (Ry_2) = y_1^T K^{-1} y_2 for all y_1, y_2 iff R^T K^{-1} R = K^{-1}, i.e. iff R commutes with K^{-1}, which implies R and K^{-1} share the same eigenvectors (with some additional assumption about K^{-1} having all-different eigenvalues, which we think should be true in this case).
Therefore we can just find the eigenvectors of K.
But what can R be? If K had multiple eigenvalues that were the same, then we could construct an R (a rotation within the degenerate eigenspace) that works for all y. But empirically, aren’t the eigenvalues of K all different?
So we are confused about that.
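For what it’s worth, here’s a numerical check of the commuting-matrices step, including one family of R’s that exists even with all-distinct eigenvalues (sign flips in K’s eigenbasis; this construction is ours, not from the comment we’re responding to):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6

# Random symmetric positive-definite "kernel" (distinct eigenvalues).
M = rng.normal(size=(n, n))
K = M @ M.T + n * np.eye(n)
_, V = np.linalg.eigh(K)

# An orthogonal R commuting with K: flip the signs of some eigenvectors.
S = np.diag(rng.choice([-1.0, 1.0], size=n))
R = V @ S @ V.T

Kinv = np.linalg.inv(K)
y = rng.normal(size=n)
print(y @ Kinv @ y)              # quadratic form on the original labels...
print((R @ y) @ Kinv @ (R @ y))  # ...is unchanged after transforming by R
```

So even with all-distinct eigenvalues there are 2^n discrete symmetries, though no continuous ones; repeated eigenvalues would additionally allow continuous rotations within the degenerate eigenspaces.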
Also we are confused about this: “without changing the loss function”. We aren’t sure how the loss function comes into it.
Also this: “training against a sinusoid” seems false? Or we really don’t know what this means.
I haven’t read every natural abstraction post yet, but I’m wondering whether this is a useful frame:
The relevant inductive bias for algorithms that learn natural abstractions might be to minimize expected working memory use (simultaneously with model complexity). This means the model will create labels for concepts that appear more frequently in the data distribution, with the optimal label length smaller for more commonly useful concepts.
In a prior over computable hypotheses, the hypotheses should be ordered by K-complexity(h) + AverageMemoryUsageOverRuntime(h).
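Writing that as a prior (my formalization of the sentence above, by analogy with the usual Solomonoff-style 2^{-K(h)} weighting):

$$p(h) \;\propto\; 2^{-\left(K(h) \,+\, \mathrm{AverageMemoryUsageOverRuntime}(h)\right)}$$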
I think this gives us the properties we want:
The hypothesis doesn’t compute details when they are irrelevant to its predictions.
The most memory efficient way to simulate the output of a gearbox uses some representation equivalent to the natural summary statistics. But if the system has to predict the atomic detail of a gear, it will do the low level simulation.
There exists a simple function from model-state to any natural concept.
Common abstract concepts have a short description length, and need to be used by the (low K-complexity) hypothesis program.
Most real world models approximate this prior, by having some kind of memory bottleneck. The more closely an algorithm approximates this prior, the more “natural” the set of concepts it learns.
I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it and skimmed the rest, and talked to you a little about it at EAG. It’s on my todo list to work through it properly, and I expect to actually do it because it’s the blocker on me rewriting and posting my “why the shutdown problem is hard” draft, which I really want to post.
The reasons I’m a priori not extremely excited are that it seems intuitively very difficult to avoid either of these issues:
I’d be surprised if an agent with (very) incomplete preferences was real-world competent. I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
It’s easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
It’s plausible you’ve avoided these problems but I haven’t read deeply enough to know yet. I think it’s easy for issues like this to be hidden (accidentally), so it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
The part where it works for a prosaic setup seems wrong (because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!)), but this isn’t what the shutdown problem is about so it isn’t an issue if it doesn’t apply directly to prosaic setups.
To me it feels like the natural place to draw the line is update-on-computations but updateless-on-observations. Because 1) it never disincentivizes thinking clearly, so commitment races bottom out in a reasonable way, and 2) it allows cooperation on common-in-the-real-world Newcomb-like problems.
It doesn’t do well in worlds with a lot of logical counterfactual mugging, but I think I’m okay with this? I can’t see why this situation would be very common, and if it comes up it seems that an agent that updates on computations can use some precommitment mechanism to take advantage of it (e.g. making another agent).
Am I missing something about why logical counterfactual muggings are likely to be common?
Looking through your PIBBSS report (which is amazing, very helpful), I intuitively feel the pull of Desideratum 4 (No existential regret), and also the intuition of wanting to treat logical uncertainty and empirical uncertainty in a similar way. But ultimately I’m so horrified by the mess that comes from being updateless-on-logic that being completely updateful on logic is looking pretty good to me.
(Great post, thanks)
This is one of my favorite posts because it gives me tools that I expect to use.
A little while ago, John described his natural latent result to me. It seemed cool, but I didn’t really understand how to use it and didn’t take the time to work through it properly. I played around with similar math in the following weeks though; I was after a similar goal, which was better ways to think about abstract variables.
More recently, John worked through the natural latent proof on a whiteboard at a conference. At this point I felt like I got it, including the motivation. A couple of weeks later I tried to prove it as an exercise for myself (with the challenge being that I had to do it from memory, rigorously, and including approximation). This took me two or three days, and the version I ended up with used a slightly different version of the same assumptions, and got weaker approximation results. I used the graphoid axioms, which are the standard (but slow and difficult) way of formally manipulating independence relationships (and I didn’t have previous experience using them).
This experience caused me to particularly appreciate this post. It turns lots of work into relatively little work.
I’m not sure how to implement the rule “don’t pay people to kill people”. Say we implement it as a utility function over world-trajectories, where any trajectory that involves a killing causally downstream of your actions gets MIN_UTILITY. This still makes probabilistic tradeoffs (accepting some chance of downstream killings in exchange for enough utility), so it’s probably not what we want. If we use negative infinity instead, then it can’t ever take actions in a large or uncertain world. We need to add the patch that the agent must have been aware, at the time of taking its actions, that the actions had a non-negligible chance of causing murder. I think these are vulnerable to blackmail, because you could threaten to cause murders that are causally downstream of its actions.
Maybe I’m confused and you mean “actions that pattern match to actually paying money directly for murder”, in which case it will just use a longer causal chain, or opaque companies that may-or-may-not-cause-murders will appear and trade with it.
If the ultimate patch is “don’t take any action that allows unprincipled agents to exploit you for having your principles”, then maybe there aren’t any edge cases. I’m confused about how to define “exploit” though.
The paperclip metaphor is not very useful if interpreted as “humans tell the AI to make paperclips, and it does that, and the danger comes from doing exactly what we said because we said a dumb goal”.
There is a similar-ish interpretation, which is good and useful, which is “if the AI is going to do exactly what you say, you have to be insanely precise when you tell it what to do, otherwise it will Goodhart the goal.” The danger comes from Goodharting, rather than humans telling it a dumb goal. The paperclip example can be used to illustrate this, and I think this is why it’s commonly used.
And in the first tweet he is referencing (with “inner alignment”) the fact that we will have very imprecise (think evolution-like) methods of communicating a goal to an AI-in-training.
So apparently he intended the metaphor to communicate that the AI-builders weren’t trying to set “make paperclips” as the goal; they were aiming for a more useful goal, and “make paperclips” happened to be the goal that it latched on to. Tiny molecular squiggles are better here because they’re a more realistic optimum of an imperfectly learned goal representation.