I do alignment research, mostly stuff that is vaguely agent foundations. Formerly on Vivek’s team at MIRI. Most of my writing before mid 2023 is not representative of my current views about alignment difficulty.
Jeremy Gillen
I made a mistake again. As described above, complete only pseudodominates incomplete.
But this is easily patched with the trick described in the OP. So we need the choice complete to make two changes to the downstream decisions: first, change decision 1 to always choose up (as before); second, change the distribution of decision 2 to {, }, because this keeps the probability of B constant. Fixed diagram:
Now the lottery for complete is {B: , A+: , A:}, and the lottery for incomplete is {B: , A+: , A:}. So overall, there is a pure shift of probability from A to A+.
[Edit 23/7: hilariously, I still had the probabilities wrong, so fixed them, again].
That is really helpful, thanks. I had been making a mistake, in that I thought that there was an argument from just “the agent thinks it’s possible the agent will run into a money pump” that concluded “the agent should complete that preference in advance”. But I was thinking sloppily and accidentally sometimes equivocating between pref-gaps and indifference. So I don’t think this argument works by itself, but I think it might be made to work with an additional assumption.
One intuition that I find convincing is that if I found myself at outcome A in the single sweetening money pump, I would regret having not made it to A+. This intuition seems to hold even if I imagine A and B to be of incomparable value.
In order to avoid this regret, I would try to become the sort of agent that never found itself in that position. I can see that if I always follow the Caprice rule, then it’s a little weird to regret not getting A+, because that isn’t a counterfactually available option (counterfacting on decision 1). But this feels like I’m being cheated. I think the reason that it feels like I’m being cheated is that I feel like getting to A+ should be a counterfactually available option.
One way to make it a counterfactually available option in the thought experiment is to introduce another choice before choice 1 in the decision tree. The new choice (0), is the choice about whether to maintain the same decision algorithm (call this incomplete), or complete the preferential gap between A and B (call this complete).
I think the choice complete statewise dominates incomplete. This is because the choice incomplete results in a lottery {B: , A+: , A:} for .[1] However, the choice complete results in the lottery {B: , A+: , A:0}.
Do you disagree with this? I think this allows us to create a money pump, by charging the agent $ for the option to complete its own preferences.
The statewise pseudodominance relation is cyclic, so the Statewise Pseudodominance Principle would lead to cyclic preferences.
This still seems wrong to me, because I see lotteries as being an object whose purpose is to summarize random variables and outcomes. So it’s weird to compare lotteries that depend on the same random variables (they are correlated), as if they are independent. This seems like a sidetrack though, and it’s plausible to me that I’m just confused about your definitions here.
[1] Letting be the probability that the agent chooses 2A+, and the probability the agent chooses 2B (following your comment above); the corresponding probability for choice 1 is defined similarly.
The reasons I don’t find this convincing:
- The examples of human cognition you point to are the dumbest parts of human cognition. They are the parts we need to override in order to pursue non-standard goals. For example, in political arguments, the adaptations that we execute that make us attached to one position are bad. They are harmful to our goal of implementing effective policy. People who are good at finding effective government policy are good at overriding these adaptations.
- “All these are part of the arbitrary, intrinsically-complex, outside world.” This seems wrong. The outside world isn’t that complex, and reflections of it are similarly not that complex. Hardcoding knowledge is a mistake, of course, but understanding a knowledge representation and updating process needn’t be that hard.
- “They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity.” I agree with this, but it’s also fairly obvious. The difficulty of alignment is building these in such a way that you can predict that they will continue to work, despite the context changes that occur as an AI scales up to be much more intelligent.
- “These bitter lessons were taught to us by deep learning.” It looks to me like deep learning just gave most people an excuse to not think very much about how the machine is working on the inside. It became tractable to build useful machines without understanding why they worked.
It sounds like you’re saying that classical alignment theory violates lessons like “we shouldn’t hardcode knowledge, it should instead be learned by very general methods”. This is clearly untrue, but if this isn’t what you meant then I don’t understand the purpose of the last quote. Maybe a more charitable interpretation is that you think the lesson is “intelligence is irreducibly complex and it’s impossible to understand why it works”. But this is contradicted by the first quote. The meta-methods are a part of a mind that can and should be understood. And this is exactly the topic that much of agent foundations research has been about (with a particular focus on the aspects that are relevant to maintaining stability through context changes).
(My impression was that this is also what shard theory is trying to do, except with less focus on stability through context changes, much less emphasis on fully-general outcome-directedness, and more focus on high-level steering-of-plans-during-execution instead of the more traditional precise-specification-of-outcomes).
A very similar strategy is listed as a borderline example of a pivotal act, on the pivotal act page:
I intended for my link to point to the comment you linked to, oops.
I’ve responded here, I think it’s better to just keep one thread of argument, in a place where there is more necessary context.
(sidetrack comment, this is not the main argument thread)
Think about your own preferences.
Let A be some career as an accountant, A+ be that career as an accountant with an extra $1 salary, and B be some career as a musician. Let p be small. Then you might reasonably lack a preference between 0.5p(A+)+(1-0.5p)(B) and A. That’s not instrumentally irrational.
I find this example unconvincing, because any agent that has finite precision in their preference representation will have preferences that are a tiny bit incomplete in this manner. As such, a version of myself that could more precisely represent the value-to-me of different options would be uniformly better than myself, by my own preferences. But the cost is small here. The amount of money I’m leaving on the table is usually small, relative to the price of representing and computing more fine-grained preferences.
I think it’s really important to recognize the places where toy models can only approximately reflect reality, and this is one of them. But it doesn’t reduce the force of the dominance argument. The fact that humans (or any bounded agent) can’t have exactly complete preferences doesn’t mean that it’s impossible for them to be better by their own lights.
Think about incomplete preferences on the model of imprecise exchange rates.
I appreciate you writing out this more concrete example, but that’s not where the disagreement lies. I understand partially ordered preferences. I didn’t read the paper though. I think it’s great to study or build agents with partially ordered preferences, if it helps get other useful properties. It just seems to me that they will inherently leave money on the table. In some situations this is well worth it, so that’s fine.
The general principle that you appeal to (If X is weakly preferred to or pref-gapped with Y in every state of nature, and X is strictly preferred to Y in some state of nature, then the agent must prefer X to Y) implies that rational preferences can be cyclic. B must be preferred to p(B-)+(1-p)(A+), which must be preferred to A, which must be preferred to p(A-)+(1-p)(B+), which must be preferred to B.
No, hopefully the definition in my other comment makes this clear. I believe you’re switching the state of nature for each comparison, in order to construct this cycle.
It seems we define dominance differently. I believe I’m defining it in a similar way to “uniformly better” here. [Edit: previously I put a screenshot from that paper in this comment, but translating from there adds a lot of potential for miscommunication, so I’m replacing it with my own explanation in the next paragraph, which is more tailored to this context.]
A strategy outputs a decision, given a decision tree with random nodes. With a strategy plus a record of the outcomes of all random nodes, we can work out the final outcome reached by that strategy (assuming the strategy is deterministic for now). Let’s write this as Outcome(strategy, environment_random_seed). Now I think that we should consider a strategy s to dominate another strategy s* if for all possible environment_random_seeds, Outcome(s, seed) ≥ Outcome(s*, seed), and for some seed*, Outcome(s, seed*) > Outcome(s*, seed*). (We can extend this to stochastic strategies, but I want to avoid that unless you think it’s necessary, because it will reduce clarity.)
In other words, a strategy is better if it always turns out to do “equally” well or better than the other strategy, no matter the state of nature. By this definition, a strategy that chooses A at the first node will be dominated.
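To make this concrete, here is a minimal sketch of the dominance check I have in mind (illustrative code of my own; `outcome`, `weakly_preferred`, and `strictly_preferred` are hypothetical stand-ins for the outcome function and the preference partial order over final outcomes):

```python
def dominates(s, s_star, seeds, outcome, weakly_preferred, strictly_preferred):
    """Return True iff strategy s dominates strategy s_star.

    outcome(strategy, seed) is the final outcome reached by a deterministic
    strategy when the environment's random nodes resolve according to seed.
    Dominance: s does at least as well in every state of nature, and strictly
    better in at least one.
    """
    at_least_as_good = all(
        weakly_preferred(outcome(s, seed), outcome(s_star, seed)) for seed in seeds
    )
    strictly_better_somewhere = any(
        strictly_preferred(outcome(s, seed), outcome(s_star, seed)) for seed in seeds
    )
    return at_least_as_good and strictly_better_somewhere
```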
Relating this to your response:
We say that a strategy is dominated iff it leads to a lottery that is dispreferred to the lottery led to by some other available strategy. So if the lottery 0.5p(A+)+(1-0.5p)(B) isn’t preferred to the lottery A, then the strategy of choosing A isn’t dominated by the strategy of choosing 0.5p(A+)+(1-0.5p)(B). And if 0.5p(A+)+(1-0.5p)(B) is preferred to A, then the Caprice-rule-abiding agent will choose 0.5p(A+)+(1-0.5p)(B).
I don’t like that you’ve created a new lottery at the chance node, cutting off the rest of the decision tree from there. The new lottery wasn’t in the initial preferences. The decision about whether to go to that chance node should be derived from the final outcomes, not from some newly created terminal preference about that chance node. Your dominance definition depends on this newly created terminal preference, which isn’t a definition that is relevant to what I’m interested in.
I’ll try to back up and summarize my motivation, because I expect any disagreement is coming from there. My understanding of the point of the decision tree is that it represents the possible paths to get to a final outcome. We have some preference partial order over final outcomes. We have some way of ranking strategies (dominance). What we want out of this is to derive results about the decisions the agent must make in the intermediate stage, before getting to a final outcome.
If it has arbitrary preferences about non-final states, then its behavior is entirely unconstrained and we cannot derive any results about its decisions in the intermediate stage.
So we should only use a definition of dominance that depends on final outcomes. Then any strategy that doesn’t always choose B at decision node 1 will be dominated by a strategy that does, according to the original preference partial order.
(I’ll respond to the other parts of your response in another comment, because it seems important to keep the central crux debate in one thread without cluttering it with side-tracks).
I find the money pump argument for completeness to be convincing.
The rule that you provide as a counterexample (the Caprice rule) is one that gradually completes the preferences of the agent as it encounters a variety of decisions. You appear to agree that this is the case. This isn’t a large problem for your argument. The big problem is that when there are lots of random nodes in the decision tree, such that the agent might encounter a wide variety of potentially money-pumping trades, the agent needs to complete its preferences in advance, or risk its strategy being dominated.
You argue with John about this here, and John appears to have dropped the argument. It looks to me like your argument there is wrong, at least when it comes to situations where there are sufficient assumptions to talk about coherence (which is when the preferences are over final outcomes, rather than trajectories).
If it has no preference, neither choice will constitute a dominated strategy.
I think this statement doesn’t make sense. If it has no preference between choices at node 1, then it has some chance of choosing outcome A. But if it does so, then that strategy is dominated by the strategy that always chooses the top branch, and chooses A+ if it can. This is because 50% of the time, it will get a final outcome of A when the dominating strategy gets A+, and otherwise the two strategies give incomparable outcomes.
I’m assuming “dominated” means: there is another available strategy that gives a final outcome that is incomparable or strictly preferred in the partial order of preferences, for all possible settings of the random variables (and strictly preferred for at least one setting). Maybe my definition is wrong? But it seems like this is the definition I want.
We might have developed techniques to specify simple, bounded object-level goals. Goals that can be fully specified using very simple facts about reality, with no indirection or meta-level complications. If so, we can probably use inner-aligned agents to assist with some relatively well-specified engineering or scientific problems. Specification mistakes at that point could easily result in irreversible loss of control, so it’s not the kind of capability I’d want lots of people to have access to.
To move past this point, we would need to make some engineering or scientific advances that would be helpful for solving the problem more permanently. Human intelligence enhancement would be a good thing to try. Maybe some kind of AI defence system to shut down any rogue AI that shows up. Maybe some monitoring tech that helps governments co-ordinate. These are basically the same as the examples given on the pivotal act page.
Yes, if you have a very high bar for assumptions or the strength of the bound, it is impossible.
Fortunately, we don’t need a guarantee this strong. One research pathway is to weaken the requirements until they no longer cause a contradiction like this, while maintaining most of the properties that you wanted from the guarantee. For example, one way to weaken the requirements is to require that the agent provably does well relative to what is possible for agents of similar runtime. This still gives us a reasonable guarantee (“it will do as well as it possibly could have done”) without requiring that it solve the halting problem.
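As a rough formalization of that weakened requirement (my own notation, a sketch rather than a standard theorem), we ask for near-optimality relative to agents with a similar runtime budget rather than unconditional optimality:

```latex
% Illustrative only: \Pi_T is the class of policies computable within runtime budget T,
% \mathcal{M} is the assumed environment class, and \varepsilon is a slack term.
\forall \mu \in \mathcal{M}:\quad
\mathbb{E}_{\mu}\!\left[U(\pi)\right] \;\ge\; \sup_{\pi' \in \Pi_T} \mathbb{E}_{\mu}\!\left[U(\pi')\right] \;-\; \varepsilon .
```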
A dramatic advance in the theory of predicting the regret of RL agents. So given a bunch of assumptions about the properties of an environment, we could upper bound the regret with high probability. Maybe have a way to improve the bound as the agent learns about the environment. The theory would need to be flexible enough that it seems like it should keep giving reasonable bounds if the agent is doing things like building a successor. I think most agent foundations research can be framed as trying to solve a sub-problem of this problem, or a variant of this problem, or understand the various edge cases.
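To illustrate the kind of statement such a theory would output, here is the shape of a standard high-probability regret bound for episodic tabular RL (the exact exponents and constants vary by algorithm and by the assumptions placed on the environment):

```latex
% With probability at least 1-\delta, after T episodes of horizon H in an MDP with
% S states and A actions, an optimistic algorithm satisfies roughly
\mathrm{Regret}(T) \;=\; \sum_{t=1}^{T}\bigl(V^{*}(s_{t,1}) - V^{\pi_t}(s_{t,1})\bigr)
\;\le\; \tilde{O}\!\left(\sqrt{H^{3} S A T}\right).
```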
If we can empirically test this theory in lots of different toy environments with current RL agents, and the bounds are usually pretty tight, then that’d be a big update for me. Especially if we can deliberately create edge cases that violate some assumptions and can predict when things will break from which assumptions we violated.
(although this might not bring doom below 25% for me, depends also on race dynamics and the sanity of the various decision-makers).
I think your summary is a good enough quick summary of my beliefs. The minutia that I object to is how confident and specific lots of parts of your summary are. I think many of the claims in the summary can be adjusted or completely changed and still lead to bad outcomes. But it’s hard to add lots of uncertainty and options to a quick summary, especially one you disagree with, so that’s fair enough.
(As a side note, that paper you linked isn’t intended to represent anyone else’s views, other than mine and Peter’s, and we are relatively inexperienced. I’m also no longer working at MIRI.)

I’m confused about why your <20% isn’t sufficient for you to want to shut down AI research. Is it because the benefits outweigh the risks, or because we’ll gain evidence about potential danger and can shut down later if necessary?
I’m also confused about why being able to generate practical insights about the nature of AI or AI progress is something that you think should necessarily follow from a model that predicts doom. I believe something close enough to (1) from your summary, but I don’t have much idea (above general knowledge) of how the first company to build such an agent will do so, or when they will work out how to do it. One doesn’t imply the other.
I’m enjoying having old posts recommended to me. I like the enriched tab.
Doesn’t the futarchy hack come up here? Contractors will be betting that competitors’ timelines and costs will be high, in order to get the contract.
This doesn’t feel like it resolves that confusion for me, I think it’s still a problem with the agents he describes in that paper.
The causes are just the direct computation of P(x) for small values of x. If they were arguments that only had bearing on small values of x and implied nothing about larger values (e.g. an adversary selected some x to show you, but filtered for x such that P(x) holds), then it makes sense that this evidence has no bearing on ∀x P(x). But when there was no selection or other reason that the argument only applies to small x, then to me it feels like the existence of the evidence (even though already proven/computed) should still increase the credence of the forall.
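A toy Bayesian model of why this feels right to me (my own illustration, not from the original exchange): if the checked instances are guaranteed to hold under the universal claim but might fail otherwise, then verifying them raises the credence of the forall, provided the instances weren’t adversarially filtered.

```python
def posterior_forall(prior_forall, p_checks_pass_given_not_forall):
    """Bayes update on the evidence 'P(x) verified for the small x we checked'.

    Under the forall hypothesis the checks pass with probability 1; under its
    negation they pass with some probability < 1 (assuming no adversarial filtering).
    """
    p_evidence = prior_forall + (1 - prior_forall) * p_checks_pass_given_not_forall
    return prior_forall / p_evidence

# Example: prior 0.3 that the forall holds; if it doesn't, the small-x checks would
# still all pass half the time. Verifying them moves credence from 0.30 to ~0.46.
print(posterior_forall(0.3, 0.5))
```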
I sometimes name your work in conversation as an example of good recent agent foundations work, based on having read some of it and skimmed the rest, and talked to you a little about it at EAG. It’s on my todo list to work through it properly, and I expect to actually do it because it’s the blocker on me rewriting and posting my “why the shutdown problem is hard” draft, which I really want to post.
The reasons I’m a priori not extremely excited are that it seems intuitively very difficult to avoid either of these issues:
- I’d be surprised if an agent with (very) incomplete preferences was real-world competent. I think it’s easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
- It’s easy to shuffle around the difficulty of the shutdown problem, e.g. by putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
It’s plausible you’ve avoided these problems but I haven’t read deeply enough to know yet. I think it’s easy for issues like this to be hidden (accidentally), so it’ll take a lot of effort for me to read properly (but I will, hopefully in about a week).
The part where it works for a prosaic setup seems wrong (because of inner alignment issues (although I see you cited my post in a footnote about this, thanks!)), but this isn’t what the shutdown problem is about so it isn’t an issue if it doesn’t apply directly to prosaic setups.
I would be excited to read this / help with a draft.
We can meet in person one afternoon and work out some cruxes and write them up?
Is the claim here that the AI performs well on ~all the human-checkable tasks and then reverts to being absolutely useless or sabotaging on the hard stuff?
Yes, approximately, as I believe you and I are capable of doing. [...PhD student crux]
The analogy is strained due to not being able to gradient update my brain with arbitrary training data. It’s pretty unclear to me if I could pull off deception like this, it seems pretty hard imo.
Are you saying you buy my side of the PhD student crux, but that it isn’t closely analogous because you think we have more fine-grained control over the thoughts or goals of the AI? And my guess is that your second sentence gives a second reason you disagree: that it looks difficult for a research AI to know when to perform well and when to perform less well? I think we might be miscommunicating about this, because it doesn’t look to me like the sort of situation where the “deception” is a relatively difficult task.
I think if you took a human research team, stuck them on an island, and had them get up to speed in some subfield they’re not part of, they would be unable to distinguish between research that had just been done on the mainland or is just past the frontier. This just seems pretty hard.
I agree with this. But the research we need isn’t just past the frontier. At some point not that far past the frontier, you get to a situation where the research team can say “given our guesses about mainland researchers and their time constraints, we’d put <1% on them having made discovery X {or explored idea Y, or done experiment Z}. Let’s {delay giving it to them, not work on that direction, insert a subtle false assumption, instead share our 8th best idea, just say we’re stuck, communicate it poorly}”.
I think this example is closer to what we’ll get with AIs rather than the bar being at “human level” it’s more like “what has been accomplished by the humans+AIs in the outside world, for which I’ve received untrustworthy information”.
Sure, kinda, but the important part here is that a trapped agentic AI can selectively mislead you about the most important {knowledge, tasks, strategies} that you aren’t capable of doing with just humans. If you’re capable of doing it with humans, or if you’re close to being able to do it with humans, then why not just do that? Seems safer.
And every time the AIs do useful work we can add this to our supervised training corpus.
Only if you spend the human resources to verify it. Are you aiming for 30x research acceleration? Because this by itself might be enough to limit you below that level. (And it also directly provides a way for the AI to tell “what time it is” by sending messages to future iterations, and plausibly also provides a bunch of information about human research capability and capacity.)
One version of this is: you have GPT-5 (potentially untrustworthy) do some research, you have humans carefully check this research using the best available tools (e.g., AI debate / other ‘scalable oversight’), and then you use this research as supervised / labeled data for training GPT-6 to do good work.
Mostly my objection to this is above. On scalable oversight, I buy that it might be possible to get 20% improvements in the accuracy of your evaluations of (adversarially generated!) research, compared to a team of humans trying really hard to evaluate research, given a human-level research AI. This is so far from being enough. Are you predicting much better scalable oversight that scales much further than I am predicting it will? (I’m assuming you’re using “GPT-5/6” to represent “some powerful future consequentialist AI that can do research”, because otherwise we should be having a different argument).
While it’s probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
It’d be cool if there were a demonstration of manipulating internal activations during training, in a way that fixes non-trivial unwanted OOD behavior. If such techniques existed, and also worked on consequentialist agents (I expect would be harder), I would consider them very useful tools that would help with building an aligned agent.
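For reference, the inference-time version of this kind of intervention is easy to sketch; the open question the comment points at is whether something similar can be applied during training to fix non-trivial unwanted OOD behavior. A minimal, hypothetical PyTorch-style sketch (names are my own):

```python
import torch

def add_steering_hook(layer: torch.nn.Module, direction: torch.Tensor, alpha: float = 1.0):
    """Shift a layer's output activations along a chosen direction at inference time.

    `direction` is a vector in the layer's activation space (e.g. the difference of mean
    activations on two contrasting prompt sets). Assumes the layer returns a plain tensor;
    transformer blocks that return tuples would need a small adaptation.
    """
    def hook(module, inputs, output):
        return output + alpha * direction

    return layer.register_forward_hook(hook)  # call .remove() on the handle to undo
```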
While it’s probably non-trivial, I expect you could use this internal activation to train models which do a better job paying attention to the tokens you want them to.
This is very plausible, but I think you’re stretching the analogy between LMs and future research AIs way too far. “Paying attention to the tokens we want them to” doesn’t add much/any control over the properties we want control over (compared to e.g. changing the input string to remove information we don’t want it paying attention to).
Similarly, I think it’s probable (but far from guaranteed) that deceptively aligned models are representing training/deployment in their activations, or that sandbagging models are representing human-level / super-human-level difficulty for tasks. If so, it is possible in principle to do all kinds of things with these activations which may help us, e.g., get the model to solve tasks it otherwise would have intentionally failed at, or give us hints about the accuracy of a research result. I expect this is difficult implementation-wise, but still a reason for hope. IMO activation engineering is off to a good start.
I think you might be centrally thinking about an LLM completing a few paragraphs, where I’m thinking of an AI doing a large scale scientific project? I buy that for many concepts X, you can work out the “neural correlates of X” in your AI. But that kind of thing is noisy (and plausibly can be optimized against by the AI) unless you have a deep understanding of what you are measuring. And optimizing against such imperfect metrics obviously wouldn’t do much beyond destroying the metric. I do think research in this direction has a chance of being useful, but mainly by being upstream of much better understanding.
By leaning more on generalization, I mean leaning more on the data efficiency thing
Sorry for misinterpreting you, but this doesn’t clarify what you meant.
also weak-to-strong generalization ideas.
I think I don’t buy the analogy in that paper, and I don’t find the results surprising or relevant (by my current understanding, after skimming it). My understanding of the result is “if you have a great prior, you can use it to overcome some label noise and maybe also label bias”. But I don’t think this is very relevant to extracting useful work from a misaligned agent (which is what we are talking about here), and based on the assumptions they describe, I think they agree? (I just saw appendix G, I’m a fan of it, it’s really valuable that they explained their alignment plan concisely and listed their assumptions).
I could imagine starting with a deceptively aligned AI whose goal is “Make paperclips unless being supervised which is defined as X, Y, and Z, in which case look good to humans”. And if we could change this AI to have the goal “Make paperclips unless being supervised which is defined as X, Y, and Q, in which case look good to humans”, that might be highly desirable. In particular, it seems like adversarial training here allows us to expand the definition of ‘supervision’, thus making it easier to elicit good work from AIs (ideally not just ‘looks good’).
If we can tell we have such an AI, and we can tell that our random modifications are affecting the goal, and the change is roughly one that helps us rather than one that changes many things that might or might not be helpful, then this would be a nice situation to be in.
I don’t feel like I’m talking about AIs which have “taking-over-the-universe in their easily-within-reach options”. I think this is not within reach of the current employees of AGI labs, and the AIs I’m thinking of are similar to those employees in terms of capabilities, but perhaps a bit smarter, much faster, and under some really weird/strict constraints (control schemes).
Section 6 assumes we have failed to control the AI, so it is free of weird/strict constraints, and free to scale itself up, improve itself, etc. So my comment is about an AI that no longer can be assumed to have human-ish capabilities.
Do you have recordings? I’d be keen to watch a couple of the ones I missed.
I think this post is great, I’ll probably reference it next time I’m arguing with someone about AI risk. It’s a good summary of the standard argument and does a good job of describing the main cruxes and how they fit into the argument. I’d happily argue for 1, 2, 3, 4, and 6, and I think my disagreements with most people can be framed as disagreements about these points. I agree that if any of these are wrong, there isn’t much reason to be worried about AI takeover, as far as I can see.
One pet peeve of mine is when people call something an assumption, even though in that context it’s a conclusion. Just because you think the argument was insufficient to support it doesn’t make it an assumption. E.g. in the second-to-last paragraph:
There’s something wrong with the footnotes. [17] is incomplete and [17-19] are never referenced in the text.