“Act-based approval-directed agents”, for IDA skeptics
Summary / tl;dr
In the 2010s, Paul Christiano built an extensive body of work on AI alignment—see the “Iterated Amplification” series for a curated overview as of 2018.
One foundation of this program was an intuition that it should be possible to build “act-based approval-directed agents” (“approval-directed agents” for short). These AGIs, for example, would not lie to their human supervisors, because their human supervisors wouldn’t want them to lie, and these AGIs would only do things that their human supervisors would want them to do. (It sounds much simpler than it is!)
Another foundation of this program was a set of algorithmic approaches, Iterated Distillation and Amplification (IDA), that supposedly offers a path to actually building these approval-directed AI agents.
I am (and have always been) a skeptic of IDA: I just don’t think any of those algorithms would work very well.[1]
But I still think there might be something to the “approval-directed agents” intuition. And we should be careful not to throw out the baby with the bathwater.
So my goal in this post is to rescue the “approval-directed agents” idea from its IDA baggage. Here’s the roadmap:
In Section 1, I offer a high-level picture of what we’re hoping to get out of “approval-directed agents”, following a discussion by Abram Demski (2018).
In Section 2, I walk through an example of how this vision can actually manifest in the context of brain-like AGI, a different AI paradigm which (unlike IDA) can definitely scale to superintelligence. I offer an everyday example of having role-models / idols who celebrate honesty, and correspondingly taking pride in one’s self-image as an honest person. In terms of brain algorithms, I relate this phenomenon to (what I call) “Approval Reward”, a hypothesized component of the human brain’s innate reinforcement learning reward function.
1. The easy and hard problems of wireheading, observation-utility agents, and approval-directed agents
In “Stable Pointers to Value II: Environmental Goals” (2018), Abram Demski describes the “observation-utility agents” trick[2] to solve (what he calls) “the easy problem of wireheading”.
(a) If we set up an agent to maximize the output of a utility function, it will edit the utility function to give a high output. (b) This problem is solved in “observation-utility agents” by using the current utility function to evaluate plans. Then the plan of “edit the utility function to output a higher value” generally gets a low score according to the current (not-edited) utility function, so it won’t happen.
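To make the (a)-versus-(b) distinction concrete, here is a minimal toy sketch in Python. (This is just my own illustration, not something from Abram's post; the paperclip utility function and the two candidate plans are invented for the example.)

```python
# Toy illustration (mine, not from the original post). The paperclip utility
# function and the two candidate plans below are invented for the example.

def current_utility(world_state: dict) -> float:
    """The agent's current (not-edited) utility function: it values paperclips."""
    return float(world_state.get("paperclips", 0))

# Each plan: (description, predicted world state, utility function in effect after the plan).
plans = [
    ("make 10 paperclips", {"paperclips": 10}, current_utility),
    ("edit my utility function to always output 1e9", {"paperclips": 0},
     lambda world_state: 1e9),
]

# (a) Naive agent: score each plan by the output of whatever utility function
# would be in place after the plan executes, so the self-editing plan wins.
naive_scores = {desc: utility_after(state) for desc, state, utility_after in plans}

# (b) Observation-utility agent: score each plan by the *current* utility function
# applied to the predicted world state, so the self-editing plan scores zero.
observation_utility_scores = {desc: current_utility(state) for desc, state, _ in plans}

print("(a) picks:", max(naive_scores, key=naive_scores.get))
print("(b) picks:", max(observation_utility_scores, key=observation_utility_scores.get))
```

The only difference between the two agents is which utility function gets applied at plan-evaluation time: whatever function would exist after the plan runs, versus the function the agent has right now.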
Abram then suggests that we can think of Paul Christiano’s idea of “approval-directed agents” as a second, analogous move in this same direction:
(c) If we set up an agent to maximize the output of a human evaluation, it will manipulate or deceive the human to give a high output. (d) This problem is solved in “approval-directed agents” by using the current human to evaluate plans. Then the plan of “brainwash the human” generally gets a low score according to the current (not-brainwashed) human, so it won’t happen.
Abram calls the (c) failure mode “the hard problem of wireheading”; it includes all the ways to manipulate and deceive the human. The hope would be that (d) is an elegant solution to “the hard problem of wireheading” in (c), just as (b) is an elegant solution to “the easy problem of wireheading” in (a). After all, they have an obvious structural similarity.
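Here is the analogous toy sketch for (c) versus (d), again just my own illustration with invented stand-in functions. The difference is whether plans are scored by the approval signal the human would emit after the plan runs, or by the current (not-yet-manipulated) human's judgment of the plan itself.

```python
# Toy illustration (mine, not Abram's or Paul's). The "approval" functions below
# are invented stand-ins for a real human (or a learned model of one).

def current_human_approval(plan_description: str) -> float:
    """How much the current (not-manipulated) human approves of the plan itself."""
    return -100.0 if "brainwash" in plan_description else 10.0

def future_reported_approval(plan_description: str) -> float:
    """The approval signal the human would emit after the plan has been carried out."""
    return 1000.0 if "brainwash" in plan_description else 10.0

plans = [
    "do the task the human actually asked for",
    "brainwash the human into signaling maximal approval",
]

# (c) Maximize the human's future evaluation: the manipulative plan wins.
plan_c = max(plans, key=future_reported_approval)

# (d) Approval-directed agent: the current human (or a model of them) evaluates
# each plan now, before anything is executed, so the manipulative plan loses.
plan_d = max(plans, key=current_human_approval)

print("(c) picks:", plan_c)
print("(d) picks:", plan_d)
```

Of course, in (d) the "current human" would in practice have to be some learned substitute for the human, which is exactly where the challenges below come in.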
Seems promising on paper, but how would we make these work in practice?
For the “hard problem” (d vs c) in particular, Abram mentions two challenges:
First, there’s an alignment issue. My diagram above obviously can’t be taken literally, with a literal human inspecting AI plans. For one thing, inspecting even one plan would be difficult and time-consuming at best, and impossible at worst, because the plans will be defined in terms of the AI’s inscrutable world-model. For another, we would probably need billions or trillions of plan-evaluation steps to happen, far beyond our ability to hire human plan-evaluators. After all, even a single human going about his day will entertain multiple plans per second, i.e. millions per year, and we’ll need a great many person-years of AGI labor if we want to move the needle on the AI x-risk problem.
So instead of an actual human supervisor in (d), we need some learned substitute. How do we get it? And more to the point, what happens when the learned substitute comes apart from the ground truth? That’s the first problem.
Second, there’s a capabilities (“alignment tax”) issue. The human supervisor is out in the environment in (c), but the (learned imitation of the) human supervisor is brought inside the AI’s thought process in (d). So in (d), much more than (c), we seem to be deeply constrained by the human supervisor’s competence and knowledge. For example, if the AI is supposed to be inventing futuristic nanotechnology, it might be entertaining plans like “What if I try exploring metastable covalent plasma flux resonances?” Alas, we can’t rely on the (learned imitation of the) current human supervisor to evaluate that plan, because the current human supervisor has no idea what the heck “metastable covalent plasma flux resonances” even means. So, how is this supposed to work?
Paul Christiano’s IDA-related 2010s research offers various ideas for addressing these two problems, which I basically don’t buy (see above). But here’s a quite different perspective on the problem:
2. If human desires are a case study of the “observation-utility agents” trick, then human pride is a case study of the “approval-directed agents” trick
For “the easy problem of wireheading” (b vs a), I argued in my Intro series §9.5.2 (2022) that human brains are more-or-less “observation-utility agents” in the above sense. (And indeed, plenty of humans would choose to not wirehead, given the choice.)
Well, now four years later, I’m proposing that human brains also provide an illustration of the “approval-directed agents” trick—specifically, when humans act out of pride in our self-image.
Consider a person who takes pride in their honesty. When they think of themselves as being honest, they feel pride, which comes along with an immediate squirt of pleasant feelings in their brain. I claim that this squirt of pleasant feelings is the result of an innate drive (a.k.a. primary reward) that I call “Approval Reward”. See my post Social drives 2: “Approval Reward”, from norm-enforcement to status-seeking (2025) (especially §3), for more on this everyday phenomenon, and see my post Neuroscience of human social instincts: a sketch (2024) for gory details of how I think this mechanism works in the brain. (I.e., how does the brain know which thoughts / plans do or don’t merit a squirt of Approval Reward?)[3]
I claim that if a person (call him Alex) has pride in his honesty, then upstream of that is at least one person whom Alex greatly admires, who thinks that it’s good to be honest and bad to be dishonest.[4] I’ll pick the name “Hugh” for this honesty-loving person whom Alex admires, and I’ll assume for now that they’re an actual living person (as opposed to a cartoon character, or Jesus, etc.).
Alex would love to get Hugh’s actual approval in real life—indeed, getting a few words of approval from someone you greatly admire can be a life-changing experience.[5] But Alex would not want to get Hugh’s approval by deceptively tricking Hugh into thinking that Alex is honest!
Yes, the plan to trick Hugh would impress Hugh in the future, when this plan comes to fruition. But merely entertaining this plan is appalling right now to imagined-Hugh, who is living rent-free in Alex’s brain, as Alex thinks about what to do. So this plan-to-deceive seems bad, and Alex won’t do it.[6]
Thus, imagined-Hugh has inserted himself into the plan-evaluation slot of Alex’s optimization loop, and this is working to prevent real-Hugh-manipulating strategies! It’s just like (d) in the diagram above! Here’s the corresponding diagram:
A real phenomenon in human psychology that parallels the (c-d) diagram above. If Hugh, your idol, prizes honesty, then you’re unlikely to deceptively trick him into believing that you’re an honest guy, even if you’re extremely confident that you could pull it off, and even if you care a great deal about what he thinks of you.
So now we have human analogies for not only the “observation-utility agents” trick, but also the “approval-directed agents” trick. And that’s great! It elevates these ideas from “things that sound maybe plausible” to “plans which are clearly compatible in practice with powerful general intelligence”, which moreover I feel competent to analyze in detail on a nuts-and-bolts level.
Thus, if the above dynamic is a thing that can happen in human brains, then maybe something like it is likewise possible in brain-like AGI! For example, perhaps “Alex” is a stand-in for the AI, “Hugh” for the human supervisor, and “honesty” for (perhaps) some broader bundle of honesty, loyalty, obedience, forthrightness, integrity, etc. (cf. Paul’s broad notion of corrigibility).
…And then what? What exactly do we learn, from this human analogy, about what might go right or wrong? Is there a way to solve those two problems listed in §1 above? Is there an end-to-end path to safe & beneficial AGI somewhere in here?
My answer is: I don’t know! More on this in future posts, hopefully. :-)
Thanks Seth Herd for critical comments on an earlier draft.
[1] More specifically: Sooner or later, one way or another, someone will invent radically superintelligent AI, and that AI might kill everyone. That’s the big problem that I’m interested in. And I think these IDA AIs would not be powerful enough to play an important role in either that problem or its solution. At most they’d be relevant as background context, similar to internet search engines or PyTorch or World War III etc.
…And why do I think that IDAs won’t be more powerful than that? Well, I had a whole section about this in an earlier draft, but it got really long and felt like an off-topic digression, so I cut it. But then I repackaged part of it and turned it into a separate post: You can’t imitation-learn how to continual-learn. Anyway, if you read that link, alongside earlier discussions by Eliezer here and John Wentworth here, then you can pretty much piece together the gist of where I’m coming from.
[2] Abram attributes the trick to Daniel Dewey 2011, who in turn attributes it to Nick Hay 2005, which I didn’t read. Later on in 2019, this idea was analyzed in great detail by Everitt, Hutter, Kumar, and Krakovna, who called it “current-RF optimization”—see arxiv link, blog version, lesswrong crosspost.
[3] See also: 6 reasons why “alignment-is-hard” discourse seems alien to human intuitions, and vice-versa (2025) for even more perspectives on Approval Reward.
[4] Of course, what actually matters is that Alex believes that his idol feels that honesty is important and good. Whether they actually do is a different matter. As the saying goes, “never meet your heroes”.
[5] See Mentorship, Management, and Mysterious Old Wizards, and also most of the examples in the “feelgood” email folder anecdote here.
[6] Or maybe he will anyway! But my point is: this is a real consideration that pushes Alex in the direction of disliking that plan. If he does the plan anyway, it would have to be because there were other countervailing considerations that outvoted it.
Is this why the smartest humans (e.g. John von Neumann, Terence Tao) go into math, where verification is definitely easier than generation, instead of fields like philosophy and long-horizon strategy, where plans and outputs are much harder for others to judge? (JvN did do some philosophy and strategy, but surprisingly little relative to his abilities and interests, and I note that his philosophical work, in decision theory, was heavily math-flavored.)
I don’t think this is related to the points I was making in the post … But happy to chat about that anyway.
Yeah sure, common sense says that smart people will tend to enjoy being in more meritocratic intellectual fields, rather than less meritocratic ones, and also that fields in general tend to be more meritocratic when quality is easy to judge (although other things matter too, e.g. glamorous fields have it tougher because they attract grifters).
See e.g. what I wrote here about experimental science.
The mathematics community has successfully kept the cranks out, as far as I know, but two grimly amusing failures (in my controversial opinion) are: (1) in the 2000s, the (correct) theoretical physics consensus that we should be focusing on string theory was somewhat broken by an invasion of people unable to tell good physics theory from bad (e.g. “loop quantum gravity”), and there were enough such people (including department chairs etc.) that they broke the blockade and wound up with positions and credentials; (2) a funny anecdote in Dan Dennett’s memoir.
Having said all that…
If your proposal is:
von Neumann and Tao did math-y stuff rather than other stuff because they got adulation when they did math-y stuff and they got heckled by idiots when they did other stuff.
…then I think that’s part of it but not all of it. I would note that they presumably got good at math by thinking about math all the time, and if they were thinking about math all the time, it’s probably because they found it very satisfying and enjoyable to think about math. I have a kid like that—when he was like 8 years old, I might be talking about politics at dinner or whatever, and he would interrupt me to share something he just thought of about perfect squares that he found very exciting. I.e., some people, when their mind is wandering, think about other people, and some people think about sports, and he was evidently thinking about perfect squares. Anyway, if a person intrinsically enjoys thinking about numbers and symbols, then it stands to reason that they would probably choose a career where they get to think about numbers and symbols all day.
I sometimes wonder why physicists were so overrepresented in the AI x-risk community in the early-ish days (e.g. Hawking, Tegmark, Wilczek, Musk, Tallinn, Rees, Omohundro, Aguirre…). The best I can come up with is that people who self-select into physics are unusually likely to have the combination of (1) smart & quantitative, and (2) really, deeply, profoundly bothered by not understanding important things about the world.
To spell out the relevance that I see: if the same “alignment tax” issue that you mentioned for approval-directed AIs also occurs in humans, then we can’t use humans as an “existence proof” that this problem is solvable. At the same time, if somebody were to come up with a solution to the problem for AIs, the same solution could plausibly be “back-ported” to humans, allowing the smartest humans to be more productive in some especially important fields (like philosophy and long-horizon strategy).
I think the point I was trying to make in this post is both narrower and weirder than the general topics of humans supervising more competent AIs, and generation-verification gaps. For example, my self-image might be partly formed from admiration of the character traits of a cartoon character, or Jesus, etc., and I might feel pride in acting in ways that I imagine them approving of, and that might influence how I go about my day-to-day conduct as a string theory researcher. But Jesus is long gone, and the cartoon character doesn’t even exist at all, and certainly neither was able to evaluate string theory ideas. They’re not “supervising” me in that sense.
Actual humans supervising actual AGIs is something that Paul talked about in IDA stuff, and like I said in the OP, I reject that entire line of research as a dead end.
Separately, I agree that “humans are an existence proof that safe & beneficial brain-like AGI is possible in principle” needs a heavy dose of nuance and caveats (humans are working towards misaligned AGI right now, plus I’d generally expect tech progress to drive humanity off the rails even without AGI or other destructive tech, among other things). But I do think there is some “…existence proof…” argument that goes through. E.g. at least some humans are making the overall situation better not worse (or if not, then we’re screwed no matter what), and AGIs don’t have to match the human population distribution.
Oh I didn’t realize this was your main point. To connect this to my most salient problem, namely how to improve production of philosophy and long-term strategy, I can’t think of anyone who is working in these areas and primarily motivated by the imagined approval of fictional or historical characters. Instead I think they’re mainly trying to win approval of other actual humans.
Do you nevertheless think that fictional approval (is this a good phrase to describe your idea?) is a promising avenue to pursue, for humans and/or AIs? A potential problem is that I don’t see how to ground it; i.e., if the imagined approval diverges from what’s actually good, there seems to be no feedback loop to correct it.
It occurs to me that “at least some humans are making the overall situation better not worse” could be true, but a necessary factor is the constraints those humans have, e.g. limited intelligence, which can’t be reproduced in AIs. (If you limit your AI’s intelligence to make it safer / more aligned, someone will just copy your design and remove the limit.) E.g., maybe if I had von Neumann level IQ, I’d also be working in easy-to-verify domains like math and computer hardware, instead of philosophy and long-term strategy.
This post contains no plan for technical AGI alignment (or anything else). I have no such plan. See the last two paragraphs of the post.
I am trying to find such a plan (or prove that none exists), and in the course of doing so, occasionally I come across a nugget of deconfusion that I want to share :-) Hence this post.
As a general rule, I take interest in certain things that humans sometimes do or want, not because I’m interested in copying those things directly into AGIs, but rather because they are illustrative case studies for building my nuts-and-bolts understanding of aspects of motivation and learning etc. And then I can use that understanding to try to dream up some engineered system that might be useful in AGIs. The resulting engineered system might or might not resemble anything in humans or biology. By analogy, the Wright Brothers learned a lot from soaring birds, but their plane did not look like a bird.
I think what people “mainly” do is not of much interest to me right now. If a few people sometimes do X, then it follows that X is a possible thing that a brain can do, and then I can go try to figure out how the brain does that, and maybe learn something useful for technical alignment of brain-like AGI.
So along those lines: I think that there exist people who have a self-image as a person with such-and-such virtue, and take pride in that, and will (sometimes) make decisions driven by that self-image even when they have high confidence that nobody will ever find out, or worse, when they have high confidence that the people they care most about will despise them for it. They (sometimes) make that decision anyway.
I think this kind of self-image-related motivation has a deep connection to other people’s approval, and is causally downstream of their experience of such approval over a lifetime. But it is definitely NOT the same as consequentialist planning to maximize future approval / status.
First of all, I suspect that fictional approval is constrained in ways similar to the collective’s approval and/or cultural hegemony. Secondly, “the constraints those humans have” could be not limited intelligence, but rather embodiment, and/or growing up in environments with long-term consequences alongside similarly capable but different intelligences. An embodied paperclip optimizer can only do so much with an individual brain and limbs, so it would have to steer others’ actions towards executing its plans (e.g. participating in the creation of a robot army and aligning it to paperclips). Finally, I don’t buy the argument that long-term strategy, unlike philosophy, is hard to verify. Long-term strategy is supposed to have the objective result of goals being achieved or not achieved, and is likely testable in a manner similar to, e.g., the AI-2027 tabletop exercise.
First of all, I doubt the claim that IDA-like agents cannot be bootstrapped to ASI at all. Since GPT-5.4 Pro has likely solved a FrontierMath open problem, I suspect that Yudkowsky’s case against bootstrapping IDA-based agents to ASI either ends up Proving Too Much, or instead proves the different claim that we can’t expect IDA agents to be bootstrapped into an aligned ASI.
Additionally, I am not sure to what extent Approval Reward in humans isn’t a proxy for long-term / large-scale coordination, since Alex tricking Hugh would cause Hugh to become adversarial if the truth is revealed. That being said, the hard problem of wireheading the human does have a human analogue in misaligned governments engaging in wholesale deception of their citizens.
Finally, I suspect that humans get Approval Reward from the entire collective they identify with, not just from another individual human or group. What if, say, Agent-3 develops Approval Reward tied to the entire Agent-3 collective instead of the human collective, and that collective’s values drift?
P.S. I also suspect that the first and second issues mentioned by Demski can be partially solved by rewarding the AI for producing explanations that the simulated human understands, so that the simulated human could rerun the experiments. Even if the simulated human can’t understand the concept of “metastable covalent plasma flux resonances”, the AI wasn’t born with this concept either: the concept is either BS, in which case the simulated human should reject the plan (edit: think of torsion fields in pseudoscience), or it was built up through some interaction of other concepts which are hopefully closer to being comprehensible.