But Jesus is long gone, and the cartoon character doesn’t even exist at all, and certainly neither was able to evaluate string theory ideas. They’re not “supervising” me in that sense.
Oh I didn’t realize this was your main point. To connect this to my most salient problem, namely how to improve the production of philosophy and long-term strategy: I can’t think of anyone working in these areas who is primarily motivated by the imagined approval of fictional or historical characters. Instead I think they’re mainly trying to win approval of other actual humans.
Do you nevertheless think that fictional approval (is this a good phrase to describe your idea?) is a promising avenue to pursue, for humans and/or AIs? A potential problem is that I don’t see how to ground it, i.e., if the imagined approval diverges from what’s actually good, there is no feedback loop to correct it.
But I do think there is some “existence proof” argument that goes through. E.g., at least some humans are making the overall situation better, not worse (or if not, then we’re screwed no matter what), and AGIs don’t have to match the human population distribution.
It occurs to me that “at least some humans are making the overall situation better, not worse” could be true, but a necessary factor is the constraints those humans have, e.g., limited intelligence, which can’t be reproduced in AIs. (If you limit your AI’s intelligence to make it safer / more aligned, someone will just copy your design and remove the limit.) E.g., maybe if I had a von Neumann-level IQ, I’d also be working in easy-to-verify domains like math and computer hardware, instead of philosophy and long-term strategy.
This post contains no plan for technical AGI alignment (or anything else). I have no such plan. See the last two paragraphs of the post.
I am trying to find such a plan (or prove that none exists), and in the course of doing so, occasionally I come across a nugget of deconfusion that I want to share :-) Hence this post.
As a general rule, I take interest in certain things that humans sometimes do or want, not because I’m interested in copying those things directly into AGIs, but rather because they are illustrative case studies for building my nuts-and-bolts understanding of aspects of motivation, learning, etc. And then I can use that understanding to try to dream up some engineered system that might be useful in AGIs. The resulting engineered system might or might not resemble anything in humans or biology. By analogy, the Wright Brothers learned a lot from soaring birds, but their plane did not look like a bird.
I think they’re mainly trying to win approval of other actual humans.
I think what people “mainly” do is not of much interest to me right now. If a few people sometimes do X, then it follows that X is a possible thing that a brain can do, and then I can go try to figure out how the brain does that, and maybe learn something useful for technical alignment of brain-like AGI.
So along those lines: I think that there exist people who have a self-image as a person with such-and-such virtue, and take pride in that, and will (sometimes) make decisions driven by that self-image even when they have high confidence that nobody will ever find out, or worse, when they have high confidence that the people they care most about will despise them for it. They (sometimes) make that decision anyway.
I think this kind of self-image-related motivation has a deep connection to other people’s approval, and is causally downstream of their experience of such approval over a lifetime. But it is definitely NOT the same as consequentialist planning to maximize future approval / status.
First of all, I suspect that fictional approval has constraints similar to the collective’s approval and/or cultural hegemony. Secondly, “the constraints those humans have” might be not limited intelligence but embodiment, and/or growing up in environments with long-term consequences and similarly capable but different intelligences. An embodied paperclip optimizer can do only so much with an individual brain and limbs, so it would have to steer others’ actions towards executing its plans (e.g., participating in the creation of a robot army and aligning it to paperclips). Finally, I don’t buy the argument that long-term strategy, unlike philosophy, is hard to verify: LTS is supposed to have an objective result, goals being achieved or not achieved, and is likely testable in a manner similar to, e.g., the AI-2027 tabletop exercise.