first, I think it implies that we should try to figure out how to reduce the asymmetry in verifiability between capabilities and alignment
If solving alignment implies solving difficult philosophical problems (and I think it does), then a major bottleneck for verifying alignment will be verifying philosophy, which in turn implies that we should be trying to solve metaphilosophy (i.e., understand the nature of philosophy and philosophical reasoning/judgment). But that is unlikely to be possible within 2-4 years, even with the largest plausible effort, considering the history of analogous fields like metaethics and philosophy of math.
What to do in light of this? Try to verify the rest of alignment, just wing it on the philosophical parts, and hope for the best?
in particular, because ultimately the only way we can make progress on alignment is by relying on whatever process for deciding that research is good that human alignment researchers use in practice (even provably correct stuff has the step where we decide what theorem to prove and give an argument for why that theorem means our approach is sound), there’s an upper bound on the best possible alignment solution that humans could ever have achieved, which is plausibly a lot lower than perfectly solving alignment with certainty.
I kind of want to argue against this, but also am not sure how this fits in with the rest of your argument. Whether or not there’s an upper bound that’s plausibly a lot lower than perfectly solving alignment with certainty, it doesn’t seem to affect your final conclusions?
suppose a human solved alignment. how would we check their solution? ultimately, at the end of the day, we look at their argument and use our reasoning and judgement to determine that it’s correct. this applies even if we need to adopt a new frame of looking at the world—we can ultimately only use the kinds of reasoning and judgement we use today to decide which new kinds of reasoning to accept into the halls of truth.
so there is a philosophically very straightforward thing you could do to get the solution to alignment, or the closest we could ever get: put a bunch of smart thoughtful humans in a sim and run it for a long time. so the verification process looks like proving that the sim is correct, showing somehow that the humans are actually correctly uploaded, etc. not trivial but also seems plausibly doable.
Unless you can abstract out the “alignment reasoning and judgement” part of a human’s entire brain process (and philosophical reasoning and judgement as part of that) into some kind of explicit understanding of how it works, how do you actually build that into AI without solving uploading (which we’re obviously not on track to solve in 2-4 years either)?
put a bunch of smart thoughtful humans in a sim and run it for a long time
Alignment researchers have had this thought for a long time (see e.g. Paul Christiano’s A formalization of indirect normativity), but I think all of the practical alignment research programs that this line of thought led to, such as IDA and Debate, are still bottlenecked by a lack of metaphilosophical understanding: without the kind of understanding that lets you build an “alignment/philosophical reasoning checker” (analogous to a proof checker for mathematical reasoning), they’re stuck trying to do ML of alignment/philosophical reasoning from human data, which I think is unlikely to work out well.
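To make the “proof checker” analogy concrete, here is a toy sketch of a propositional proof checker that uses only modus ponens. The point is that explicit rules of inference make checking a candidate derivation mechanical, even when finding one is hard; nothing comparable currently exists for alignment/philosophical reasoning. The function name and the string encoding of implications below are arbitrary illustrative choices, not part of any existing proposal.

```python
# Toy proof checker for propositional logic, using only modus ponens.
# Once the rules of valid reasoning are fully explicit, checking a candidate
# derivation is mechanical, even though finding one may be hard.
# (Purely illustrative; names and encoding are hypothetical.)

def check_proof(axioms: set[str], steps: list[tuple[str, str, str]], goal: str) -> bool:
    """Each step is (conclusion, premise, implication); the implication must
    literally be the string f"{premise} -> {conclusion}" (modus ponens)."""
    known = set(axioms)
    for conclusion, premise, implication in steps:
        if (premise in known and implication in known
                and implication == f"{premise} -> {conclusion}"):
            known.add(conclusion)  # the step is justified, so accept it
        else:
            return False  # one unjustified step invalidates the whole proof
    return goal in known

# Accepts a valid chain of modus ponens applications...
assert check_proof({"A", "A -> B", "B -> C"},
                   [("B", "A", "A -> B"), ("C", "B", "B -> C")], "C")
# ...and rejects a derivation whose justification is missing.
assert not check_proof({"A", "B -> C"}, [("C", "B", "B -> C")], "C")
```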
I periodically say to people “if you want AI to be able to help directly with alignment research, it needs to be good at philosophy in a way it currently isn’t”.
Almost invariably, the person suggests training on philosophy textbooks, or philosophy academic work. And I sort of internally scream and externally say “no, or, at least, not without caveats.” (I think some academic philosophy “counts” as good training data, but, I feel like these people would not have good enough taste to tell the difference[1] and also “train on the text” seems obviously incomplete/insufficient/not-really-the-bottleneck)
I’ve been trying to replace philosophy with the underlying substance. I think I normally mean “precise conceptual reasoning”, but, reading this comment and remembering your past posts, I think you maybe mean something broader or different, but I’m not sure how to characterize it.
I think “figure out what are the right concepts to use, and, use those concepts correctly, across all of relevant-Applied-conceptspace” is the expanded version of what I meant, which maybe feels more likely to be what you mean. But, I’m curious: if you were to taboo “philosophy”, what would you mean?
Figuring out the underlying substance behind “philosophy” is a central project of metaphilosophy, which is far from solved, but my usual starting point is “trying to solve confusing problems which we don’t have established methodologies for solving” (methodologies meaning explicitly understood methods). I think this bakes in the fewest assumptions about what philosophy is or could be, while still capturing the usual meaning of “philosophy”, and it explains why certain fields started off as part of philosophy (e.g., science starting off as natural philosophy) and then became “not philosophy” once we figured out methodologies for solving them.
I think “figure out what are the right concepts to use, and, use those concepts correctly, across all of relevant-Applied-conceptspace” is the expanded version of what I meant, which maybe feels more likely to be what you mean.
This bakes in “concepts” being the most important thing, but is that right? Must AIs necessarily think about philosophy using “concepts”, or is that really the best way to formulate how idealized philosophical reasoning should work?
Is “concepts” even what distinguishes philosophy from non-philosophical problems, or is “concepts” just part of how humans reason about everything, which we latch onto when trying to define or taboo philosophy because we have nothing better to latch onto? My current perspective is that what uniquely distinguishes philosophical problems is their confusing nature and the fact that we have no well-understood methods for solving them (but I would of course be happy to hear any other perspectives on this).
Regarding good philosophical taste (or judgment), that is another central mystery of metaphilosophy, which I’ve been thinking a lot about but don’t have any good handles on. It seems like a thing that exists (and is crucial), but it is very hard to see how/why it could exist or what kind of thing it could be.
So anyway, I’m not sure how much help any of this is, when trying to talk to the type of person you mentioned. The above are mostly some cached thoughts I have on this, originally for other purposes.
BTW, good philosophical taste being rare definitely seems like a very important part of the strategic picture, which potentially makes the overall problem insurmountable. My main hopes are: 1) someone makes an unexpected metaphilosophical breakthrough (kind of like Satoshi coming out of nowhere to totally solve distributed currency), and there’s enough good philosophical taste among the AI safety community (including at the major labs) to recognize it and incorporate it into AI design; or 2) there’s an AI pause during which human intelligence enhancement comes online and selecting for IQ increases the prevalence of good philosophical taste as a side effect (as it seems too much to hope that good philosophical taste would be directly selected for), and/or there’s substantial metaphilosophical progress during the pause.
What Leogao means is that we should increase the ability to verify alignment ALONG with the ability to verify capabilities. The latter is easy at least until the model comes up with a galaxy-brained scheme which allows it to leverage its failures to gain influence on the world. Verifying alignment or misalignment is rather simple with CoT-based AIs (especially if one decides to apply paraphrasers to prevent steganography), but not with neuralese AIs. The WORST aspect is that we are likely to come up with a super-capable architecture AT THE COST of our ability to understand the AI’s thoughts.
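For concreteness, here is a minimal sketch of the paraphraser defense mentioned above, under the assumption that a separate trusted model rewrites each chain-of-thought step before any downstream model reads it; `paraphraser_model` is a hypothetical stand-in, not an existing API.

```python
# Minimal sketch of a chain-of-thought paraphrasing defense: every CoT step is
# rewritten by a separate, trusted paraphraser before any downstream model
# reads it, so information hidden in exact wording or formatting is unlikely
# to survive. `paraphraser_model` is a hypothetical placeholder, not an
# existing API.

from typing import Callable

def paraphrase_cot(cot_steps: list[str],
                   paraphraser_model: Callable[[str], str]) -> list[str]:
    """Return a re-worded chain of thought with surface features scrubbed."""
    return [paraphraser_model(step) for step in cot_steps]

# Example usage with a trivial stand-in "paraphraser":
if __name__ == "__main__":
    sanitized = paraphrase_cot(
        ["First, compute 2 + 2 = 4.", "Therefore the answer is 4."],
        paraphraser_model=lambda step: f"(reworded) {step}",
    )
    print(sanitized)
```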
What I don’t understand is how alignment involves high-level philosophy and how a bad high-level philosophy causes astronomical waste. Astronomical waste is most likely to be caused by humanity or the AIs making a wrong decision about whether to fill the planets in the lightcone with human colonies instead of the lifeforms that could’ve grown on those planets, since this mechanism is already known to be possible.
[1] to be clear I don’t think I’d have good enough taste either
to be clear, “just train on philosophy textbooks lol” is not at all what i’m proposing, and i don’t think that works
Oh yeah I did not mean this to be a complaint about you.