I think we’re in a similar place with the philosophical worries: we have both a bunch of specific games that fail with older theories, and a bunch of proposals (say, variants of FDT) without a clear winner.
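To make "specific games that fail with older theories" concrete, here is a minimal, hedged sketch of Newcomb's problem; it's my own illustration rather than anything from this thread, and the payoff numbers and the `newcomb_expected_value` helper are assumptions chosen purely for illustration.

```python
# Minimal sketch (my illustration, not from this thread): expected payoffs in
# Newcomb's problem as a function of predictor accuracy p. Dominance-style
# (CDT) reasoning says two-boxing is always better, yet against an accurate
# predictor the one-boxing policy does far better in expectation, which is
# the kind of game that trips up older theories and motivates FDT-style proposals.

def newcomb_expected_value(policy: str, p: float) -> float:
    """Expected payoff given predictor accuracy p (probability that the
    prediction matches the agent's actual choice)."""
    if policy == "one-box":
        # The opaque box holds $1,000,000 only if the predictor foresaw one-boxing.
        return p * 1_000_000
    if policy == "two-box":
        # The agent takes the guaranteed $1,000 plus whatever is in the opaque
        # box, which is full only if the predictor (wrongly) foresaw one-boxing.
        return 1_000 + (1 - p) * 1_000_000
    raise ValueError(f"unknown policy: {policy}")

for p in (0.5, 0.9, 0.99):
    print(p, newcomb_expected_value("one-box", p), newcomb_expected_value("two-box", p))
```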
I read Wei as saying “debate will be hard because philosophy will be hard (and path-dependent and brittle), and one of the main things making philosophy hard is decision theory”. I quite strongly disagree.
About decision theory in particular:
I think Wei (and most people) are confused about updatelessness in ways that I’m not. I’m actually writing a post about this right now (but the closest thing for now is this one). More concretely, this is a problem of choosing our priors, which requires a kind of moral deliberation not unique to decision theory.
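As a concrete illustration of why this looks like a question about which prior vantage point to optimize from, here is a minimal sketch of counterfactual mugging; it's my own example rather than the author's, and the specific stakes ($100 / $10,000) and helper names are assumptions made for illustration.

```python
# Minimal sketch (my example, not the author's): counterfactual mugging.
# Omega flips a fair coin; on heads it asks you for $100, and on tails it
# pays you $10,000 only if you would have paid on heads. Optimizing from the
# prior (before the flip) favours the paying policy; optimizing after
# updating on "the coin came up heads" favours refusing. Which vantage point,
# i.e. which prior, to optimize from is the choice being pointed at above.

def ex_ante_value(pays_on_heads: bool) -> float:
    """Expected value of a policy evaluated from the prior, before the flip."""
    heads_payoff = -100 if pays_on_heads else 0
    tails_payoff = 10_000 if pays_on_heads else 0
    return 0.5 * heads_payoff + 0.5 * tails_payoff

def value_after_updating_on_heads(pays_on_heads: bool) -> float:
    """Value of the same policy after conditioning on the coin landing heads."""
    return -100 if pays_on_heads else 0

print(ex_ante_value(True), ex_ante_value(False))                      # 4950.0 0.0
print(value_after_updating_on_heads(True), value_after_updating_on_heads(False))  # -100 0
```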
About philosophy more generally:
I would differentiate between “there is a ground truth but it’s expensive to compute” and “there is literally no ground truth; this is a subjective call, and we just need to engage in some moral deliberation, pitting our philosophical intuitions against each other, to discover what we want to do”.
For the former category, I agree “expensive ground truths” can be a problem for debate, or alignment in general, but I expect it to also appear (and in fact do so sooner) on technical topics that we wouldn’t call philosophy. And I’d hope to have solutions that are mostly agnostic on subject matter, so the focus on philosophy doesn’t seem warranted (although it can be a good case study!).
I think ethics, normativity, decision theory, and some other parts of philosophy fall squarely into the latter category. I’m sympathetic to Wei’s (and others’) worries that most of the value of the future can be squandered if we solve intent alignment and then choose the wrong kind of moral deliberation. But this problem seems totally orthogonal to getting debate to work in the technical sense that the UKAISI Alignment team focuses on.
I think the situation in decision theory is way more confusing than this. See https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever and I would be happy to have a chat about this if that would help convey my view of the current situation.