I’m curious if your team has any thoughts on my post Some Thoughts on Metaphilosophy, which was in large part inspired by the Debate paper, and also seems relevant to “Good human input” here.
Specifically, I’m worried about this kind of system driving the simulated humans out of distribution, either gradually or suddenly, accidentally or intentionally. And distribution shift could cause problems either with the simulation (presumably similar to or based on LLMs instead of low-level neuron-by-neuron simulation), or with the human(s) themselves. In my post, I talked about how philosophy seems to be a general way for humans to handle OOD inputs, but tends to be very slow and may be hard for ML to learn (or needs extra care to implement correctly). I wonder if you agree with this line of thought, or have some other ideas/plans to deal with this problem.
Aside from the narrow focus on “good human input” in this particular system, I’m worried about social/technological change being accelerated by AI faster than humans can handle it (due to similar OOD / slowness of philosophy concerns), and wonder if you have any thoughts on this more general issue.
I broadly agree with these concerns. I think we can split it into (1) the general issue of AGI/ASI driving humans out of distribution and (2) the specific issue of how assumptions about human data quality as used in debate will break down. For (2), we’ll have a short doc soon (next week or so) which is somewhat related, along the lines of “assume humans are right most of the time on a natural distribution, and search for protocols which report uncertainty if the distribution induced by a debate protocol on some new class of questions is sufficiently different”. Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
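To make the “report uncertainty under distribution shift” idea above concrete, here is a minimal sketch of one way such a check could look. This is my illustration, not anything from the forthcoming doc: the precomputed query embeddings, the k-NN shift score, and the 2.0 threshold are all hypothetical choices.

```python
# Minimal sketch (illustration only, not the forthcoming doc's protocol): report
# uncertainty when the queries a debate protocol induces for the (simulated)
# human judge drift too far from the natural distribution on which humans are
# assumed to be mostly right. Embeddings, the k-NN score, and the threshold
# are hypothetical stand-ins.

import numpy as np


def knn_distance(points: np.ndarray, reference: np.ndarray,
                 k: int = 5, skip_self: bool = False) -> float:
    """Mean distance from each point to its k nearest reference points."""
    dists = np.linalg.norm(points[:, None, :] - reference[None, :, :], axis=-1)
    dists.sort(axis=1)
    cols = slice(1, k + 1) if skip_self else slice(0, k)
    return float(dists[:, cols].mean())


def shift_score(natural: np.ndarray, induced: np.ndarray, k: int = 5) -> float:
    """How atypical the induced judge queries are, relative to how spread out
    the natural queries already are; around 1.0 means no detectable shift."""
    return (knn_distance(induced, natural, k)
            / knn_distance(natural, natural, k, skip_self=True))


def protocol_output(human_verdict: str, natural_embeddings: np.ndarray,
                    induced_embeddings: np.ndarray, threshold: float = 2.0) -> str:
    """Pass through the human verdict only when the induced query distribution
    is close enough to the natural one; otherwise report uncertainty."""
    if shift_score(natural_embeddings, induced_embeddings) > threshold:
        return "uncertain: judge queries are off the natural distribution"
    return human_verdict
```

The point is only the shape of the protocol: answer as usual on-distribution, and degrade to “uncertain” rather than to a confident wrong verdict when the induced distribution skews too far.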
One general note is that scalable oversight is a method for accelerating an intractable computation built out of tractable components, and these components can include both humans and conventional software. So if you understand the domain somewhat well, you can try to mitigate failures of (2) (and potentially gain more traction on (1)) by formalising part of the domain. And this formalisation can be bootstrapped: you can use on-distribution human data to check specifications, and then use those specifications (code, proofs, etc.) so that human queries cover a smaller portion of the next-stage computation. But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
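And a similarly hedged sketch of the bootstrapping step described above: a specification is only trusted once it agrees with humans on on-distribution data, and once trusted it answers the sub-queries it covers, so human queries make up a smaller share of the next-stage computation. The names (`Query`, `validate_spec`) and the 95% agreement bar are illustrative assumptions, not anything from the post.

```python
# Minimal sketch of the bootstrapping idea (illustration only, not the post's
# machinery): check a specification against on-distribution human answers,
# then let it replace human queries wherever it applies. All names and the
# agreement bar are hypothetical.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Query:
    text: str
    formal_form: Optional[object] = None  # set if this sub-query was formalised


def validate_spec(spec: Callable[[object], str],
                  human_answer: Callable[[Query], str],
                  on_distribution_sample: list,
                  min_agreement: float = 0.95) -> bool:
    """Trust the spec only if it matches humans where humans are assumed reliable."""
    formal = [q for q in on_distribution_sample if q.formal_form is not None]
    if not formal:
        return False
    agree = sum(spec(q.formal_form) == human_answer(q) for q in formal)
    return agree / len(formal) >= min_agreement


def next_stage_answer(query: Query, spec: Callable[[object], str],
                      human_answer: Callable[[Query], str],
                      spec_is_trusted: bool) -> str:
    """Rely on the checked spec where it applies, and on humans only elsewhere."""
    if spec_is_trusted and query.formal_form is not None:
        return spec(query.formal_form)
    return human_answer(query)
```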
Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.
This is my concern, and I’m glad it’s at least on your radar. How do you / your team think about competitiveness in general? (I did a simple search and the word doesn’t appear in this post or the previous one.) How much competitiveness are you aiming for? Will there be a “competitiveness case” later in this sequence, or later in the project? Etc.?
But generally this requires you to have some formal purchase on the philosophical aspects where humans are off distribution, which may be rough.
Because of the “slowness of philosophy” issue I talked about in my post, we have no way of quickly reaching high confidence that any such formalization is correct, and we have a number of negative examples where a proposed formal solution to some philosophical problem that initially looked good turned out to be flawed upon deeper examination. (See decision theory and Solomonoff induction.) AFAIK we don’t really have any positive examples of such formalizations that have stood the test of time. So I feel like this is basically not a viable approach.
The Dodging systematic human errors in scalable oversight post is out as you saw, so we can mostly take the conversation over there. But briefly, I think I’m mostly just more bullish on the margin than you about (1) the probability that we can in fact make purchase on the hard philosophy, should that be necessary, and (2) the utility we can get out of solving other problems should the hard philosophy problems remain unsolved. The goal with the dodging human errors post would be that if we fail at (1), we’re more likely to recognise it and try to get utility out of (2) on other questions.
Part of this is that my mental model of formalisations standing the test of time is that we do have a lot of these: both of the links you point to are formalisations that have stood the test of time and have some reasonable domain of applicability in which they say useful things. I agree they aren’t bulletproof, but I think I’d place a higher chance than you on muddling through with imperfect machinery. This is similar to physics: I would argue, for example, that Newtonian physics has stood the test of time even though it is wrong, as it still applies across a large domain of applicability.
That said, I’m not at all confident in this picture: I’d place a lower probability than you on these considerations biting, but not that low.
Sorry about the delayed reply. I’ve been thinking about how to respond. One of my worries is that human philosophy is path-dependent; put another way, we’re prone to accepting wrong philosophical ideas/arguments, and then it’s hard to talk us out of them. The split of western philosophy into analytical and continental traditions seems to be an instance of this, and even within analytical philosophy, academic philosophers strongly disagree with each other, are each confident in their own positions, and rarely get talked out of them. I think/hope that humans collectively can still make philosophical progress over time (in some mysterious way that I wish I understood) if we’re left to our own devices, but the process seems pretty fragile and probably can’t withstand much external optimization pressure.
On formalizations, I agree they’ve stood the test of time in your sense, but is that enough to build them into AI? We can see that they’re wrong on some questions, but can’t formally characterize the domain in which they are right. And even if we could, I don’t know why we’d expect to muddle through… What if we built AI based on Debate, but used Newtonian physics to answer physics queries instead of human judgment, or the humans were pretty bad at answering physics-related questions (including meta questions like how to do science)? That would be pretty disastrous, especially if there are any adversaries in the environment, right?
(Sorry as well for delay! Was sick.)
Continuing with the Newtonian physics analogy, the case for optimism would be:
1. We have some theories with limited domain of applicability. Say, theory A.
2. Theory A is wrong at some limit, where it is replaced by theory B. Theory B is still wrong, but it has a larger domain of applicability.
3. We don’t know theory B, and can’t access it despite our best scalable oversight techniques, even though the AIs do figure out theory B. (This is the hard case: I think there are other cases where scalable oversight does work.)
4. However, we do have some purchase on the domain of applicability of theory A: we know the limits of where it’s been tested (energy levels, length scales, etc.).
5. Scalable oversight has an easier job talking about these limits of theory A than about theory B itself. Concretely, this means you can express arguments like “theory A doesn’t resolve question Q, as the answer depends on applying theory A beyond its decent-confidence domain of applicability” (see the sketch after this list).
6. Profit.
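To illustrate point 5, here is a minimal sketch (my own, with made-up thresholds) of an oversight check that only lets theory A, here Newtonian mechanics, answer questions inside its tested domain, and otherwise reports that the answer would depend on physics beyond that domain.

```python
# Minimal sketch of point 5 above (illustration only, not a proposed
# implementation): a judge that will only use theory A (Newtonian mechanics)
# inside its tested domain of applicability, and otherwise reports that the
# question depends on physics beyond that domain. Thresholds are illustrative.

from dataclasses import dataclass

C = 3.0e8          # speed of light, m/s
HBAR = 1.055e-34   # reduced Planck constant, J*s


@dataclass
class PhysicsQuestion:
    description: str
    speed: float          # characteristic speed, m/s
    action_scale: float   # characteristic action, J*s


def within_theory_a_domain(q: PhysicsQuestion,
                           max_speed_fraction: float = 0.01,
                           min_action_multiple: float = 1e6) -> bool:
    """Crude check that the question stays where Newtonian mechanics has been
    well tested: speeds far below c and action scales far above hbar."""
    return (q.speed < max_speed_fraction * C
            and q.action_scale > min_action_multiple * HBAR)


def judge_with_theory_a(q: PhysicsQuestion, theory_a_answer: str) -> str:
    """Answer with theory A inside its tested domain; otherwise abstain."""
    if within_theory_a_domain(q):
        return theory_a_answer
    return ("Theory A does not resolve this question: the answer depends on "
            "applying it beyond its decent-confidence domain of applicability.")
```

The thresholds stand in for “the limits of where it’s been tested” in point 4; nothing here requires knowing theory B.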
This gives you a capability cap: the AIs know theory B but you can’t use it. But I do think that if you can pull off the necessary restriction on which questions you answer, you can muddle through, even if you know only theory A and have some sense of its limits. The limits of Newtonian physics started to appear long before the replacement theories (relativity and quantum mechanics). I think we’re in a similar place with the philosophical worries: we have both a bunch of specific games that fail with older theories, and a bunch of proposals (say, variants of FDT) without a clear winner.
The additional big thing you need here is a property of the world that makes that capability cap okay: if the only way to succeed is to find perfect solutions using theory B, say because that gives you a necessary edge in an adversarial competition between multiple AIs, then lacking theory B sinks you. But I think we have a shot at not being in the worst case here.
I think we’re in a similar place with the philosophical worries: we have both a bunch of specific games that fail with older theories, and a bunch of proposals (say, variants of FDT) without a clear winner.
I think the situation in decision theory is way more confusing than this. See https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever and I would be happy to have a chat about this if that would help convey my view of the current situation.
I read Wei as saying “debate will be hard because philosophy will be hard (and path-dependent and brittle), and one of the main things making philosophy hard is decision theory”. I quite strongly disagree.
About decision theory in particular:
I think Wei (and most people) are confused about updatelessness in ways that I’m not. I’m actually writing a post about this right now (but the closest thing for now is this one). More concretely, this is a problem of choosing our priors, which requires a kind of moral deliberation not unique to decision theory.
About philosophy more generally:
I would differentiate between “there is a ground truth but it’s expensive to compute” and “there is literally no ground truth; this is a subjective call, and we just need to engage in some moral deliberation, pitting our philosophical intuitions against each other, to discover what we want to do”.
For the former category, I agree “expensive ground truths” can be a problem for debate, or alignment in general, but I expect it to also appear (and in fact do so sooner) on technical topics that we wouldn’t call philosophy. And I’d hope to have solutions that are mostly agnostic on subject matter, so the focus on philosophy doesn’t seem warranted (although it can be a good case study!).
I think ethics, normativity, decision theory, and some other parts of philosophy fall squarely into the latter category. I’m sympathetic to Wei’s (and others’) worries that most of the value of the future can be squandered if we solve intent alignment and then choose the wrong kind of moral deliberation. But this problem seems totally orthogonal to getting debate to work in the technical sense that the UKAISI Alignment team focuses on.
Interesting post!
Could you say more about what you mean by “driving the simulated humans out of distribution”? Is it something like “during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or humans might be worse than on the training distribution)”?
The solution in the sketch is to keep the question distribution during deployment similar, and to do online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won’t work?
Is it something like “during deployment, the simulated human judges might be asked to answer questions far outside the training distribution, and so they might fail to accurately simulate humans (or humans might be worse than on the training distribution)”?
Yes, but my concern also includes this happening during training of the debaters, when the simulated or actual humans can also go out of distribution, e.g., the actual human is asked a type of question that he has never considered before, and either answers in a confused way or has to use philosophical reasoning and a lot of time to try to answer, or maybe it looks like one of the debaters “jailbreaking” a human via some sort of out-of-distribution input.
The solution in the sketch is to keep the question distribution during deployment similar, and to do online training during deployment (the simulated human judges could also be subject to online training). Is there a reason you think that won’t work?
This intuitively seems hard to me, but since Geoffrey mentioned that you have a doc coming out related to this, I’m happy to read it to see if it changes my mind. But this still doesn’t solve the whole problem, because as Geoffrey also wrote, “Of course, if the questions on which we need to use AI advice force those distributions to skew too much, and there’s no way for debaters to adapt and bootstrap from on-distribution human data, that will mean our protocol isn’t competitive.”