williawa comments on Counting Arguments in AI Safety

williawa 22 May 2026 18:47 UTC
2 points
0
> I am arguing that a priori we should be “uncertain” about pretty much any aspect of AI motivation, so long as it’s not contradictory. There certainly exist optimisation pressures that would select for an AI that wants to spend all of its time building anime cat-girls, but given knowledge of our current economic systems and the culture inside labs, it seems very likely that optimisation pressures will actually select against this.

I’m just saying a priori, do you think we should be ⁵⁰⁄₅₀ on whether the ASI ends up spending its time building catgirls? Or are you advocating for some kind of uncertainty we can’t assign numbers to?
Like, I don’t think there is a fully canonical way to parameterize the space of all goals AIs have. But there are still somewhat principled arguments you can use to judge that some parametrizations are better than others. Like randomly privileging some concept you like, without good reason, is probably not The Way.
Like if you have AIXI, and you have a value-function slot, and you randomly sample from bit strings that encode valid value functions, weighted by their length, I’d be willing to bet at 1:999999 odds you don’t end up with a catgirl-AIXI. To get catgirl-AIXI, you have to put a bunch of stuff into the specification of your probability weighting.
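
A toy sketch of the length-weighting point, under assumptions that are not in the comment itself: every bit string counts as a valid value function, a particular string of length n gets prior weight 2^-(2n+1) (length drawn geometrically, then a uniform string of that length), and a goal counts as "specified" when the string starts with one fixed k-bit prefix. The function names are hypothetical. Under that weighting, the mass on any fixed k-bit prefix is 4^-k.

```python
from fractions import Fraction
from itertools import product


def string_weight(bits: str) -> Fraction:
    """Prior mass of one specific bit string (assumed weighting: 2^-(2n+1))."""
    n = len(bits)
    return Fraction(1, 2 ** (2 * n + 1))


def prefix_mass_closed_form(k: int) -> Fraction:
    """Total mass of all strings starting with one fixed k-bit prefix.

    Sum over n >= k of 2^(n-k) strings, each weighing 2^-(2n+1), equals 4^-k.
    """
    return Fraction(1, 4 ** k)


def prefix_mass_enumerated(prefix: str, max_len: int) -> Fraction:
    """Brute-force check: sum the weights of strings up to max_len bits."""
    total = Fraction(0)
    for n in range(len(prefix), max_len + 1):
        for tail in product("01", repeat=n - len(prefix)):
            total += string_weight(prefix + "".join(tail))
    return total


# Ten bits of goal-specific description already push the mass below 1e-6.
print(prefix_mass_closed_form(10) < Fraction(1, 10 ** 6))
```

The enumeration undershoots the closed form only by the truncated tail, so the two agree as `max_len` grows. Any other length-based weighting would change the constants but not the exponential falloff in k.
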
> A lot of prosaic alignment is relevant to this. The post was mostly aimed at the class of people who are very sceptical of prosaic alignment in general because current techniques won’t scale to the things that actually matter.

To me the counting arguments are more about this: some people think “50/50 it kills us or doesn’t, we do RLHF++ which moves it to ⁹⁵⁄₅, we do a bunch of other fancy monitoring and control with interp in the loop and alignment pretraining etc., and we move it to 99/1.”
Others (I’m in this camp) view it more like a 1:10^9999* chance a random neural-net ASI is friendly. Then if it was created from a pretrained base, it’s somewhere between 1:10^998 and 1:10^-1* (with the distribution over those netting to ~20%?). Then post-training moves it to somewhere between 1:10^9997 and 1:10^-999 (netting to maybe 50%?).
Does this sound reasonable? Does it illustrate why I think counting arguments tell us something, but that almost all the work goes into figuring out how details of training update that distribution?
*just me writing the first numbers that pop into my head for illustrative purposes
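
A minimal sketch of how a credence spread over wildly different odds "nets" to a single probability, assuming "1:10^x" means odds of friendly to unfriendly and that the net figure is the credence-weighted average of the implied probabilities. The two-point distribution below is made up for illustration and is not the commenter's; the function names are hypothetical.

```python
from fractions import Fraction


def prob_from_odds_exponent(x: int) -> Fraction:
    """Probability of 'friendly' when the odds friendly:unfriendly are 1 : 10**x."""
    if x >= 0:
        return Fraction(1, 1 + 10 ** x)
    # Negative exponent: odds of 1 : 10**-|x| are the same as 10**|x| : 1.
    return Fraction(10 ** (-x), 10 ** (-x) + 1)


def net_probability(credences: dict[int, Fraction]) -> Fraction:
    """Credence-weighted average of the probabilities implied by each exponent."""
    if sum(credences.values()) != 1:
        raise ValueError("credences must sum to 1")
    return sum(w * prob_from_odds_exponent(x) for x, w in credences.items())


# Illustrative only: 80% credence on 1:10^998, 20% credence on 1:10^-1.
credences = {998: Fraction(4, 5), -1: Fraction(1, 5)}
net = net_probability(credences)
print(float(net))
```

With these made-up weights the net comes out at roughly 18%, and essentially all of it comes from the 20% credence placed on the favourable end; the 1:10^998 branch contributes nothing visible. That is one way to read the claim that most of the work lies in how training shifts the distribution.
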
- Samuel Ratnam 23 May 2026 19:37 UTC
  1 point
  0
  > I’m just saying a priori, do you think we should be ⁵⁰⁄₅₀ on whether the ASI ends up spending its time building catgirls? Or are you advocating for some kind of uncertainty we can’t assign numbers to?
  
  Yes, I am saying precisely that we shouldn’t assign ⁵⁰⁄₅₀ when faced with deep ignorance about the nature of the generating process. Maybe I am advocating for a stance of Knightian uncertainty, or maybe just saying that the standard Bayesian approach can plausibly lead you to be much too overconfident in your views. I think reasoning about ASI a priori doesn’t make sense in the same way that the question posed in Bertrand’s paradox doesn’t make sense.
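
  For readers unfamiliar with Bertrand's paradox, a small simulation sketch: "pick a random chord of the unit circle" has three standard readings, and the chance that the chord is longer than the side of the inscribed equilateral triangle differs between them (analytically 1/3, 1/2 and 1/4). The function names and sample size are illustrative choices.

  ```python
  import math
  import random

  SIDE = math.sqrt(3.0)  # side of the equilateral triangle inscribed in the unit circle


  def chord_random_endpoints(rng: random.Random) -> float:
      """Chord between two points chosen uniformly on the circumference."""
      a = rng.uniform(0.0, 2.0 * math.pi)
      b = rng.uniform(0.0, 2.0 * math.pi)
      return 2.0 * abs(math.sin((a - b) / 2.0))


  def chord_random_radius(rng: random.Random) -> float:
      """Chord whose midpoint sits at a uniform distance along a radius."""
      d = rng.random()
      return 2.0 * math.sqrt(1.0 - d * d)


  def chord_random_midpoint(rng: random.Random) -> float:
      """Chord whose midpoint is uniform over the disc (squared distance is uniform)."""
      d_squared = rng.random()
      return 2.0 * math.sqrt(1.0 - d_squared)


  def estimate(sampler, n: int = 200_000, seed: int = 0) -> float:
      """Monte Carlo estimate of P(chord longer than the triangle side)."""
      rng = random.Random(seed)
      return sum(sampler(rng) > SIDE for _ in range(n)) / n


  for sampler in (chord_random_endpoints, chord_random_radius, chord_random_midpoint):
      print(sampler.__name__, estimate(sampler))
  ```

  The three estimates land near the three analytic values, so the answer depends entirely on which sampling procedure is meant.
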
  > Like if you have AIXI, and you have a value-function slot, and you randomly sample from bit strings that encode valid value functions, weighted by their length, I’d be willing to bet at 1:999999 odds you don’t end up with a catgirl-AIXI. To get catgirl-AIXI, you have to put a bunch of stuff into the specification of your probability weighting.
  
  So two things I want to point out here:
  1) I suspect there is no single privileged value function encoding, so I think the answer to this question is undefined. I admit though that some value function encodings look weirder than others, and I think I agree with the general gist of what you’re saying.
  2) I think that the jump to “1:10^9999* chance a random neural-net ASI is friendly” is not valid here. AIXI is one framework which we can use to reason about intelligent neural networks, but it is far from the only one. I think ‘random neural-net ASI’ is underspecified unless you provide a sampling method. Yes, you’re allowed to use this as a prior, and I think it’s not an unreasonable one to use here, but there are other priors you could use that would give you much less pessimistic conclusions.
  - williawa 23 May 2026 20:29 UTC
    2 points
    0
    > Yes, I am saying precisely that we shouldn’t assign ⁵⁰⁄₅₀ when faced with deep ignorance about the nature of the generating process. Maybe I am advocating for a stance of Knightian uncertainty or maybe just saying that the standard Bayesian approach can plausibly lead you to be much too overconfident in your views.

    I mean, to be honest, I don’t really understand Knightian uncertainty. ASI either kills us or it doesn’t. If we lived in a distinct reality where the outcome of ASI doesn’t impact us except we get to know how it went, and I offered you a bet, where you get a billion dollars if ASI doesn’t kill everyone, and you have to pay me 1 cent if it does. Do you not take the bet? That seems absurd to me, but if you do take the bet, you have an implied probability distribution over outcomes.
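
    The implied probability in this bet can be written out, assuming a risk-neutral bettor who accepts exactly when the expected payoff is positive, with p denoting their credence that ASI does not kill everyone (p and the risk-neutrality assumption are introduced here for illustration):

    ```latex
    \mathbb{E}[\text{payoff}] = p \cdot 10^{9} - (1 - p) \cdot 10^{-2} > 0
    \quad\Longleftrightarrow\quad
    p > \frac{10^{-2}}{10^{9} + 10^{-2}} \approx 10^{-11}
    ```

    So under that assumption, taking the bet only commits you to a credence above roughly one in a hundred billion. A risk-averse bettor, or one reasoning over a set of priors, would not be pinned down in the same way.
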
    I guess a more direct argument against what you’re saying is: Bertrand’s paradox occurs because the sample space is underspecified. But in the real world, there is a fact of the matter about the sample space. We are uncertain about that fact. But we can observe that almost any such sample space, unless we put in a bunch of very specific information, will have the property that friendly goals occupy a very small fraction of it.
    > I think that the jump to “1:10^9999* chance a random neural-net ASI is friendly” is not valid here. AIXI is one framework which we can use to reason about intelligent neural networks, but it is far from the only one. I think ‘random neural-net ASI’ is underspecified unless you provide a sampling method, and yes you’re allowed to use this as a prior, and I think it’s not an unreasonable one to use here, but there are other priors you could use that would give you much less pessimistic conclusions.

    This is kind of addressed by what I said above, but imagine you have a GPT-2-style transformer, but it has 1 quadrillion layers and a hundred-quadrillion-dimensional hidden state. Then look over all possible ways to assign values to the weights, assuming a fixed floating point representation.
    Seems likely to me some fraction of those will yield an ASI. And some fraction of those ASIs will be friendly. But that fraction will be vanishingly small. And furthermore, if you want to modify the probability distribution over weights so that friendly AIs are likely, you’ll need to add a bunch of ad-hoc and very specific information into the distribution.
    - Samuel Ratnam 25 May 2026 13:24 UTC
      1 point
      0
      > I guess a more direct argument against what you’re saying is: Bertrand’s paradox occurs because the sample space is underspecified. But in the real world, there is a fact of the matter about the sample space. We are uncertain about that fact. But we can observe that almost any such sample space, unless we put in a bunch of very specific information, will have the property that friendly goals occupy a very small fraction of it.
      
      I agree with everything you said here, up to the point of “but we can observe that almost any such sample space...”. My point is twofold:
      (1) instead of reasoning about proportions of sample spaces, we should actually try to reduce uncertainty about the sample space by reasoning directly about it
      (2) when you start reasoning about “friendly goals”, you are implicitly projecting down to a lower-dimensional subspace, which does not necessarily preserve the structure of your original space, and so may radically distort proportions
      - williawa 25 May 2026 15:11 UTC
        2 points
        0
        > when you start reasoning about “friendly goals”, you are implicitly projecting down to a lower-dimensional subspace, which does not necessarily preserve the structure of your original space, and so may radically distort proportions

        No offense, I already addressed this.
        > (1) instead of reasoning about proportions of sample spaces, we should actually try to reduce uncertainty about the sample space by reasoning directly about it

        You “should” do that. I agree, but that doesn’t mean counting arguments provide no evidence. If I have a mole that appears diffuse and growing, I should get it checked out to be sure it is/isn’t cancer. But that doesn’t mean a diffuse mole that’s growing isn’t more scary than one that isn’t growing and isn’t diffuse.
        Samuel Ratnam 28 May 2026 11:47 UTC
        1 point
        0
        Hey, sorry for not engaging properly. On rereading your earlier comment:
        
        > This is kind of addressed by what I said above, but imagine you have a GPT-2-style transformer, but it has 1 quadrillion layers and a hundred-quadrillion-dimensional hidden state. Then look over all possible ways to assign values to the weights, assuming a fixed floating point representation.
        > Seems likely to me some fraction of those will yield an ASI. And some fraction of those ASIs will be friendly. But that fraction will be vanishingly small. And furthermore, if you want to modify the probability distribution over weights so that friendly AIs are likely, you’ll need to add a bunch of ad-hoc and very specific information into the distribution.
        
        This seems like a more reasonable prior to start with. I think this gets you less doom than AIXI though. But I can make this same argument about biology: consider all of the possible sets of DNA that can produce an organism. Some fraction of these are going to be at least as intelligent as a mouse. But a vanishingly small fraction of these have legs (most, in fact, are just blobs of brains or other cognitive machinery). So we should expect it to be incredibly unlikely that systems smarter than a mouse have legs. But it turns out that it’s actually not that hard to create a selection process for which legs are a convergent property. I would claim that we don’t really know how hard it is to create a selection process for which “don’t kill us” is a convergent property. So I think unless you specify a particular selection process, it’s hard to make strong claims purely from reasoning about the sample space.
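
        A toy sketch of the gap between counting and selection, under heavy assumptions: genomes are random bit lists, the "trait" (a stand-in for legs) is having the first k bits all set, and selection is simple truncation with mutation. None of this models real biology or real training; the names and parameters are hypothetical.

        ```python
        import random


        def has_trait(genome, k):
            """The trait is present when the first k bits are all set."""
            return all(genome[:k])


        def fitness(genome, k):
            """Selection rewards each set bit among the first k."""
            return sum(genome[:k])


        def uniform_trait_rate(n_bits, k, samples, rng):
            """Fraction of uniformly sampled genomes that carry the trait."""
            hits = 0
            for _ in range(samples):
                genome = [rng.random() < 0.5 for _ in range(n_bits)]
                hits += has_trait(genome, k)
            return hits / samples


        def selected_trait_rate(n_bits, k, pop_size, generations, mutation_rate, rng):
            """Fraction carrying the trait after truncation selection with mutation."""
            pop = [[rng.random() < 0.5 for _ in range(n_bits)] for _ in range(pop_size)]
            for _ in range(generations):
                pop.sort(key=lambda g: fitness(g, k), reverse=True)
                survivors = pop[: pop_size // 2]
                # Each survivor leaves one child; every bit flips with mutation_rate.
                children = [
                    [bit != (rng.random() < mutation_rate) for bit in g] for g in survivors
                ]
                pop = survivors + children
            return sum(has_trait(g, k) for g in pop) / pop_size


        rng = random.Random(0)
        print(uniform_trait_rate(40, 20, 20_000, rng))
        print(selected_trait_rate(40, 20, 100, 300, 0.01, rng))
        ```

        Uniform sampling almost never produces the trait (base rate 2^-20), while the selected population carries it in most individuals. The sketch only shows that the two can come apart; how hard it is to build a selection process for which "don't kill us" is convergent is exactly what remains open.
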
        
        > I mean, to be honest, I don’t really understand Knightian uncertainty. ASI either kills us or it doesn’t. If we lived in a distinct reality where the outcome of ASI doesn’t impact us except we get to know how it went, and I offered you a bet, where you get a billion dollars if ASI doesn’t kill everyone, and you have to pay me 1 cent if it does. Do you not take the bet? That seems absurd to me, but if you do take the bet, you have an implied probability distribution over outcomes.
        
        So Infra-Bayesianism (which I don’t really understand) has a way of dealing with this through risk aversion across sets of priors. But yeah, I do have some implied probability distribution through the class of bets I’m willing to take. This plausibly manifests as not really being willing to take bets around the ⁵⁰⁄₅₀ mark and the further away you go, the more willing I am to take bets. You can perhaps model me as being risk averse and having uncertainty over my own credences. But I haven’t really given this much thought, and I’m not entirely sure I endorse this.
        
        > You “should” do that. I agree, but that doesn’t mean counting arguments provide no evidence. If I have a mole that appears diffuse and growing, I should get it checked out to be sure it is/isn’t cancer. But that doesn’t mean a diffuse mole that’s growing isn’t more scary than one that isn’t growing and isn’t diffuse.
        
        Yep, I agree with this. I think counting arguments should give you some evidence (which is maybe where I diverge from the Belrose and Pope view), but the further removed they are from the reality of how these systems are trained, the less I think they should update you.
        
        > No offense, I already addressed this.
        
        Please let me know if I haven’t responded properly to this; I wasn’t sure what exactly you were referring to.