You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that (that I am pretty sure are shared by many others as well).
We can talk about those reasons, but I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification. It’s not like you have a theory that is in any way more constraining here.
And it seems you expect this behavior to stop because of the capabilities of the models, rather than because of deliberate efforts to mitigate deception in AIs.
I mean, I expect the observations to be affected by both, of course. That’s one of the key things that makes predictions in the space so messy.
I am not super interested in being psychologized about whether I am structuring my theories intentionally to avoid falsification.
For what it’s worth, I explicitly clarified that, in my view, you were not consciously doing this. My main point is that it seems really hard to pin down what you actually predict will happen in this situation.
You made some pretty strong claims suggesting that my theory (or the theories of people in my reference class) was making strong predictions in the space. I corrected you and said “no, it doesn’t actually make the prediction you claim it makes” and gave my reasons for believing that
I don’t think what you said really counts as a “correction” so much as a counter-argument. I think it’s reasonable to have disagreements about what a theory predicts. The more vague a theory is (and in this case it seems pretty vague), the less you can reasonably claim someone is objectively wrong about what the theory predicts, since there seems to be considerable room for ambiguity about the structure of the theory. As far as I can tell, none of the reasoning in this thread has been on a level of precision that warrants high confidence in what particular theories of scheming do or do not predict, in the absence of further specification.