In defense of probably wrong mechanistic models

This is a short post on a simple point that I get asked about a lot and want a canonical reference for.

Which of the following two options is more likely to be true?

  1. AIs will internally be running explicit search processes.

  2. AIs will internally be doing something weirder and more complicated than explicit search.

In my opinion, whenever you’re faced with a question like this, the truth is always weirder than you think, and you should pick option (2), or its equivalent, every single time. The problem, though, is that while option (2) is substantially more likely to be correct, it’s not at all predictive. It’s effectively just the “not (1)” hypothesis: it gets a lot of probability mass because it covers a lot of the space, but precisely because it covers so much of the space, it’s extremely difficult to operationalize into any concrete predictions about what your AI will actually do.

The aphorism here is “All models are wrong, but some are useful.” Not having a model at all and just betting on the “something else” hypothesis will always be more likely to be correct than any specific model, but having specific models is nevertheless highly useful in a way that the “something else” hypothesis just isn’t.

Thus, I strongly believe that we should try our best to make lots of specific statements about internal structures, even when we know those statements are likely to be wrong, because when we let ourselves make specific, structural, mechanistic models, we can get real, concrete predictions. And even if a model is literally false, to the extent that it bears some plausible relationship to reality, the predictions it makes can still be quite accurate.

Furthermore, one of my favorite strategies here is to come up with many different, independent mechanistic models and then see if they converge: if you get the same prediction from lots of different mechanistic models, that adds a lot of credence to that prediction being quite robust. An example of this in the setting of modeling inductive biases is my “How likely is deceptive alignment?” post, where I take two relatively independent (but both probably wrong) stories, high and low path-dependence, and find that they both seem to imply a similar prediction about deceptive alignment. I think that lends a lot of credence to the prediction, even if the specific models of inductive biases presented are unlikely to be literally correct.

Going back to the original question about explicit search, this is essentially how I like to think about the arguments in “Risks from Learned Optimization”: we argue that explicit search is a plausible model and explore what its predictions are. While I think the response “literally explicit search is unlikely” is potentially correct (though it depends on exactly how broad or narrow your understanding of explicit search is), it’s not very constructive. My response is usually: “okay, so what’s a better mechanistic model, then?” That’s not to say that I don’t think there are any better mechanistic models than explicit search for what a powerful AI might be doing, but it is to say that coming up with some alternative mechanistic model is a necessary step in improving on existing mechanistic models.
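
To make “explicit search” a bit more concrete, here is a minimal toy sketch of what an explicitly searching policy could look like mechanistically: enumerate candidate plans, simulate each one with a world model, score the outcomes against an internally represented objective, and act on the best plan found. The Python below is purely illustrative; the tiny 1-D environment, the function names, and the clean decomposition into world model and objective are all invented for the example, and real networks are very unlikely to factor this neatly.

```python
# Toy illustration of an "explicit search" policy: enumerate candidate plans,
# score each against an internally represented objective, act on the best one.
# This is a hypothetical sketch, not a claim about how real networks compute.
from itertools import product

ACTIONS = ["left", "right", "stay"]

def world_model(state: int, action: str) -> int:
    """Hypothetical learned transition model (here: a trivial 1-D world)."""
    return state + {"left": -1, "right": 1, "stay": 0}[action]

def internal_objective(state: int) -> float:
    """Hypothetical internally represented goal: get close to position 3."""
    return -abs(state - 3)

def explicit_search_policy(state: int, horizon: int = 3) -> str:
    """Return the first action of the plan whose simulated outcome scores best."""
    best_plan, best_score = None, float("-inf")
    for plan in product(ACTIONS, repeat=horizon):
        s = state
        for action in plan:
            s = world_model(s, action)
        score = internal_objective(s)
        if score > best_score:
            best_plan, best_score = plan, score
    return best_plan[0]

print(explicit_search_policy(0))  # -> "right"
```

The point of positing a structure like this is exactly the one above: even though the literal version is probably wrong, it is specific enough to generate concrete predictions that can then be checked, refined, or replaced by a better mechanistic model.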