All else equal, I think minimizing model entropy is desirable (i.e. the number of weights). In other words, you want to keep the size of the model class small.

Roughly, alignment could be viewed as constructing a list of constraints or criteria that a model must satisfy in order to be considered safe. As the size of the model class grows, more models will satisfy any particular constraint. The complexity of the constraints likely needs to grow along with the complexity of the model class.

If a large number of models satisfy all the constraints, there is a large amount of behavior that is unconstrained and unaccounted for. We’ve decided that we don’t care about any of the behavioral differences between the models that satisfy all the constraints.

This isn’t *necessarily* true. Modern DL models are semi-organically grown rather than engineered, so the set of SGD discoverable models is much smaller than the set of all possible models. And techniques like iterative amplification further shrink the set of learnable models. Or maybe many of the models are behaviorally identical on the subset of inputs we care about.

That said, thinking about model entropy seems helpful.

I’ve not really seen it written up, but it’s conceptually similar to the classic ML ideas of overfitting, over-parameterization, under-specification, and generalization. If you imagine your alignment constraints as a kind of training data for the model then those ideas fall into place nicely.

After some searching, the most relevant thing I’ve found is Section 9 (page 44) of Interpretable machine learning: Fundamental principles and 10 grand challenges. Larger model classes often have bigger Rashomon sets and different models in the same Rashomon set can behave very differently.