Bias towards simple functions; application to alignment?

Summary

Deep neural networks (DNNs) are generally used with large numbers of parameters relative to the number of training data points, so the solutions they output are far from uniquely determined. How do DNNs ‘choose’ which solution to output? Some fairly recent papers ([1], [2]) seem to suggest that DNNs are biased towards outputting solutions which are ‘simple’, in the sense of Kolmogorov complexity (roughly, minimum description length) ([3]). It seems to me that this might be relevant to questions around AI alignment. I’ve not thought this idea through in detail; rather, I’m writing this post to ask whether people are aware of work in this direction (or see obvious reasons why it is a terrible idea) before I devote substantial time to it.

Example

Setting: some years in the future. I instruct Siri: “tell me a story”. Here are two things Siri could do:

  1. copy-paste an existing story from somewhere on the internet.

  2. kidnap a great author and force them to write a story (by now Siri is smart enough to do this).

I would much prefer Siri to choose option 1. There are lots of ways I might go about improving my request, or improving Siri, to make option 1 more likely. But I claim that, assuming Siri is still based on the same basic architecture we are currently familiar with (and probably even if not), it is already more likely to choose option 1, because option 1 is simpler. Slightly more formally: explicit instructions for carrying out option 1 can be written out in far fewer characters than explicit instructions for option 2.

Now, making this claim precise is hard, and if one looks too closely it tends to break down. For example, the instruction string ‘tell me a story’ could lead to option 1 or to option 2 without any change in its length. More generally, measures of complexity tend to be defined only ‘up to a constant’, with the constant depending on the programming language; and if the language is ‘natural language that Siri parses’, then I am exactly back where I started. But I suspect this is not an unsolvable problem; one could, for example, think about the length of the message that Siri has to write to instruct its sub-agents/kidnapping drones.
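As a toy illustration of the ‘fewer characters’ intuition (a sketch of my own, not anything from [1]–[3]): Kolmogorov complexity is uncomputable, but compressed length is a standard crude proxy for description length, and on the invented instruction strings below it points the expected way.

```python
import zlib

def description_length(instructions: str) -> int:
    """Crude, computable proxy for description length: the size in bytes of a
    zlib-compressed encoding (Kolmogorov complexity itself is uncomputable)."""
    return len(zlib.compress(instructions.encode("utf-8")))

# Hypothetical instruction strings Siri might have to spell out for a sub-agent;
# both are invented for this example.
option_1 = "search the web for a public-domain short story and read it aloud"
option_2 = (
    "identify a great living author, locate them, plan and carry out an "
    "abduction without being detected, coerce them into writing a new story, "
    "then read that story aloud"
)

print(description_length(option_1), description_length(option_2))
# On this (very rough) measure, option 1 takes fewer bytes to specify.
```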

Slightly more formal statement

I’ll try to translate the general setup of ([2]) to the Siri example above, to make things more concrete, staying close to their notation. All errors lie in the translation, etc. etc. We have the set $W$ of finite sequences of (English) words and the set $A$ of actions Siri can take. The trained model Siri is a function

$S \colon W \to A,$

responding to my requests by taking some action.

How was Siri created? We started with a space $\Theta = \mathbb{R}^p$ of parameters for some (large) $p$, and a ‘parameter-function’ map

$M \colon \Theta \to \{\text{functions } W \to A\},$

which is given by some kind of DNN. We also had a bunch of training data

$T \subseteq W \times A,$

consisting of pairs of an input and a ‘good’ resulting action (to simplify, let’s assume each input occurs at most once in $T$).
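(To make the notation concrete, here is how I think of these objects in code; the types below are toy stand-ins of my own, not anything from [2].)

```python
from typing import Callable, List, Tuple

import numpy as np

# Toy stand-ins for the objects above (my own notation, not from [2]).
Request = Tuple[str, ...]   # an element of W: a finite sequence of words
Action = str                # an element of A: an action Siri can take
Theta = np.ndarray          # a point in the parameter space R^p

# The parameter-function map M : Theta -> {functions W -> A},
# given in practice by some DNN architecture; left abstract here.
ParameterFunctionMap = Callable[[Theta], Callable[[Request], Action]]

# Training data T ⊂ W × A: pairs of an input and a 'good' resulting action.
TrainingData = List[Tuple[Request, Action]]
```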

Now, there will probably exist very many values of the parameter $\theta \in \Theta$ such that $M(\theta)(x) = y$ for all $(x, y) \in T$; let’s call such a parameter fitted to $T$. And, assuming that

$x_0 = \text{‘tell me a story’}$

does not appear as the first coordinate of a point in $T$, we can cook up values $\theta_1, \theta_2$, both fitted to $T$ and such that

$M(\theta_1)(x_0) = \text{option 1}$

and

$M(\theta_2)(x_0) = \text{option 2}.$

My key prediction is that, while $\theta_1$ and $\theta_2$ are both equally valid solutions to our optimisation problem, the DNN training process strongly favours $\theta_1$, because of its simplicity.

I’m not going to try to justify this here; I need to think about it more first (and read ([3]) more carefully). But I do think it is at least plausible.
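One cheap way to get a feel for the kind of bias reported in [2] (a toy sketch of my own, with a made-up architecture, not their experiment): fix a tiny network on 3 Boolean inputs, sample its parameters $\theta$ at random, and tally which of the 256 possible Boolean functions $M(\theta)$ computes. In runs of this kind the empirical distribution comes out far from uniform, with simple truth tables such as the constant functions taking most of the mass.

```python
from collections import Counter
from itertools import product

import numpy as np

rng = np.random.default_rng(0)

# All 2^3 = 8 Boolean inputs; each parameter draw theta yields one of the
# 2^8 = 256 possible Boolean functions, recorded as an 8-character truth table.
inputs = np.array(list(product([0.0, 1.0], repeat=3)))

def sample_function(hidden: int = 10) -> str:
    """Draw random parameters theta for a one-hidden-layer network and return
    the truth table of the Boolean function M(theta) it computes."""
    W1 = rng.normal(size=(3, hidden))
    b1 = rng.normal(size=hidden)
    W2 = rng.normal(size=hidden)
    b2 = rng.normal()
    out = np.tanh(inputs @ W1 + b1) @ W2 + b2
    return "".join("1" if o > 0 else "0" for o in out)

# Empirical distribution over functions induced by random parameters.
counts = Counter(sample_function() for _ in range(20_000))

# The head of the distribution is typically dominated by very simple truth
# tables (e.g. the constant functions '00000000' and '11111111').
for table, n in counts.most_common(10):
    print(table, n)
```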

References


[1] Lei Wu, Zhanxing Zhu, et al. Towards understanding generalization of deep learning: Perspective of loss landscapes. arXiv:1706.10239.

[2] Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv:1805.08522.

[3] Kamaludin Dingle, Chico Q. Camargo, and Ard A. Louis. Input–output maps are strongly biased towards simple outputs. Nature Communications, 9(1):761, 2018.