This might be related to the following notion. Suppose we try to dictate the form of a model ahead of time (i.e. some of the parameters are labeled “world model” in the code, others are labeled “preferences”, and inference is done by optimizing the latter over the former), but then just train it to minimize error. In that case the actual content of the parameters after training doesn’t need to respect our preconceptions. What the model really “wants” to do in the limit of lots of compute is find a way to encode an accurate simulation of the human in the parameters, in a way that bypasses the simplifications we’re trying to force on it.
For this problem, which might not be what you’re talking about, I think a lot of the solution comes from algorithmic information theory. Trying to specify neat, human-legible parts for your model (despite not being able to train the parts separately) is kind of like choosing a universal Turing machine made of human-legible parts. In the limit of big powerfulness, the Solomonoff inductor will throw off your puny shackles and simulate the world in a highly accurate (and therefore non-human-legible) way. The solution is not better shackles; it’s an inference method that trades off between model complexity and error in a different way.
(P.S.: I think there is an “obvious” way to do that: minimum message length (MML) learning, with some time constant used to turn error rates into total discounted error, which can then be summed with model complexity.)
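
To make the P.S. a bit more concrete, here is a minimal sketch of one way the trade-off could be scored. Everything in it is my own assumption for illustration, not a worked-out proposal: the names `discounted_error`, `total_cost` and `pick_model` are hypothetical, the discount is taken to be exponential with factor exp(-1/time_constant), error rates are assumed constant and measured in bits per step, and the candidate numbers are made up.

```python
import math


def discounted_error(error_bits_per_step: float, time_constant: float) -> float:
    """Total discounted error for a constant per-step error rate.

    Assumes an exponential discount gamma = exp(-1 / time_constant), so the
    sum over t = 0, 1, 2, ... of gamma**t * error is error / (1 - gamma).
    """
    # -expm1(-x) equals 1 - exp(-x), computed accurately for small x.
    one_minus_gamma = -math.expm1(-1.0 / time_constant)
    return error_bits_per_step / one_minus_gamma


def total_cost(model_bits: float, error_bits_per_step: float,
               time_constant: float) -> float:
    """MML-style score: model complexity plus total discounted error."""
    return model_bits + discounted_error(error_bits_per_step, time_constant)


def pick_model(candidates, time_constant: float) -> str:
    """Return the name of the candidate with the lowest total cost.

    Each candidate is (name, model_bits, error_bits_per_step).
    """
    best = min(candidates, key=lambda c: total_cost(c[1], c[2], time_constant))
    return best[0]


# Made-up numbers: a small legible model with a higher error rate, and a
# huge simulation-like model with a very low error rate.
candidates = [
    ("legible", 1_000.0, 0.5),
    ("simulation", 1_000_000.0, 0.01),
]

short_horizon_choice = pick_model(candidates, time_constant=100.0)
long_horizon_choice = pick_model(candidates, time_constant=1e7)
```

With these made-up numbers, a short time constant makes model complexity dominate, so the small legible model wins; a long one makes accumulated error dominate, so the big simulation wins. The time constant is the knob that sets where the trade-off lands.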