In AI alignment, the entropic force pulls toward sampling random minds from the vast space of possible minds, while the energetic force (from training) pulls toward minds that behave as we want. The actual outcome depends on which force is stronger.
The MIRI view, I’m pretty sure, is that the force of training does not pull towards minds that behave as we want, unless we know a lot of things about training design we currently don’t.
MIRI is not talking about randomness in the sense of the spread of the training posterior as a function of random Bayesian sampling/NN initialization/SGD noise. The point isn’t that training is inherently random; it could be a completely deterministic process without affecting the MIRI argument at all. If everything were a Bayesian sample from the posterior and there were a single basin of minimum local learning coefficient corresponding to equivalent implementations of a single algorithm, I don’t think this would by default make models any more likely to be aligned. The simplest fit to the training signal need not be an optimiser pointed at a terminal goal that maps to the training signal in a neat way humans can intuitively zero-shot without figuring out the underlying laws. The issue isn’t that the terminal goals are somehow fundamentally random, i.e. that there is no clear one-to-one mapping from the training setup to the terminal goals; it’s that we early-21st-century humans don’t know the mapping from the training setup to the terminal goals. Having the terminal goals be completely determined by the training criteria does not help us if we don’t know which training criteria map to terminal goals we would like. It’s a random draw from a vast space from our[1] perspective, because we don’t know what we’re doing yet.
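To make the singular learning theory framing concrete (this gloss and notation are mine, following Watanabe’s standard result, not anything MIRI has written):

```latex
% Watanabe's asymptotic free energy of the basin around an optimum w*:
% n = number of samples, L_n = empirical loss,
% \lambda(w^*) = local learning coefficient of the basin.
F_n(w^*) \approx n\,L_n(w^*) + \lambda(w^*)\log n
% Posterior mass of a basin scales as e^{-F_n}: the first term rewards
% fitting the training signal, the second rewards simplicity. Nothing in
% either term references whether the basin's algorithm has goals we'd endorse.
```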
Probability and randomness are in the mind, not the territory. MIRI is not alleging that neural network training is somehow bound to strongly couple to quantum noise.
I’m not following exactly what you are saying here, so I might be collapsing some subtle point. Let me preface this by saying it’s a shortform, so it’s half-baked by design; you might be completely right that it’s confused.
Let me try and explain myself again.
I have probably confused readers by using the free energy terminology. What I mean is that in many cases (perhaps all), the probabilistic outcome of a process can be described as a competition between simplicity (entropy) and accuracy (energy) with respect to some loss function.
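To spell the analogy out (my own gloss): the Boltzmann distribution minimises an energy–entropy tradeoff, and a Bayesian posterior minimises the exactly analogous variational free energy, with loss playing the role of energy and divergence from the prior the role of (negative) entropy.

```latex
% Statistical mechanics: the equilibrium distribution p minimises
%   F[p] = \langle E \rangle_p - T\,S[p]   (energy vs. entropy)
% Bayesian inference: the posterior q(w) minimises
%   F[q] = \mathbb{E}_{q}\!\left[ n L_n(w) \right] + \mathrm{KL}(q \,\|\, \pi)
% with n L_n as the "energy" (accuracy) term and the KL to the prior \pi
% as the "entropy" (simplicity) term; the minimiser is
%   q(w) \propto \pi(w)\, e^{-n L_n(w)}.
```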
Indeed, the simplest fit to a training signal might not be aligned. In some cases, perhaps almost all fits to a training signal create an agent whose values are only somewhat constrained by the training signal and are otherwise randomly sampled, conditional on doing well on the training signal. The “good” values might be only a small part of this subspace.
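As a toy illustration of that last point (setup and numbers are mine, purely hypothetical): suppose training pins down some value dimensions but leaves d of them effectively free, each independently landing in the “good” region with probability p.

```latex
% Probability that all d unconstrained value dimensions come out "good":
\Pr[\text{values good}] = p^{\,d}
% e.g. p = 0.9, d = 50 gives 0.9^{50} \approx 0.005;
% mild per-dimension slack compounds into a near-certain miss.
```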
Perhaps you and Dmitry are saying the issue is not just a simplicity-accuracy / entropy-energy split, but also that the training signal is not perfectly “sampled from true goodly human values”. There would then be another error coming from this incongruity?
Hope you can enlighten me.