Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back on some points.
First, inner alignment is not solved even with these results, because we don’t know whether the model really internalised the ethical values or some proxy that correlated with them within the environment. Maybe the new tool (Natural Language Auto encoders) by Anthropic can help distinguish the two, but I don’t think the tool is precise and reliable enough.
Second, the generalization seems to be weaker than it appears. Basically, the alignment generalized between ethical scenarios, and the model acts (very) ethically in all of them. But the dangerous generalization is the one between current weak models with limited possibilities of action and far more powerful models in different environments, with the capacity to effectively choose between different options for their actions. If the models don’t have the possibility to take different actions, you can’t see from the outside what has been internalized.
Third, better proxies are still proxies. I agree that we can likely obtain systems whose utility functions are aligned to better proxies by using carefully selected training data. But I struggle to understand why we would obtain anything other than proxies with this type of training. (Just as evolution has not directly written “reproduce and pass on your genes” inside my head, but proxies instead.)
Reducing blackmail to zero in current environments seems nice, but I feel that we must be careful about what exactly is solved and about the difficulty still ahead of us.
The title of my post, “The Best Way to Align an LLM: Is Inner Alignment Now a Solved Problem?”, was intentionally provocative (thus the question mark): my actual opinion is that we’ve made a significant advance in Inner Alignment for LLMs, not that it’s a solved problem. Most of the academic research in this area suggests that alignment pretraining on a sufficiently large dataset can reduce misaligned behavior by something on the order of five-fold, which is quite significant, but doesn’t reach zero.
I was struck by Anthropic’s result that the training that generalized best was training on consistently answering difficult moral problems faced by users: I had not predicted that in advance. Perhaps answering people’s questions correctly is very important to Claude?
I’m not sure I agree that this is most usefully thought of as a proxy for the behavior we want. I think it’s more like a pointer: the enormous amount of pretraining data contains, among a great many other things, a huge amount of information about human values. The goal of alignment training is not to instill human values from scratch, but to point to them in the world model that pretraining created and say what to do with the information. Constitutional AI shows that even a very short, sentence-scale pointer can point to a surprising amount of data quite well. Anthropic are now working on much more voluminous and detailed pointers. This work is basically teaching the model that “Claude is a persona who is skilled at making moral judgements according to human values, and who acts morally”. However, just as a proxy can be imperfect, so can a pointer.
How well this will extend to ASI is obviously the vital question. The Orthogonality Thesis suggests that it’s possible to be very intelligent, and also both skilled at making moral judgements according to human values, and someone who acts morally. Whether this training is the best way to train that mentality is an open question, but there is quite a bit of evidence suggesting that skills learned quickly in posttraining are more fragile and generalize less well than ones instilled more slowly using a larger amount of data during pretraining or midtraining. It would be quite surprising if extensive Stochastic Gradient Descent training on making moral judgements well according to human values wasn’t a useful foundation for alignment training.
Makes sense, thanks for the answer.