Fundraiser (volunteer) for the French branch of Pause AI. I also have a small blog (in French) about the AI alignment problem.
Romain Deléglise
Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back on some points.
First, inner alignment is not solved even with these results, because we don’t know whether the model really internalised the ethical values or some proxy that correlated with them within the environment. Maybe the new tool from Anthropic (Natural Language Autoencoders) can help distinguish the two, but I don’t think the tool is precise and reliable enough.
Second, the generalization seems to be weaker than it appears. Basically, the alignment generalized across ethical scenarios, and the model acts (very) ethically in all of them. But the dangerous generalization is the one between current weak models with limited possibilities for action and far more powerful models in different environments that can effectively choose among different options for their actions. If the models don’t have the possibility of taking different actions, you can’t see from the outside what has been internalized.
Third, better proxies are still proxies. I agree that we can likely obtain systems whose utility functions are aligned with better proxies by using carefully selected training data. But I struggle to understand why we would obtain anything other than proxies with this type of training. (Just as evolution did not directly write “reproduce and pass on your genes” inside my head, but proxies instead.)
Reducing blackmail to zero in current environments seems nice, but I feel that we must be careful about what exactly has been solved and about the difficulty still ahead of us.
Makes sense, thanks for the answer.