Fundraiser (volunteer) for the French branch of Pause AI. I also have a small blog (in French) about the AI alignment problem.
Romain Deléglise
Makes sense, thanks for the answer.
Hello,
Thanks for your post.
I am happy that some people did this research, but I will push back on some points.
First, inner alignment is not solved even with these results, because we don’t know whether the model really internalised the ethical values or some proxy that correlated with them within the environment. Maybe the new tool by Anthropic (Natural Language Autoencoders) can help distinguish the two, but I don’t think the tool is precise and reliable enough.
Second, the generalization seems weaker than it appears. Basically, the alignment generalized across ethical scenarios, and the model acts (very) ethically in all of them. But the dangerous generalization is the one between current weak models with limited possibilities of action and far more powerful models in different environments, with the capacity to effectively choose between different options for their actions. If the models don’t have the possibility to take different actions, you can’t see from the outside what has been internalized.
Third, better proxies are still proxies. I agree that we can likely obtain systems with better proxy-aligned utility functions by training on carefully selected data. But I struggle to understand why we would obtain anything other than proxies with this type of training. (As with evolution, which has not directly written “reproduce and pass on your genes” inside my head, but proxies instead.)
Reducing blackmail to zero in current environments seems nice, but I feel that we must be careful about what exactly is solved and about the difficulty still ahead of us.
Maybe I don’t understand your points.
From my point of view, we don’t need a perfect reward/loss function to be in deep trouble with RSI; that’s basically the reason why I expect LLMs to likely scale to AGI and then probably ASI.
Likely all we need is a function good enough to start the RSI loop. Some domains need humans to verify the quality of the AI’s answers, but it seems the most important domains for RSI don’t need human verification.
These domains are:
Code (fully automated)
Maths (fully automated)
Physics and computer simulations (mostly automated, because you can reason without experiments using models, but probably not entirely, because new things probably still need some experiments?).
AI research itself (benchmarks exist today and are a good approximation of AI capabilities; I expect that the AI can create its own benchmarks and then test using trial and error or a better algorithm). Problems like overfitting may slow progress down a bit here, but again we don’t need a perfect reward function; a good enough function is largely enough. Basically, the AI needs to be able to run some tests and measure the new capabilities.
These domains reinforce each other in some ways (for example, math and code help a lot in physics and computer simulation, and physics and computer simulation help in AI research).
Other things can help in the RSI phase, but I expect these domains might be enough for the AI to bootstrap itself.
We don’t need an ASI that is good in all domains; if it’s good enough in the important domains, it’s probably over.
A second very important issue is that even if LLMs don’t scale to AGI/ASI, they might be able to create another system (different from an LLM) that will scale. You recognize the problem, as far as I understand, but don’t seem to accept the conclusion (LLM scaling is terrible even if LLMs don’t directly scale to AGI/ASI).
(As an aside, some people seem to have other theories for why LLMs will scale but don’t want to speak about them publicly because they could be misused. That might be useful to keep in mind too, even if we can’t judge their ideas.)