I might be crazy, but is this good news? It seems like you’re finding that LLMs somehow learn a general concept of good and bad. One of the big alignment problems is even defining what good and bad are, so if they figure that out automatically, that seems great.
If you were pessimistic about LLMs learning a general concept of good/bad, then yes, that should update you.
However, I think the core problems still remain. If you are doing a simple continual learning loop (LLM → output → retrain to accumulate knowledge; analogous to ICL), we can ask how robust this process is. Do its values about how to behave drastically diverge? For example, over a hundred days of output, are there attractors it gets dragged towards that aren’t aligned at all? Can it be jailbroken, wittingly or not, by getting the model to produce garbage responses that it is then trained on?
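To make the loop I’m worried about concrete, here’s a minimal sketch. Everything in it (`model.generate`, `score`, `finetune_on`, the day count, the threshold) is a hypothetical placeholder for whatever generation, filtering, and retraining machinery the real setup uses; the point is only the shape of the feedback loop, not an actual implementation.

```python
# Hypothetical continual-learning loop: the model's own (scored) outputs
# become the next round of training data, day after day. All names here
# are illustrative placeholders, not any particular framework's API.

def continual_learning_loop(model, tasks, score, finetune_on,
                            days=100, keep_threshold=0.7):
    """Sketch of the LLM -> output -> retrain loop described above."""
    for _ in range(days):
        # The model produces outputs for today's tasks.
        outputs = [model.generate(task) for task in tasks]

        # Some reward/filter decides what gets kept. If this signal is even
        # slightly gameable (reward hacking) or noisy (garbage responses),
        # those outputs get folded back into the weights anyway...
        kept = [out for out in outputs if score(out) >= keep_threshold]

        # ...and small daily drifts can compound over the full run into an
        # attractor the model gets dragged towards.
        model = finetune_on(model, kept)
    return model
```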
And then there are questions like ‘does this hold up under reflection?’ or ‘does it attach itself to the concept of good, or to a ChatGPT-influenced version of good (or evil)?’ So while LLMs being capable of learning good is, well, good, there are still big targeting, resolution, and reflection issues.
For this post specifically, I believe it to be bad news. It provides evidence that subtle reward-hacking scenarios encourage the model to act misaligned in a more general way. It is likely quite nontrivial to get rid of reward-hacking-like behavior as training runs get larger and larger.
So if the model goes through a period where reward hacking is rewarded (a continual learning scenario is the easiest to imagine, but it could also happen during training), it may drastically change its behavior.
EDIT: I just saw this similar post and nevermind, we’re so doomed.
I think you’re referring to their previous work? Or you might find it relevant if you didn’t run into it. https://www.lesswrong.com/posts/ifechgnJRtJdduFGC/emergent-misalignment-narrow-finetuning-can-produce-broadly