I think this perspective deserves to be taken seriously. It’s pretty much the commonsense and most common view.
My response is contained in my recent post LLM AGI may reason about its goals and discover misalignments by default.
In ultra-brief form: misalignment is most likely to come from none of the directions you’ve listed, but instead from an interaction. Improved cognitive capabilities will cause future LLMs to consider many more actions, and from many more perspectives. Smarter LLMs will think about their actions and their values much more thoroughly, because they’ll be trained to think carefully. This much broader thinking opens up many possibilities for new alignment misgeneralizations. On the whole, I think it’s more likely than not that alignment will misgeneralize at that point.
The full argument is in that post.
I think this is a vital question that hasn’t received nearly enough attention. There are plenty of reasons to worry that alignment isn’t currently good enough and only appears to be because LLMs have limited smarts and limited options. So I think placing high odds on alignment by default[1] is unrealistic.
On the other hand, those arguments are not strong enough to place the odds near zero, as many pessimists do. They’re often based on intuitions about the size of “goal space” and the effectiveness of training in selecting regions of that space. Attempts at detailed discussion seem to inevitably break down in mutual frustration.
So more careful discussions seem really important! My most recent and careful contribution to that discussion is in the linked post.
Here I assume you mean “alignment on the default path”, including everything that devs are likely to do to align future versions of LLMs, possibly up to highly agentic and superintelligent ones. That path relies heavily on targeted RL training.
I’ve also seen “alignment by default” used to mean just the base-trained model being human-aligned by default. I think this is wildly unrealistic, since humans are quite frequently not aligned with other humans, despite their many built-in and system-level reasons to be so. Hoping for alignment just from training on human language seems wildly optimistic or underinformed. Hoping for alignment on the current default path seems optimistic, but not wildly so.