Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier then we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence “You are a smart, helpful assistant.” into your AI would, most of the time, give us a significant chunk of the behavior we need?
Human value may be complex and fragile, but LLMs are good at understanding complex and fragile things, given enough training data. In some ways alignment has turned out to be a lot easier then we feared a decade ago. In hindsight, it now seems rather obvious that anything smart enough to be dangerous would need to be capable enough to understand things that are complex and fragile. And who would have dared suggest a decade ago that just inputting the sentence “You are a smart, helpful assistant.” into your AI would, most of the time, give us a significant chunk of the behavior we need?