I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point we’ll have agents which are still undergoing RL while they’re in deployment. That means you can replicate the way humans learn to stay focused on tasks they’re passionate about, which is by just being positively reinforced for doing them all the time. My contention is just that, to the extent that the model misunderstands what it’s being RL’d for, that probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think you can reliably reinforce a model for being nice to humans and have it misunderstand “being nice to humans” so badly that it ends up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing them too. I guess that’s a kind of “misunderstanding your creators’ intentions”, but I expect those kinds of errors to follow from fairly tractable oversights, i.e. failing to teach the model the right caveats on intended-but-dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
Edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? E.g. “I thought I would enjoy this but I didn’t”? I wonder if you could mitigate problems in this area just by telling an LLM the principles behind its constitution. I need to think about this more...