I’m actually interested in your responses here. This is useful for my strategies how I frame things and understanding different people’s intuitions.
Do you think we can’t make autonomous agents that pursue goals well enough to get things done? Do you really think they’ll lose focus between being goal-focused long enough for useful work, and long enough for taking over the world if they interpret their goals differently than we intended? Do you think there’s no way RL or natural language could be misinterpreted?
I’m thinking it’s easy to keep an LLM agent goal-focused; if RL doesn’t do it, we’d just have a bit of scaffolding that every so often injects a prompt “remember, keep working on [goal]!”
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you’re superintelligent/competent, it doesn’t take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Ghengis Khan probably got distracted a lot but he did alright at murdering, and he was only human.
Humans are optimizing AI and then AGI to get things done. If they can do that, we should ask what they’re going to want to do.
Deep learning typically generalizes correctly within the training set. Once something is superintelligent and unstoppable, we’re going to be way outside of the training set.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I’ve got to focus on work so that’s got to be my last object-level contribution. Thanks for conversing.
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we’ll have agents which are actively undergoing more RL while they’re still in deployment. This means you can replicate the way humans learn to stay focused on tasks they’re passionate about by just being positively reinforced for doing it all the time. My contention is just that, to the extent that the RL is misunderstood, it probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think you like, reliably reinforce the model for being nice to humans, but it misunderstands “being nice to humans” in such a way that causes it to end up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that’s a kind of “misunderstanding your creators’ intentions”, but like… I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. “i thought i would enjoy this but i didn’t”? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...
I’m actually interested in your responses here. This is useful for my strategies how I frame things and understanding different people’s intuitions.
Do you think we can’t make autonomous agents that pursue goals well enough to get things done? Do you really think they’ll lose focus between being goal-focused long enough for useful work, and long enough for taking over the world if they interpret their goals differently than we intended? Do you think there’s no way RL or natural language could be misinterpreted?
I’m thinking it’s easy to keep an LLM agent goal-focused; if RL doesn’t do it, we’d just have a bit of scaffolding that every so often injects a prompt “remember, keep working on [goal]!”
The inference-compute scaling results seem to indicate that chain of thought RL already has o1 and o3 staying task focused for millions of tokens.
If you’re superintelligent/competent, it doesn’t take 100% focus to take over the world, just occasionally coming back to the project and not completely changing your mind.
Ghengis Khan probably got distracted a lot but he did alright at murdering, and he was only human.
Humans are optimizing AI and then AGI to get things done. If they can do that, we should ask what they’re going to want to do.
Deep learning typically generalizes correctly within the training set. Once something is superintelligent and unstoppable, we’re going to be way outside of the training set.
Humans change their goals all the time, when they reach new conclusions about how the world works and how that changes their interpretations of their previous goals.
I am curious about your intuitions but I’ve got to focus on work so that’s got to be my last object-level contribution. Thanks for conversing.
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we’ll have agents which are actively undergoing more RL while they’re still in deployment. This means you can replicate the way humans learn to stay focused on tasks they’re passionate about by just being positively reinforced for doing it all the time. My contention is just that, to the extent that the RL is misunderstood, it probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think you like, reliably reinforce the model for being nice to humans, but it misunderstands “being nice to humans” in such a way that causes it to end up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that’s a kind of “misunderstanding your creators’ intentions”, but like… I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. “i thought i would enjoy this but i didn’t”? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...