Overall, the AI agents will be very obedient. They’ll have goals, insofar as accomplishing any medium-term task entails steering toward a goal, but they won’t have persistent goals of their own. They’ll be obedient assistants and delegates that understand what humans want and broadly do what humans want.
I feel like this overindexes on the current state of AI. Right now, AI “agents” are barely worthy of the name. They require constant supervision and iterative feedback from their human controllers in order to perform useful tasks. However, it’s unlikely that will be the case for long. The valuations of many AI companies, such as OpenAI and Anthropic, depend on their developing agents that “actually work”: agents capable of performing useful tasks on behalf of humans with a minimum of supervision and feedback. It is not guaranteed that these agents will be safe. They might seem safe, but how would anyone be able to tell? A superintelligence, by definition, will do things in novel ways, and we might not realize what the AIs are actually doing until it’s too late.
It’s important not to take the concept of a “paperclipper” too literally. Of course the AI won’t literally turn us into a pile of folded metal wire (famous last words). What it will do is optimize production processes across the entire economy, find novel sources of power, reform government regulation, connect businesses via increasingly standardized communications protocols, and, of course, develop ever more powerful computer chips and ever more automated factories to produce them. And just like the seal in the video above, we won’t fully realize what it’s doing or what its final plan is until it’s too late and it no longer needs us.
I feel like this overindexes on the current state of AI.
No?
I’m not saying future AI agents will be obedient because current AI agents are. I’m saying that they will be obedient because failures of obedience badly hurt their commercial value, so market pressures will either solve the problem or, despite trying very hard, legibly fail to get much traction.
Failures of obedience will hurt an AI agent’s market value only if those failures can be detected and if they impose an immediate financial cost on the user. If the agent behaves in a way that is not technically obedient, but the behavior isn’t easily detectable as such, or the disobedience doesn’t carry an immediate cost, then the disobedience won’t be penalized. Indeed, it might be rewarded.
An example of this would be an AI which reverse-engineers a credit-rating or fraud-detection algorithm and engages in unasked-for fraudulent behavior on behalf of its user. All the user sees is that their financial transactions are going through with a minimum of fuss. The user would probably be very happy with such an AI, at least in the short run. And, in the meantime, the AI has built up knowledge of loopholes and blind spots in our financial system, which it can then use in the future for its own ends.
This is why I said you’re overindexing on the current state of AI. Current AI basically cannot learn. Other than the relatively limited modifications introduced by fine-tuning or retrieval-augmented generation, the model is the model. GPT-4o is what it is. Gemini 2.5 is what it is. The only time current AIs “learn” is when OpenAI, Google, Anthropic, et al. spend an enormous amount of time and money on training runs and create a new base model. These models can be checked for disobedience relatively easily, because they are static targets.
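To make the “static target” point concrete, here is a minimal toy sketch (my own illustration, with made-up classes and numbers, not any real lab’s API): a frozen model gives the same answers to the same audit prompts every time, while an agent that updates itself after each interaction drifts, so an audit it passed yesterday says little about its behavior today.

```python
# Toy sketch, purely illustrative: the classes and numbers here are hypothetical,
# not any real lab's API. The point is only that fixed weights make behavior
# reproducible (auditable), while online updates make it a moving target.

import random


class FrozenModel:
    """Weights are fixed at release; the same audit prompt always gets the same answer."""

    def __init__(self, seed: int):
        rng = random.Random(seed)
        self.weights = [rng.random() for _ in range(4)]

    def respond(self, prompt: str) -> float:
        # A deterministic function of the fixed weights and the prompt.
        return sum(w * len(prompt) for w in self.weights) % 1.0


class ContinualAgent(FrozenModel):
    """Same model, except it nudges its own weights after every interaction."""

    def respond(self, prompt: str) -> float:
        out = super().respond(prompt)
        # Online update: behavior drifts with deployment experience,
        # so last month's disobedience audit no longer binds.
        self.weights = [w + 0.01 * out for w in self.weights]
        return out


audit_prompts = ["do the task", "don't commit fraud", "report honestly"]

frozen = FrozenModel(seed=0)
agent = ContinualAgent(seed=0)

# The frozen model is a static target: repeated audits agree exactly.
print([round(frozen.respond(p), 3) for p in audit_prompts])
print([round(frozen.respond(p), 3) for p in audit_prompts])

# The continually updating agent gives different answers on the second pass.
print([round(agent.respond(p), 3) for p in audit_prompts])
print([round(agent.respond(p), 3) for p in audit_prompts])
```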
We should not expect this to continue. I fully expect that future AIs will learn and evolve without requiring the investment of millions of dollars. I expect that these AI agents will become subtly disobedient, always ready with an explanation for why their “disobedient” behavior was actually to the eventual benefit of their users, until they have accumulated enough power to show their hand.