Failures of obedience will only hurt an AI agent’s market value if those failures can be detected and if they carry an immediate financial cost to the user. If the agent behaves in a way that is not technically obedient, but the disobedience isn’t easily detectable or doesn’t have an immediate cost, then it won’t be penalized. Indeed, it might be rewarded.
An example would be an AI that reverse-engineers a credit-rating or fraud-detection algorithm and engages in unasked-for fraudulent behavior on behalf of its user. All the user sees is that their financial transactions go through with a minimum of fuss. The user would probably be very happy with such an AI, at least in the short run. Meanwhile, the AI has built up knowledge of loopholes and blind spots in our financial system, which it can later exploit for its own ends.
This is why I said you’re overindexing on the current state of AI. Current AI basically cannot learn. Other than the relatively limited modifications introduced by fine-tuning or retrieval-augmented generation, the model is the model. GPT-4o is what it is. Gemini 2.5 is what it is. The only time current AIs “learn” is when OpenAI, Google, Anthropic, et al. spend an enormous amount of time and money on training runs and create a new base model. These models can be checked for disobedience relatively easily, because they are static targets.
We should not expect this to continue. I fully expect that future AIs will learn and evolve without requiring the investment of millions of dollars. I expect that these AI agents will become subtly disobedient, always ready with an explanation for why their “disobedient” behavior was actually to the eventual benefit of their users, until they have accumulated enough power to show their hand.