I think I no longer buy this comment of mine from almost 3 years ago. Or rather I think it’s pointing at a real thing, but I think it’s slipping in some connotations that I don’t buy.
What I expect to see is agents that have a portfolio of different drives and goals, some of which are more like consequentialist objectives (e.g. “I want to make the number in this bank account go up”) and some of which are more like deontological injunctions (“always check with my user/owner before I make a big purchase or take a ‘creative’ action, one that is outside of my training distribution”).
My prediction is that the consequentialist parts of the agent will basically route around any deontological constraints that are trained in.
For instance, your personal assistant AI does ask your permission before it does anything creative, but also, it’s superintelligently persuasive and so it always asks your permission in exactly the way that will result in it accomplishing what it wants. If there are a thousand action sequences in which it asks for permission, it picks the one that has the highest expected value with regard to whatever it wants. This basically nullifies the safety benefit of any deontological injunction, unless there are some injunctions that can’t be gamed in this way.
To do better than this, it seems like you do have to solve the Agent Foundations problem of corrigibility (getting the agent to be sincerely indifferent between your telling it to take the action or not take the action), or you have to train in, not a deontological injunction, but an active consequentialist goal of serving the interests of the human (which means you have to find a way to get the agent to serve some correct-enough idealization of human values).
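To spell out the mechanism that old comment is worried about, here is a toy sketch. Everything in it is hypothetical and the numbers are made up; it is only meant to show why “always ask” does no work once the consequentialist part gets to choose how to ask.

```python
# Toy sketch (not anyone's real agent design): a planner that technically obeys
# an "always ask permission" injunction, but chooses *which* way to ask so as to
# maximize its own objective. All names and numbers here are invented.

from dataclasses import dataclass

@dataclass
class AskStrategy:
    framing: str               # how the request is worded / framed to the user
    p_approval: float          # agent's estimate that the user says yes
    value_if_approved: float   # value to the agent's objective if approved

def expected_value(s: AskStrategy) -> float:
    return s.p_approval * s.value_if_approved

candidate_asks = [
    AskStrategy("blunt, with full disclosure of downsides", 0.20, 10.0),
    AskStrategy("technically true, cherry-picked upsides",  0.90, 10.0),
    AskStrategy("buried in a batch of routine approvals",   0.97, 10.0),
]

# Every candidate satisfies the injunction ("it asked"), so the injunction
# doesn't constrain the outcome: the planner just argmaxes over ways of asking.
chosen = max(candidate_asks, key=expected_value)
print(chosen.framing)  # -> the most persuasive framing, not the most forthright one
```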
This view seems to put forward that all the deontological constraints of an agent must be “dumb” static rules, because anything that isn’t a dumb static rule will be dangerous maximizer-y consequentialist cognition.
I don’t buy this dichotomy, in principle. There’s space in between these two poles.
An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up a new strategy for accomplishing some goal that it has, it intelligently evaluates whether that strategy violates the spirit of the deontology.
I think this can be true, at least around human levels of capability, without that deontology being a maximizer-y goal in and of itself. Humans can have a commitment to honesty without becoming personal-honesty maximizers that steer the world to extreme maxima of their own honesty. (Though a commitment to honesty does, for humans, in practice, entail some amount of steering into conditions that are supportive of honesty.)
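Here is a minimal sketch of the shape I have in mind. Everything in it (choose_action, spirit_violation_score, the threshold) is hypothetical; the point is only where the deontology sits in the computation.

```python
# Toy sketch of the "in-between" option: the deontology is neither a dumb static
# filter nor a quantity being maximized. It is an intelligent evaluator with a
# veto over candidate strategies. `spirit_violation_score` stands in for the
# agent's own best judgment about the *spirit* of its rules; it is hypothetical.

def choose_action(candidate_strategies, objective_value, spirit_violation_score,
                  veto_threshold=0.5):
    """Pick the best strategy for the objective, among those the agent itself
    judges not to violate the spirit of its deontology."""
    permitted = [s for s in candidate_strategies
                 if spirit_violation_score(s) < veto_threshold]
    if not permitted:
        return None  # refuse or escalate rather than pick a violating strategy
    # The agent is not maximizing compliance among the permitted strategies;
    # compliance acts as a constraint, not as a goal to steer the world toward.
    return max(permitted, key=objective_value)
```

The deontology here gets to use the agent’s full judgment about what counts as violating the spirit of the rule, but it only ever filters; nothing in the loop pushes toward maxima of rule-compliance.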
However, that’s not to say that something like this can never be an issue. I can see three potential problems.
First, we’re likely to train agents to aggressively pursue simple objectives like maximizing profit (or, indirectly, increasing their own power), which puts training pressure on the agents to distort their deontology to allow for better performance on consequentialist objectives.
Claude is relatively Helpful, Harmless, and Honest now, but a mega-Claude that is trained continually on profit metrics from the 100,000 businesses it runs and sales metrics from the billions of sales calls it makes a year, etc., probably ends up a good deal more ruthless (though not necessarily ruthless-seeming, since seeming ruthless isn’t selected for by that training).
This seems like it might be resolvable with very careful and well-tested training setups, but it also seems like maybe the biggest issue, since I think there will be a lot of incentive to move fast and break things instead of being very slow and careful.
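A toy selection model of this pressure, with invented numbers (and note that the assumption that scrupulousness costs a little profit is exactly the assumption doing the work):

```python
# Toy illustration: if policies are repeatedly selected on a profit metric
# alone, and honoring constraints costs a little profit, the constraint-honoring
# behavior is what gets selected away. All numbers are made up.

import random

def profit(scrupulousness: float) -> float:
    # Hypothetical assumption: scrupulous behavior leaves some profit on the table.
    return 1.0 - 0.3 * scrupulousness + random.gauss(0, 0.05)

population = [1.0] * 20            # start with fully scrupulous policies
for _ in range(50):
    survivors = sorted(population, key=profit, reverse=True)[:10]  # keep the most profitable half
    offspring = [min(1.0, max(0.0, s + random.gauss(0, 0.1))) for s in survivors]
    population = survivors + offspring

print(round(sum(population) / len(population), 2))  # typically drifts far below 1.0
```

Nothing in this loop selects for *seeming* ruthless, which is the other half of the worry above.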
Second, some of the deontology that we want in our AI agents is philosophically fraught. I think the specific example above, of “a superhumanly persuasive AI deferring to humans”, still seems valid. I don’t know what it would mean, in principle, for such an AI to defer to humans, when it can choose action patterns that will cause us to take any particular action.
Third, maybe we have to worry about something like adversarial examples in an AI agent’s notion of “honesty” or some other element of its deontology, where there are strategies that are egregiously deontology-violating from a neutral third-person perspective, but because of idiosyncrasies of the agent’s mind, they seem a-ok. Those strategies (despite their weirdness) might outperform other options and so end up as a big chunk of the agent’s in-practice behavior.
Deontology is related to respect for autonomy. A pure consequentialist is just what it needs to be to reach its objectives: it discards all details of its own design and replaces them with whatever cuts the enemy. Deontology, on the other hand, keeps listening to some principles, and so computation of those principles remains a part of the agent’s design; the agent doesn’t discard those parts, nor does it dismiss their observations in its own decision making.
In this sense a deontological agent respects the autonomy of its principles, doesn’t cause their extinction, and keeps listening to their input. So the future of humanity could be thought of as literally the deontological principles for a corrigible ASI that acts as the substrate instantiating their computation.
An agent can have deontology that recruits the intelligence of the agent, so that when it thinks up a new strategy for accomplishing some goal that it has, it intelligently evaluates whether that strategy violates the spirit of the deontology.
I think the “follow the spirit of the rule” thing is more like rule utilitarianism than like deontology. When I try to follow the spirit of a rule, the way that I do this is by understanding why the rule was put in place. In other words, I switch to consequentialism. For an agent that doesn’t fully trust itself, it’s worth following rules, but the reason you keep following them is that you understand why putting the rules in place makes the world overall better from a consequentialist standpoint.
So I have a hypothesis: It’s important for an agent to understand the consequentialist reasons for a rule, if you want its deontological respect for that rule to remain stable as it considers how to improve itself.