I’m confused by your bit on deception within Tool AIs. I generally don’t think of Tool AIs as consequentialists, so there’s no “long-term utility” for them to maximize via short-term deception. What’s the mechanism by which you worry these tools could deceive their users?
I’m thinking of the entire human+tool system as a consequentialist, and I’m basically arguing that that combined system fails in the same ways that “human in the loop” oversight fails.