You’re right, current AIs don’t have utility functions in the strict formal sense. I was using the term loosely to refer to the optimisation objectives we train them on, like loss functions or reward signals. My point is that the current objectives do not reliably reflect human moral values. Even if today’s systems aren’t agents in Yudkowsky’s sense, the concern still applies: as systems gain more general capabilities, optimisation toward poorly aligned goals can have harmful consequences.
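To make the distinction concrete, here is a toy sketch (purely illustrative, not a description of any real system; the names and the “world state” key are made up): the objective a model is actually trained on is a loss over its predictions, whereas a utility function in the agent-theoretic sense ranks outcomes or world-states.

```python
import numpy as np

# Toy contrast (illustrative only): a training loss vs. an agent-theoretic
# utility function. Neither function describes any particular real system.

def cross_entropy_loss(predicted_probs: np.ndarray, target_index: int) -> float:
    """Training-style objective: penalise assigning low probability to the
    observed target. It scores predictions, and says nothing about how
    humans (or anything else) are valued."""
    return float(-np.log(predicted_probs[target_index]))

def toy_utility(world_state: dict) -> float:
    """Utility in the agent-theoretic sense: a ranking over outcomes.
    The key name here is invented purely for illustration."""
    return float(world_state.get("paperclips_produced", 0))

# Minimising the first just fits the training distribution;
# maximising the second would mean steering the world toward outcomes.
preds = np.array([0.1, 0.7, 0.2])
print(cross_entropy_loss(preds, target_index=1))   # ≈ 0.357
print(toy_utility({"paperclips_produced": 3}))     # 3.0
```

The point is only that the objectives we currently train on live in the first category, not the second.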
Yet calling for literalism!
Assuming there is such a coherent entity. And assuming that it is still a problem when the AI is not an agent.
The historic AI doom arguments have a problem: they assume a bunch of things which aren’t necessarily true. And many renditions of them for public consumption have a further problem: they gesture towards these assumptions as though they are widely accepted when they are not. The general public will reject an argument using the term “utility function” because they don’t know what it is; and those knowledgeable about AI will reject it because they do: in their eyes, you are saying something false. But you need to come up with arguments that are valid before you worry about the PR issue.
I’d say there’s a meaningful distinction between literalism and what I’m advocating. I’m not arguing for rigid formalism or abandoning all metaphor. I’m calling for clarity, accessibility, and a prioritisation of core arguments, especially when communicating with people outside the field.
Your first critique concerns my statement that “Current utility functions do not optimise toward values in which humans are treated as morally valuable.” That’s a fair point: I agree it could have been phrased more precisely, for example, “Current AI systems are not necessarily trained to pursue goals that treat humans as morally valuable.” I was using “utility function” loosely to refer to optimisation objectives (e.g., loss functions, reward signals), not in the strict agent-theoretic sense.
But the purpose of my post isn’t to adjudicate whether the conclusions drawn by Yudkowsky and others are true or not. I fully acknowledge that the arguments rest on assumptions and that there’s room for serious debate about their validity. I should (and probably will) think more about that (after my exams).
What I am addressing in this post is a communication issue. Even if we accept the core arguments about the risks of developing powerful misaligned AI systems, such as those based on instrumental convergence and the orthogonality thesis, I believe these risks are often communicated in ways that obscure rather than clarify. This is particularly true when metaphors become the primary framing, which can confuse people who are encountering these ideas for the first time.
So to clarify: I’m not trying to resolve the epistemic status of AI risk claims. I’m making a narrower point about how they’re presented, and how this presentation may hinder public understanding or uptake. That’s the focus of the post.