I think the key issue here is what you take as an “outcome” over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away—but then for any behaviour you can very easily construct totally arbitrary utility functions which that behaviour maximises. At this point I really don’t know what people who talk about coherence arguments on LW are actually defending. But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.
I don’t think utility functions being a poor abstraction for agency in the real world has much bearing on whether there is AI risk. It might change the shape and tenor of the problem, but highly capable agents with alien seed preferences are still likely to be catastrophic to human civilization and human values.
If argument X plays an important role in convincing you of conclusion Y, and also the proponents of Y claim that X is important to their views, then it’s surprising to hear that X has little bearing on Y. Was X redundant all along? Also, you currently state this in binary terms (whether there is AI risk); maybe it’d be clearer to state how you expect your credences to change (or not) based on updates about utility functions.
I think the key issue here is what you take as an “outcome” over which utility functions are defined. If you take states to be outcomes, then trying to model sequential decisions is inherently a mess. If you take trajectories to be outcomes, then this problem goes away
Right, it seems pretty important that utility not be defined over states like that. Besides, relativity tells us that a simple “state” abstraction isn’t quite right.
But broadly speaking, I expect that everything would be much clearer if phrased in terms of reward rather than utility functions, because reward functions are inherently defined over sequential decisions.
I don’t like reward functions, since that implies observability (at least usually it’s taken that way).
I think a reasonable alternative would be to assume that utility is a weighted sum of local value (which is supposed to be similar to reward).
Example 1: reward functions. Utility is a weighted sum over a reward which can be computed for each time-step. You can imagine sliding a little window over the time-series, and deciding how good each step looks. Reward functions are single-step windows, but we could also use larger windows to evaluate properties over several time-steps (although this is not usually important).
Example 2: (average/total) utilitarianism. Utility is a weighted sum over (happiness/etc) of all people. You can imagine sliding a person-sized window over all of space-time, and judging how “good” each view is; in this case, we set the value to 0 (or some other default value) unless there is a person in our window, in which case we evaluate how happy they are (or how much they are thriving, or their preference satisfaction, or what-have-you).
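The two examples above can be sketched as a single “sliding window” scheme. A minimal illustration (the function name, signature, and numbers are all my own assumptions, not anything from the discussion):

```python
# Toy sketch of "utility as a weighted sum of local value".
def window_utility(trajectory, local_value, window=1, weight=lambda t: 1.0):
    """Slide a window over a trajectory, summing weighted local values.

    window=1 with weight(t) = gamma**t recovers ordinary discounted reward;
    larger windows evaluate properties spanning several time-steps.
    """
    total = 0.0
    for t in range(len(trajectory) - window + 1):
        total += weight(t) * local_value(trajectory[t:t + window])
    return total

# Example 1 (reward functions): single-step windows, discounted.
discounted = window_utility([0.0, 1.0, 2.0], lambda w: w[0],
                            weight=lambda t: 0.5 ** t)

# A two-step window, e.g. penalising big jumps between consecutive steps.
smoothness = window_utility([0.0, 1.0, 2.0], lambda w: -abs(w[1] - w[0]),
                            window=2)
```

Example 2 fits the same shape: the “trajectory” becomes a sweep of person-sized regions of space-time, and `local_value` returns the default value unless the window contains a person.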
At this point I really don’t know what people who talk about coherence arguments on LW are actually defending.
One thought I had: it’s true that utility functions had better be a function of all time, not just a frozen state. It’s true that this means we can justify any behavior. The utility-theory hypothesis therefore doesn’t constrain our predictions about behavior. We could well be better off just reasoning about agent policies rather than utility functions.
However, there seems to be a second thing we use utility theory for, namely, representing our own preferences. My complaint about your proposed alternative, “reward”, was that it was not expressive enough to represent preferences I can think of, and which seem coherent (EG, utilitarianism).
So it might be that we’re defending the ability to represent preferences we think we might have.
(Of course, I think even utility functions are too restrictive.)
Another thought I had:
Although utility theory doesn’t strictly rule out any policy, a simplicity assumption over agent beliefs and utility functions yields a very different distribution over actions than a simplicity assumption over policies.
It seems to me that there are cases which are better-represented by utility theory. For example, when predicting what humans do in unusual situations where they have time to think, I expect “simple goals and simple world-models” is going to generalize better than “simple policies”. I suspect this precisely because humans have settled on describing behaviors in terms of goals and beliefs, in addition to habits/norms (which are about policy). If habits/norms did a good enough job of constraining expectations on their own, we probably would not do that.
This also relates to the AI-safety-debate-relevant question of how to model highly capable systems. If your objection to “utility theory” as an answer is “it doesn’t constrain my expectations”, then I can reply “use a simplicity prior”. The empirical claim made by utility theory here is: highly capable agents will tend to have behavior explainable via simple utility functions, as opposed to merely having simple policies.
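The sense in which utility explanations constrain behavior more than raw policies can be made concrete in a toy setting (every detail below is my own construction, chosen only for illustration):

```python
import itertools

# Four states; from state s an agent may jump to any other state. A policy
# is a table mapping each state to a chosen destination.
STATES = range(4)

def greedy_policy(u):
    """The policy induced by utility u: from s, jump to the best other state."""
    return tuple(max((t for t in STATES if t != s), key=lambda t: u[t])
                 for s in STATES)

# All raw policies: 3 choices per state, 3**4 = 81 tables in total.
all_policies = set(itertools.product(*[[t for t in STATES if t != s]
                                       for s in STATES]))

# Policies explainable by *some* utility function over states: far fewer,
# because a single ranking of states forces consistent choices everywhere.
utility_policies = {greedy_policy(u) for u in itertools.permutations(range(4))}
```

Here only 12 of the 81 policies are greedy for any utility function, so putting prior mass on simple utilities (rather than spreading it over all policy tables) concentrates predictions on behavior that coherently pursues a goal across situations.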
OK, but then, what is the argument for this claim? Certainly not the usual coherence arguments?
Well, I’m not sure. Maybe we can modify the coherence arguments to have simplicity assumptions run through them as well. Maybe not.
What I feel more confident about is that the simplicity assumption embodies the content of the debate (or at least an important part of the content).
relativity tells us that a simple “state” abstraction isn’t quite right
Hmm, this sentence feels to me like a type error. It doesn’t seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)
I don’t like reward functions, since that implies observability (at least usually it’s taken that way).
Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don’t see why reward functions aren’t expressive enough to encode utilitarianism—just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.
I expect “simple goals and simple world-models” is going to generalize better than “simple policies”.
I think we’re talking about different debates here. I agree with the statement above—but the follow-up debate which I’m interested in is “utility theory” versus “a naive conception of goals and beliefs” (in philosophical parlance, the folk theory), and so this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency? Here’s one example: utility theory says that deontological goals are very complicated. To me, it seems like folk theory wins this one, because lots of people have pretty deontological goals. Or another example: utility theory says that there’s a single type of entity to which we assign value. Folk theory doesn’t have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).
To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like “actually obey humans” or “tile the universe with tiny humans saying ‘good job’”) I’m not sure how much utility theory adds.
Hmm, this sentence feels to me like a type error. It doesn’t seem like the way we reason about agents should depend on the fundamental laws of physics. If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics. (Another way of saying this is that agents are at a much higher level of abstraction than relativity.)
True, but states aren’t at a much higher level of abstraction than relativity… states are a way to organize a world-model, and a world-model is a way of understanding the world.
From a normative perspective, relativity suggests that there’s ultimately going to be something wrong with designing agents to think in states; states make specific assumptions about time which turn out to be restrictive.
From a descriptive perspective, relativity suggests that agents won’t convergently think in states, because doing so doesn’t reflect the world perfectly.
The way we think about agents shouldn’t depend on how we think about physics, but it accidentally did, in that we baked linear time into some agent designs. So the reason relativity is able to say something about agent design, here, is because it points out that some agent designs are needlessly restrictive, and rational agents can take more general forms (and probably should).
This is not an argument against an agent carrying internal state, just an argument against using POMDPs to model everything.
Also, it’s pedantic; if you give me an agent model in the POMDP framework, there are probably more interesting things to talk about than whether it should be in the POMDP framework. But I would complain if POMDPs were a central assumption needed to prove a significant claim about rational agents, or something like that. (To give an extreme example, if someone used POMDP-agents to argue against the rationality of assenting to relativity.)
Hmm, you mean that reward is taken as observable? Yeah, this does seem like a significant drawback of talking about rewards. But if we assume that rewards are unobservable, I don’t see why reward functions aren’t expressive enough to encode utilitarianism—just let the reward at each timestep be net happiness at that timestep. Then we can describe utilitarians as trying to maximise reward.
I would complain significantly less about this, yeah. However, the relativity objection stands.
I think we’re talking about different debates here. I agree with the statement above—but the follow-up debate which I’m interested in is “utility theory” versus “a naive conception of goals and beliefs” (in philosophical parlance, the folk theory), and so this actually seems like a point in favour of the latter. What does utility theory add to the folk theory of agency?
To state the obvious, it adds formality. For formal treatments, there isn’t much of a competition between naive goals and utility theory: utility theory wins by default, because naive goal theory doesn’t show up to the debate.
If I thought “goals” were a better way of thinking than “utility functions”, I would probably be working on formalizing goal theory. In reality, though, I think utility theory is essentially what you get when you try to do this.
Here’s one example: utility theory says that deontological goals are very complicated. To me, it seems like folk theory wins this one, because lots of people have pretty deontological goals.
So, my theory is not that it is always better to describe realistic agents as pursuing (simple) goals. Rather, I think it is often better to describe realistic agents as following simple policies. It’s just that simple utility functions are a good explanation often enough that I want to also think in those terms.
Deontological ethics tags actions as good and bad, so, it’s essentially about policy. So, the descriptive utility follows from the usefulness of the policy view. [The normative utility is less obvious, but, there are several reasons why this can be normatively useful; eg, it’s easier to copy than consequentialist ethics, it’s easier to trust deontological agents (they’re more predictable), etc.]
To state it a little more thoroughly:
A good first approximation is the prior where agents have simple policies. (This is basically treating agents as regular objects, and investigating the behavior of those objects.)
Many cases where that does not work well are handled much better by assuming simple utility functions and simple beliefs. So, it is useful to sloppily combine the two.
An even better combination of the two conceives of an agent as a model-based learner who is optimizing a policy. This combines policy simplicity with utility simplicity in a sophisticated way. Of course, even better models are also possible.
Or another example: utility theory says that there’s a single type of entity to which we assign value. Folk theory doesn’t have a type system for goals, and again that seems more accurate to me (we have meta-goals, etc).
I’m not sure what you mean, but I suspect I just agree with this point. Utility functions are bad because they require an input type such as “worlds”. Utility theory, on the other hand, can still be saved, by considering expectation functions (which can measure the expectation of arbitrary propositions, linear combinations of propositions, etc). This allows us to talk about meta-goals as expectations-of-goals (“I don’t think I should want pizza”).
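A minimal Jeffrey-style sketch of such an expectation (desirability) operator, entirely my own construction: the finite model below still uses “worlds” under the hood purely for concreteness, with a proposition represented as a set of them.

```python
# Probability and local value of each world (illustrative numbers).
prob  = {"w1": 0.2, "w2": 0.5, "w3": 0.3}
value = {"w1": 10.0, "w2": 0.0, "w3": 4.0}

def desirability(prop):
    """Expected value conditional on a proposition (set of worlds) being true."""
    p = sum(prob[w] for w in prop)
    return sum(prob[w] * value[w] for w in prop) / p

# The proposition "I eat pizza", true in w1 and w3:
pizza = {"w1", "w3"}
d = desirability(pizza)
```

The operator applies to arbitrary propositions rather than only fully-specified worlds; a meta-goal like “I shouldn’t want pizza” would then be a proposition about one’s own desires, evaluated by the same machinery.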
To be clear, I do think that there are a bunch of things which the folk theory misses (mostly to do with probabilistic reasoning) and which utility theory highlights. But on the fundamental question of the content of goals (e.g. will they be more like “actually obey humans” or “tile the universe with tiny humans saying ‘good job’”) I’m not sure how much utility theory adds.
Again, it would seem to add formality, which seems pretty useful.
Here are two ways to relate to formality. Approach 1: this formal system is much less useful for thinking about the phenomenon than our intuitive understanding, but we should keep developing it anyway because eventually it may overtake our intuitive understanding.
Approach 2: by formalising our intuitive understanding, we have already improved it. When we make arguments about the phenomenon, using concepts from the formalism is better than using our intuitive concepts.
I have no problem with approach 1; most formalisms start off bad, and get better over time. But it seems like a lot of people around here are taking the latter approach, and believe that the formalism of utility theory should be the primary lens by which we think about the goals of AGIs.
I’m not sure if you defend the latter. If you do, then it’s not sufficient to say that utility theory adds formalism, you also need to explain why that formalism is net positive for our understanding. When you’re talking about complex systems, there are plenty of ways that formalisms can harm our understanding. E.g. I’d say behaviourism in psychology was more formal and also less correct than intuitive psychology. So even though it made a bunch of contributions to our understanding of RL, which have been very useful, at the time people should have thought of it using approach 1 not approach 2. I think of utility theory in a similar way to how I think of behaviourism: it’s a useful supplementary lens to see things through, but (currently) highly misleading as a main lens to see things like AI risk arguments through.
If I thought “goals” were a better way of thinking than “utility functions”, I would probably be working on formalizing goal theory.
See my point above. You can believe that “goals” are a better way of thinking than “utility functions” while still believing that working on utility functions is more valuable. (Indeed, “utility functions” seem to be what “formalising goal theory” looks like!)
Utility theory, on the other hand, can still be saved
Oh, cool. I haven’t thought about the Jeffrey-Bolker approach enough to engage with it here, but I’ll tentatively withdraw this objection in the context of utility theory.
From a descriptive perspective, relativity suggests that agents won’t convergently think in states, because doing so doesn’t reflect the world perfectly.
I still strongly disagree (with what I think you’re saying). There are lots of different problems which agents will need to think about. Some of these problems (which involve relativity) are more physically fundamental. But that doesn’t mean that the types of thinking which help solve them need to be more mentally fundamental to our agents. Our thinking doesn’t reflect relativity very well (especially on the intuitive level which shapes our goals the most), but we manage to reason about it alright at a high level. Instead, our thinking is shaped most to be useful for the types of problems we tend to encounter at human scales; and we should expect our agents to also converge to thinking in whatever way is most useful for the majority of problems which they face, which likely won’t involve relativity much.
(I think this argument also informs our disagreement about the normative claim, but that seems like a trickier one to dig into, so I’ll skip it for now.)
If agents think in terms of states, then our model of agent goals should involve states regardless of whether that maps onto physics.
Realistic agents don’t have the option of thinking in terms of detailed world states anyway, so the relativistic objection is the least of their worries.