I also believe that we’re not necessarily that far from having a usable definition of goal-directedness (based on the thing I presented at AISU, changed according to your feedback), but I know that not everyone agrees.
Even with a significantly improved definition of goal-directedness, I think we’d be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
But I’m happy to include your approach in the final document!
Even if I’m wrong about being able to formalize goal-directedness, I’m pretty convinced that the cluster of intuitions around goal-directedness is what should be applied to a definition of mesa-optimization, because I expect inner alignment problems beyond inner optimizers.
Can you elaborate on this?
To be clear on whether I understand you, do you mean that even with very limited error, a mesa-optimizer/goal-directed agent could bide its time and use a single well-placed action to make a catastrophic treacherous turn?
Right. Low total error for, eg, imitation learning, might be associated with catastrophic outcomes. This is partly due to the way imitation learning is readily measured in terms of predictive accuracy, when what we really care about is expected utility (although we can’t specify our utility function, which is one reason we may want to lean on imitation, of course).
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we’re bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
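A toy illustration of that gap, as a minimal sketch only: the state names, utilities, and probabilities below are made up for the example, and the point is just that a model can score well on the distribution we measure while being much worse on the one we actually face.

```python
def expected_utility(utility, dist):
    """Average utility of a fixed policy under a distribution over states."""
    return sum(p * utility[s] for s, p in dist.items())

# Hypothetical per-state utilities of one fixed policy.
utility = {"ordinary": 1.0, "rare_trigger": -1000.0}

# The distribution we can measure against never contains the trigger state...
measured = {"ordinary": 1.0, "rare_trigger": 0.0}
# ...but the real world (by assumption) hits it 1% of the time.
real = {"ordinary": 0.99, "rare_trigger": 0.01}

print("measured:", expected_utility(utility, measured))
print("real:", expected_utility(utility, real))
```

The measured average looks perfect, while the real-world average is dominated by the one state the measurement never sampled.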
That being said, wouldn’t the mesa-optimizer still have to pay the price of maintaining two goals at all times, and of keeping track of what things mean in relation to both? Or are you arguing that this mutual information means that the mesa-optimizer will already be modeling many goal-directed systems, and so can just reuse that knowledge/information?
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I’d be happy to hear it, but “systems performing complex tasks in complex environments have to pay that cost anyway” seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
I don’t understand what you’re saying. Your first sentence seems to point out that humans are constant-time, not linear-time. An algorithm for a fixed-size input is constant time, after all.
I meant time as a function of data (I’m not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you’d need O(X).
A memoryless alg could be constant time; IE, even though you have an X-long history, you just need to feed it the most recent thing, so its response time is not a function of X. Similarly with finite context windows.
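To make the "as a function of data" framing concrete, here is a rough sketch that counts how many past observations each kind of responder has to read to produce one response after X data. The function names and the window size of 8 are hypothetical, chosen only for illustration; this is not a model of a brain or of any particular system.

```python
def history_dependent_cost(history):
    """Reads every past observation once, so the cost grows as O(X)."""
    steps = 0
    for _ in history:
        steps += 1
    return steps

def memoryless_cost(history):
    """Reads only the most recent observation, so the cost is O(1) in X."""
    return 1 if history else 0

def windowed_cost(history, window=8):
    """Reads at most `window` recent observations, so the cost is O(1) in X."""
    return len(history[-window:])

for x in (10, 100, 1000):
    data = list(range(x))
    print(x, history_dependent_cost(data), memoryless_cost(data), windowed_cost(data))
```

Only the first count grows with X; the other two stay bounded however long the history gets.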
The issue here is that we don’t have a scaled version of the algorithms humans are solving (analogous to generalized games). So we can’t discuss the asymptotic complexity of human-brain algorithms.
I agree that in principle we could decode the brain’s algorithms and say “actually, that’s quadratic time” or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don’t think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there’s nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
But maybe you actually have an argument related to that which I missed?
I think the crux here is what we’re measuring runtime as-a-function-of. LMK if you still think something else is going on.
About penalizing time complexity, I really like this part of RLO (which is sadly missed by basically everyone as it’s not redescribed in the intro or conclusion):
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but I didn’t find a place where it felt super appropriate to discuss it.
One point about the time/description complexity penalty that I feel you don’t point out enough is that even if there were a threshold under which mesa-optimization doesn’t appear, maybe it’s just too low to be competitive. That’s my main internal reason to doubt complexity penalties as a solution to the emergence of mesa-optimizers.
Right. I just didn’t discuss this due to wanting to get this out as a quick sketch of where I’m going.
Even with a significantly improved definition of goal-directedness, I think we’d be pretty far from taking arbitrary code/NNs and evaluating their goals. Definitions resembling yours require an environment to be given; but this will always be an imperfect environment-model. Inner optimizers could then exploit differences between that environment-model and the true environment to appear benign.
Oh, definitely. I think a better definition of goal-directedness is a prerequisite to be able to do that, so it’s only the first step. That being said, I think I’m more optimistic than you on the result, for a couple of reasons:
One way I imagine the use of a definition of goal-directedness is to filter against very goal-directed systems. A good definition (if it’s possible) should clarify whether systems with low goal-directedness can be competitive, as well as the consequences of different parts and aspects of goal-directedness. You can see that as a sort of analogy to the complexity penalties, although it might risk being similarly uncompetitive.
One hope with a definition we can actually toy with is to find some properties of the environments and the behavior of the systems that 1) capture a lot of the information we care about and 2) are easy to abstract. Something like what Alex has done for his POWER-seeking results, where the relevant aspects of the environment are the symmetries it contains.
Even arguing for your point, that evaluating goals and/or goal-directedness of actual NNs would be really hard, is made easier by a deconfused notion of goal-directedness.
Can you elaborate on this?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
But even if we measure quality-of-model in terms of expected utility, we can still have a problem, since we’re bound to measure average expected utility wrt some distribution, so utility could still be catastrophic wrt the real world.
Maybe irrelevant, but this makes me think of the problem with defining average complexity in complexity theory. You can prove things for some distributions over instances of the problem, but it’s really difficult to find a distribution that captures the instances you will meet in the real world. This means that you tend to be limited to worst-case reasoning.
One cool way to address that is through smoothed complexity: the complexity for an instance x is the expected complexity over the distribution on instances created by adding some Gaussian noise to x. I wonder if we can get some guarantees like that, which might improve over worst-case reasoning.
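For reference, one standard way to write this down, following the usual smoothed-analysis setup. The symbols T (the cost of running the algorithm on an instance), σ (the noise scale), and g (the noise vector) are my notation, not something defined earlier in this thread:

```latex
% Smoothed cost of a single instance x: average cost over Gaussian perturbations of x.
T_{\mathrm{smooth}}(x, \sigma) \;=\; \mathbb{E}_{g \sim \mathcal{N}(0, I)}\bigl[\, T(x + \sigma g) \,\bigr]

% Smoothed complexity at input size n: worst case over instances of the smoothed cost.
C_{\mathrm{smooth}}(n, \sigma) \;=\; \max_{x \in \mathbb{R}^n,\ \|x\| \le 1} \; T_{\mathrm{smooth}}(x, \sigma)
```

As σ goes to 0 this recovers ordinary worst-case complexity, and as σ grows the noise swamps x and the measure approaches an average-case one, so σ interpolates between the two regimes.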
Right. If you have a proposal whereby you think (malign) mesa-optimizers have to pay a cost in some form of complexity, I’d be happy to hear it, but “systems performing complex tasks in complex environments have to pay that cost anyway” seems like a big problem for arguments of this kind. The question becomes where they put the complexity.
Agreed. I don’t have such a story, but I think this is a good reframing of the crux underlying this line of argument.
I meant time as a function of data (I’m not sure how else to quantify complexity here). Humans have a basically constant reaction time, but our reactions depend on memory, which depends on our entire history. So to simulate my response after X data, you’d need O(X).
For whatever reason, I thought about complexity depending on the size of the brain, which is really weird. But as complexity depending on the size of the data, I guess this makes more sense? I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2)?), but I don’t think it’s that important.
I agree that in principle we could decode the brain’s algorithms and say “actually, that’s quadratic time” or whatever; EG, quadratic-in-size-of-working-memory or something. This would tell us something about what it would mean to scale up human intelligence. But I don’t think this detracts from the concern about algorithms which are linear-time (and even constant-time) as a function of data. The concern is essentially that there’s nothing stopping such algorithms from being faithful-enough human models, which demonstrates that they could be mesa-optimizers.
Agreed that this is a pretty strong argument that complexity doesn’t preclude mesa-optimizers.
I actually struggled with where to place this in the text. I wanted to discuss the double-edged-sword thing, but I didn’t find a place where it felt super appropriate to discuss it.
Maybe in “Why this doesn’t seem to work” for pure computational complexity?
What I mean is that when I think about inner alignment issues, I actually think of learned goal-directed models instead of learned inner optimizers. In that context, the former includes the latter. But I also expect that relatively powerful goal-directed systems can exist without a powerful simple structure like inner optimization, and that we should also worry about those.
That’s one way in which I expect deconfusing goal-directedness to help here: by replacing a weirdly-defined subset of the models we should worry about by what I expect to be the full set of worrying models in that context, with a hopefully clean definition.
Ah, on this point, I very much agree.
I’m not sure why piling on more data wouldn’t make the reliance on memory more difficult (so something like O(X^2)?), but I don’t think it’s that important.
I was treating the brain as fixed in size, and so as having some upper bound on memory. Naturally this isn’t quite true in practice (for all we know, healthy million-year-olds might have measurably larger heads if they existed, due to slow brain growth), but either way this seems like a technicality.