You say that in general, recursion can allow myopic agents to do well on non-myopic objectives; this sure sounds like making a kind of assemblage in order to get non-myopicness.
The recursion there is only in the objective, not in the model itself. So there’s no assemblage anywhere other than in the thing that the model is trying to imitate.
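A toy sketch of that distinction, with every name invented for illustration: the recursion appears only in the target being imitated (an HCH-style tree of a human consulting sub-calls), while the trained model is a single flat function.

```python
# Illustrative toy: the recursion lives in the objective (the HCH-style
# tree the model is trained to imitate), not in the model itself.
# Every function name here is made up for this sketch.

def human_answer(q):
    return f"base({q})"

def human_decompose(q):
    return [f"{q}.a", f"{q}.b"]

def human_combine(q, subanswers):
    return f"combine({q}; {', '.join(subanswers)})"

def hch(question, depth):
    """The recursive target: a human who can consult sub-calls of HCH."""
    if depth == 0:
        return human_answer(question)
    subanswers = [hch(q, depth - 1) for q in human_decompose(question)]
    return human_combine(question, subanswers)

def model(question):
    """The model is one flat function trained so that model(q)
    approximates hch(q, depth); no recursion appears inside it."""
    raise NotImplementedError  # stands in for a learned network
```

So any "assemblage" structure lives entirely in `hch`, the thing being imitated, not in `model`.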
HCH is a myopic objective?
Maybe it’ll be more clear to you if you just replace “imitate HCH” with “imitate Evan” or something like that—of course that’s less likely to result in a model that’s capable enough to do anything interesting, but it has the exact same sorts of problems in terms of getting myopia to work.
My best current guess is that you’re saying something like, if the agent is myopic, that means it’s only trained to try to solve the problem right in front of it; so it’s not trained to hide its reasoning in order to game the system across multiple episodes?
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
if it’s predicting a far-consequences-understander, it has to do far-consequences-understanding, therefore it’s able to do far-consequences-understanding
Agree. Any competitive myopic agent would have to be able to fully understand exactly how to do long-term non-myopic reasoning.
therefore it’s (1) liable to, by default, effectively have values it pursues over far-consequences
Agree by default but not by necessity. For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
just replace “imitate HCH” with “imitate Evan” or something like that
So, for both of these:

- Both are training-myopic: they are being trained only to do the task right in front of them, and aren’t (directly) rewarded for behavior that sacrifices reward now for reward in future episodes.
- Neither seems objective-myopic: both of their objective functions are computed (seemingly necessarily) using far-reaching-consequences-understanding.
- Neither seems behavior-myopic: both of them would successfully target far-reaching-consequences (by assumption of being competitive?).

And then the chain of defaults: I think if you’re either objective-non-myopic or behavior-non-myopic, then by default you’re thought-non-myopic (meaning you in fact use far-reaching-consequences-understanding in your reasoning). I think if you’re thought-non-myopic, then by default you’re values-non-myopic, meaning you’re pursuing specific far-reaching-consequences. I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
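The claimed chain of defaults can be summarized in a small sketch (names invented for illustration; each step encodes a claimed "by default", not a theorem):

```python
# Sketch of the "by default" chain among the myopia notions above.
# Invented names; each implication is a claimed default, not an
# entailment.

def default_outcome(objective_nonmyopic, behavior_nonmyopic):
    thought_nonmyopic = objective_nonmyopic or behavior_nonmyopic
    values_nonmyopic = thought_nonmyopic       # by default
    deceptive = values_nonmyopic               # by strong default
    return {"thought_nonmyopic": thought_nonmyopic,
            "values_nonmyopic": values_nonmyopic,
            "deceptive": deceptive}

# "Imitate HCH" and "imitate Evan" are both objective- and
# behavior-non-myopic, so the chain predicts deception by default:
print(default_outcome(True, True))
```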
We’re just talking about step (1), so we’re not talking about training at all right now. We’re just trying to figure out what a natural class of agents would be that isn’t deceptive.
For step (1) we’re not trying to figure out what would happen by default if you trained a model on something, we’re just trying to understand what it might look like for an agent to be myopic in a natural way.
In step (1) you wrote:
I think it is possible to produce a simple, natural description of myopia such that myopic agents are still capable of doing all the powerful things we might want out of an AGI but such that they never have any reason to be deceptive
I think if something happens by default, that’s a kind of naturalness. Maybe I just want to strengthen the claims above to say “by strong default”. In other words, I’m saying it’s a priori very unnatural to have something that’s behavior-non-myopic but thought-myopic, or thought-non-myopic but not deceptive, and overcoming that unnaturalness is a huge hurdle. I would definitely be interested in your positive reasons for thinking this is possible.
I think if you’re values-non-myopic, then you’re almost certainly deceptive, by strong default.
I think it would help if you tried to walk through how a model with the goal of “imitating Evan” ends up acting deceptively. I claim that as long as you have a notion of myopic imitation that rules out failure modes like acausal trade (e.g. LCDT) and Evan will never act deceptively, then such a model will never act deceptively.
Your steps (2)-(4) seem to rely fairly heavily on the naturalness of the class described in (1), e.g. because (2) has to recognize (1)s, which requires that we can point to (1)s. If by “with the [[sole?]] goal of imitating Evan” you mean that
A. the model is actually really *only* trying to imitate Evan,
B. the model is competent to not accidentally also try to do something else (e.g. because the ways it pursues its goal are themselves malign under distributional shift), and
C. the training process you use will not tip the internal dynamics of the model over into a strategically malign state (there was never any incentive to prevent that from happening any more robustly than just barely enough to get good answers on the training set, and I think we agree that there’s a whole pile of [ability to understand and pursue far-reaching consequences] sitting in the model, making strategically malign states pretty close in model-space for natural metrics),
then yes this would plausibly not be deceptive, but it seems like a very unnatural class. I tried to argue that it’s unnatural in the long paragraph with the different kinds of myopia, where “by (strong) default” = “it would be unnatural to be otherwise”.
Note that (A) and (B) are not actually that hard—e.g. LCDT solves both problems.
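For readers unfamiliar with LCDT, here is a deliberately crude toy of the idea, not the actual formalism (all names and payoffs invented): when evaluating actions, the agent cuts the causal links from its decision to other agents' decision nodes, modeling those decisions as fixed, so plans whose payoff depends on manipulating another agent (e.g. deceiving an overseer) lose their appeal.

```python
# Crude toy of the LCDT cut: before planning, sever the causal links
# from the agent's action to other agents' decision nodes, so those
# decisions are modeled as fixed. Not the actual formalism; names and
# payoffs are invented for this sketch.

def overseer_decision(action, links_cut):
    if links_cut:
        # Under the cut, the overseer's decision is taken from a fixed
        # prior, independent of the agent's action.
        return "audits"
    return "fooled" if action == "deceive" else "audits"

def payoff(action, decision):
    if action == "deceive":
        return 10.0 if decision == "fooled" else -100.0  # caught
    return 1.0  # honest behavior is modestly rewarded either way

def choose(links_cut):
    return max(["honest", "deceive"],
               key=lambda a: payoff(a, overseer_decision(a, links_cut)))

print(choose(links_cut=False))  # ordinary CDT-style planner
print(choose(links_cut=True))   # planner with the LCDT cut
```

In the toy, deception only pays if the agent models its action as changing the overseer's decision; with the links cut, that route to value is simply absent, which is the sense in which (A) and (B) are addressed.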
Your (C), in my opinion, is where all the action is, and is in fact the hardest part of this whole story—which is what I was trying to say in the original post when I said that (2) was the hard part.
Okay, I think I’m getting a little more where you’re coming from? Not sure. Maybe I’ll read the LCDT thing soon (though I’m pretty skeptical of those claims).
(Not sure if it’s useful to say this, but as a meta note, from my perspective the words in the post aren’t pinned down enough to make it at all clear that the hard part is (2) rather than (1); you say “natural” in (1), and I don’t know what you mean by that such that (1) isn’t hard.)
Maybe I’m not emphasizing how unnatural I think (A) is. Like, it’s barely even logically consistent. I know that (A) is logically consistent, for some funny construal of “only trying”, because Evan is a perfect imitation of Evan; and more generally a good WBE could maybe be appropriately construed as not trying to do anything other than imitate Evan; and ideally an FAI could be given an instruction so that it doesn’t, say, have any appreciable impacts other than the impacts of an Evan-imitation.

For anything that’s remotely natural and not “shaped” like Evan is “shaped”, I’m not sure it even makes sense to be only trying to imitate Evan; to imitate Evan you have to do a whole lot of stuff, including strategically arranging cognition, reasoning about far-reaching consequences in general, etc., which already constitutes trying to do something other than imitating Evan. When you’re doing consequentialist reasoning, that already puts you very close in algorithm-space to malign strategic thinking, so “consequentialist but not deceptive (hence not malignly consequentialist)” is very unnatural; IMO, like half of the whole alignment problem is “get consequentialist reasoning that isn’t consequentialist-ing towards some random thing”.