The key idea, in the case of HCH, would be to direct that optimization towards the goal of producing an action that is maximally close to what HCH would do.
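(To make that objective concrete, here is a minimal sketch of what a per-episode imitation loss might look like. The names `model`, `observation`, and `hch_action` are hypothetical stand-ins I'm introducing for illustration; none of them come from the discussion itself. The point is only that nothing in the loss rewards influencing anything beyond the current action.)

```python
# Minimal sketch (illustrative only): a per-episode imitation objective.
# `model`, `observation`, and `hch_action` are hypothetical stand-ins.
import torch.nn.functional as F

def myopic_imitation_loss(model, observation, hch_action):
    """Score each action in isolation against HCH's action; there is no
    cross-episode return, so nothing rewards influencing future episodes."""
    logits = model(observation)                 # action logits for this step only
    return F.cross_entropy(logits, hch_action)  # distance to HCH's chosen action
```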
Why do you expect this to be any easier than directing that optimisation towards the goal of “doing what the human wants”? In particular, if you train a system on the objective “imitate HCH”, why wouldn’t it just end up with the same long-term goals as HCH has? That seems like a much more natural thing for it to learn than the concept of imitating HCH, because in the process of imitating HCH it still has to do long-term planning anyway.
(I feel like this is basically the same set of concerns/objections that I raised in this post. I also think that myopia is a fairly central example of the thing that Eliezer was objecting to with his “water” metaphor in our dialogue, and I endorse his objection in this context.)
Why do you expect this to be any easier than directing that optimisation towards the goal of “doing what the human wants”? In particular, if you train a system on the objective “imitate HCH”, why wouldn’t it just end up with the same long-term goals as HCH has?
To be clear, I was only talking about (1) here, which is just about what it might look like for an agent to be myopic, not how to actually get an agent that satisfies (1). I agree that you would most likely get a proxy-aligned model if you just trained on “imitate HCH”; but just training on “imitate HCH” is definitely not the plan. See (2), (3), (4), and (5) for how we actually get an agent that satisfies (1).
In terms of the ease or naturalness of getting (1): all we need from (1) is for our concept of myopia not to cost so many bits that it becomes too unnatural for (2), (3), and (4) to work, not for myopia to be the most natural thing you get if all you do is train on imitative amplification.
That all makes sense. But I skimmed (2), (3), (4), and (5), and they don’t seem to explain why myopia is significantly more natural than “obey humans”?
I mean, that’s because this is just a sketch, but a simple argument for why myopia is more natural than “obey humans” is that if we don’t care about competitiveness, we already know how to build myopic optimizers, whereas we don’t know how to build an optimizer to “obey humans” at any level of capabilities.
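(For instance, here is the standard construction, sketched with hypothetical `actions` and `immediate_reward` placeholders: a greedy optimizer with an effective discount factor of zero, which by construction never trades present reward for future influence. There is no analogous known construction for “obey humans”.)

```python
# Sketch of a myopic optimizer we already know how to build: pure greedy
# one-step optimization (discount factor zero). `actions` and
# `immediate_reward` are hypothetical placeholders.

def myopic_choice(state, actions, immediate_reward):
    """Pick the action maximizing this step's reward alone; consequences
    beyond the current step carry zero weight in the objective."""
    return max(actions, key=lambda a: immediate_reward(state, a))
```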
Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency. I suspect we can get much better upper bounds on the complexity than that, though.
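(Roughly, and using my own toy encoding rather than the formalism from the LCDT post: an LCDT agent evaluates its decision in a modified causal graph in which no causal path runs from its decision to any node it models as an agent. All the hard-to-specify content is then pushed into the set `agent_nodes`, which is the sense in which the reduction bottoms out in the complexity of specifying agency.)

```python
# Toy rendering (my own, not the authors' formalism) of the LCDT decision
# rule as surgery on a causal DAG: sever every edge through which the
# decision could influence a node modelled as an agent.
import networkx as nx

def lcdt_surgery(world: nx.DiGraph, decision_node, agent_nodes):
    """Return the graph an LCDT agent uses to evaluate its decision: after
    surgery, no causal path runs from the decision to any agent node, so
    other agents' behaviour is treated as fixed at the prior."""
    downstream = nx.descendants(world, decision_node) | {decision_node}
    pruned = world.copy()
    for u, v in list(world.edges()):
        if v in agent_nodes and u in downstream:
            pruned.remove_edge(u, v)
    return pruned
```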
Furthermore, LCDT is a demonstration that we can at least reduce the complexity of specifying myopia to the complexity of specifying agency.
It’s an interesting idea, but are you confident that LCDT actually works? E.g. have you thought more about the issues I talked about here and concluded they’re not serious problems?
I still don’t see how we could get e.g. an HCH simulator without agentic components (or without the simulator itself qualifying as an agent). As soon as an LCDT agent expects that it may create agentic components in its simulation, it’s going to reason horribly about them, e.g. assuming that no adjustment it makes to other parts of its simulation can possibly impact their existence or behaviour, relative to the prior.
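(Using the toy graph surgery sketched above, and reusing the hypothetical `lcdt_surgery` helper, the failure mode might look like this: the decision is supposed to build an agentic component of the simulation, but the surgery cuts exactly the edge through which the decision determines that component.)

```python
# Toy illustration (my construction) of the failure mode described above.
world = nx.DiGraph([
    ("decision", "sim_component"),   # the plan shapes part of the simulation
    ("sim_component", "sub_agent"),  # ...which instantiates an agentic part
    ("sub_agent", "output"),
])
pruned = lcdt_surgery(world, "decision", agent_nodes={"sub_agent"})
# The edge into the sub-agent is gone: the agent now predicts the sub-agent's
# existence and behaviour as fixed at the prior, however it builds the rest
# of the simulation.
assert ("sim_component", "sub_agent") not in pruned.edges()
```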
I think LCDT does successfully remove the incentives you’re aiming to remove. I just expect it to be too broken to do anything useful. I can’t currently see how we could get the good parts without the brokenness.
What are you referring to here?