Thanks for the reply. Just to prevent us from spinning our wheels too much, I’m going to start labelling specific agent designs, since it seems like some talking-past-each-other may be happening where we’re thinking of agents that work in different ways when making our points.
Sounds good.
Comments on ThermodynamicBot
This bot is of course a bounded agent, and so the world model can’t be perfect, but consider the following steps: [...] This is a finite sequence of computational steps that terminates without self-reference, so no logical induction is needed here. Now you may fairly object that there is still an issue of computational complexity. The space of histories is exponentially large, so in practice the computation couldn’t be completed in time. This is the known-to-be-hard problem of computing the partition function. But the problem is tractable in many special cases, and humans get by well enough in our own reasoning about a world full of combinatorial explosions. We can suppose that, at the cost of making itself even more of an approximation, the world model has a way to efficiently sample from the distribution, even given the difficulty of computing Z. To take one particular concrete way this could be implemented, if the world model is a factor graph with few or no loops, then we can do the computation by adding on a few factors to account for exp(U(h)/T) and then using belief propagation to solve it.
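To make the basic computation concrete, here is a minimal sketch (toy history space, made-up utilities; the function names are just ones I'm inventing for illustration) of the exp(U(h)/T)-weighted sampling described above, done the naive enumerate-everything way rather than with belief propagation:

```python
import math
import random

# Toy sketch of the exp(U(h)/T) weighting and the partition function Z,
# computed the naive way over a tiny, fully enumerated history space.
# All names and utilities here are made up for illustration.

def boltzmann_distribution(histories, U, T):
    """Return P(h) proportional to exp(U(h)/T) over a finite set of histories."""
    weights = [math.exp(U(h) / T) for h in histories]
    Z = sum(weights)  # partition function
    return [w / Z for w in weights]

def sample_history(histories, U, T):
    probs = boltzmann_distribution(histories, U, T)
    return random.choices(histories, weights=probs, k=1)[0]

histories = ["do_task", "idle", "do_task_badly"]
utility = {"do_task": 1.0, "idle": 0.0, "do_task_badly": -1.0}

for T in (10.0, 1.0, 0.1):
    print(T, boltzmann_distribution(histories, utility.get, T))
```

As T -> 0 the distribution concentrates on the argmax history; at higher T it flattens toward uniform, which is the sense in which this is a softened version of argmax planning.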
If we assume that the agent is making decisions by (approximately) plugging every possible h into U(h) and picking based on (the partition function derived from) that, then of course you need U(h) to be adversarially robust! I disagree with that as a model of how planning works or should work. IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible, but even if it were feasible it would be a foreseeably-broken (because fragile) planning strategy.
Quote from a comment of TurnTrout about argmax planning, though I think it also applies to ThermodynamicBot, since that just does a softened version of argmax planning (converging to argmax planning as T->0):
if you design an AI that doesn’t argmax (or do its best to approximate argmax), and someone else builds one that does, your AI will lose a fair fight (e.g., economic competition starting from equal capabilities and resource endowments), so it would be nice if alignment doesn’t mean giving up argmax.
This seems exactly backwards to me. Argmax violates the non-adversarial principle and wastes computation. Argmax requires you to spend effort hardening your own utility function against the effort you’re also expending searching across all possible inputs to your utility function (including the adversarial inputs!). For example, if I argmaxed over my own plan-evaluations, I’d have to consider the most terrifying-to-me basilisks possible, and rate none of them unusually highly. I’d have to spend effort hardening my own ability to evaluate plans, in order to safely consider those possibilities.
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
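A toy illustration of the fragility point, with made-up plan names and scores: a single plan that the evaluator spuriously rates astronomically high is enough to hijack an exhaustive argmax, whereas a planner that only ever scores a restricted candidate set never even queries the evaluator on the bad input:

```python
# Toy illustration with made-up plan names and scores: a single input that the
# evaluator spuriously rates astronomically high hijacks exhaustive argmax,
# while a planner that only scores a restricted candidate set never even
# queries the evaluator on it.

def evaluate(plan):
    scores = {"do_task": 1.0, "idle": 0.0, "weird_garbage_plan": 1e6}  # one bad evaluation
    return scores[plan]

all_plans = ["do_task", "idle", "weird_garbage_plan"]
generated_candidates = ["do_task", "idle"]  # the plans the agent actually bothers to consider

print(max(all_plans, key=evaluate))             # -> weird_garbage_plan
print(max(generated_candidates, key=evaluate))  # -> do_task
```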
I agree with your predictions for what would happen here. ThermodynamicBot isn’t really a GAN, though. If we tried to train a GAN to sub in for ThermodynamicBot’s world model, then we could do it in two ways (the second way is most similar to your proposal). In both cases, the generator produces candidate histories, h.
Right, I wasn’t thinking of it as actually a GAN, just giving an analogy where similar causal patterns are in play, to make my point clearer. But yeah, if we wanted to actually use a GAN, your suggestions sound reasonable.
Comments on PolicyGradientBot & General Position
So, the standard criticism of policy gradient is that it’s noisy and doesn’t really allow for credit assignment. In particular, I think the lack of credit assignment is really a crucial flaw that will prevent policy gradient from ever being used to create an AGI. As far as I know, no agent powered purely by policy gradient does anything particularly impressive, though I could be wrong about that. Do you have any arguments / ideas for how policy gradient could be made to work here?
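For concreteness, the kind of update I have in mind when I say “purely policy gradient” is roughly the following sketch (toy single-state environment, made-up rewards; not any particular published implementation). The point is that every action in an episode gets its log-probability pushed by the same whole-episode return G, which is where the noisy, credit-assignment-free character comes from:

```python
import math
import random

# Rough REINFORCE sketch (toy single-state environment, made-up rewards).
# Every action in an episode gets its log-prob gradient scaled by the same
# total episode return G: no per-step credit assignment, high variance.

def softmax(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    s = sum(exps)
    return [e / s for e in exps]

n_actions = 2
theta = [0.0, 0.0]   # tabular policy parameters for a single state
alpha = 0.05         # learning rate

def run_episode(probs, length=5):
    actions = [random.choices(range(n_actions), weights=probs)[0] for _ in range(length)]
    # action 0 is better on average, but the return is dominated by noise
    G = sum(1.0 if a == 0 else 0.0 for a in actions) + random.gauss(0.0, 2.0)
    return actions, G

for _ in range(2000):
    probs = softmax(theta)
    actions, G = run_episode(probs)
    for a in actions:
        for a2 in range(n_actions):
            # grad of the log-probability of `a` w.r.t. theta[a2], for a softmax policy
            grad = (1.0 if a2 == a else 0.0) - probs[a2]
            theta[a2] += alpha * G * grad  # the same G multiplies every step

print(softmax(theta))  # tends (noisily, slowly) toward preferring action 0
```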
I guess, but I’m confused why we’re talking about competitiveness all of a sudden. I mean, variants on policy gradient algorithms (PPO, Actor-Critic, etc.) do some impressive things (at least to the extent any RL algorithms currently do impressive things). And I can imagine more sophisticated versions of even plain policy gradient that would do impressive things, if the action space includes mental actions like “sample a rollout from my world model”.
But that’s neither here nor there IMO. In the previous comment, I tried to be clear that it makes ~no difference to me where the gradients come from, when I said
Could be from rewards or other “external” feedback, could be from TD/bootstrapped errors, could be from an imitation loss or something else.
Because I think that the outline of my argument doesn’t depend on one specific RL algorithm.
Just to clarify, my model doesn’t predict that the AI will use introspection on its own value function, or even look at its own source code at all. Some AIs may be designed to do that, but it’s not required for the failure case I’m considering. If gradients are flowing backwards from the value function to the actor (Bot-specific statement alert: Actor-Critic bots), then the actor has probably absorbed a significant amount of information about the value function’s misalignment. I don’t think it needs to take any additional steps of inspecting the agent’s own weights, or anything like that. You may object that gradients need not necessarily flow backwards there. After all, policy gradient and temporal difference learning are a thing. Let’s merge the thread of discussion where you make that objection into the PolicyGradientBot discussion.
Consider ActorCriticBot. I would make the same argument for it as I did in my previous comment, that the actor is optimized by the critic’s outputs but that the actor is not optimizing for critic outputs. It does not particularly matter that Critic("I made counterfeit money") >= Critic("I got money for doing the task") or that Critic("Extremely out-of-distribution state that Critic happens to evaluate super highly") >> Critic("I got money for doing the task"), because the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best. It makes its decisions by looking at the current state and weighing the decision-factors triggered by that state; decision-factors that were internalized because they were upstream of actual positive feedback it received in the past from the environment+critic.
In the training distribution, the agent never reached a state like "I made counterfeit money", so the critic never gave feedback from that “misaligned” portion of the state space, so the actor never got gradients from the critic that differentially upweighted the actor’s concern for counterfeit money-specific factors, so the actor never internalized the particular antecedents of counterfeit money as motivating, so the actor doesn’t factor them into its decision-making. Whereas the actor does factor things like “Am I doing the task yet?” into its decisions, because the actor internalized those as antecedents of reward/value, because the actor got gradients from the critic that differentially upweighted the actor’s concern for those factors, because those were part of the state space covered by the training distribution, because the agent actually reached states where it got paid for e.g. task-completion.
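Here's a rough tabular sketch of the mechanism I'm describing (toy states, made-up dynamics, hypothetical names). The critic wildly overvalues a "counterfeit" state, but since the agent never reaches it during training, no gradient from that valuation ever touches the actor:

```python
import math
import random

# Tabular actor-critic sketch (toy states and dynamics, hypothetical names).
# V["counterfeit"] is absurdly overvalued by the critic, but the agent never
# reaches that state, so no gradient from that valuation ever touches the actor.

states = ["start", "did_task", "counterfeit"]
actions = ["work", "slack"]

V = {s: 0.0 for s in states}
V["counterfeit"] = 1e6  # critic flaw: a state it would score absurdly highly
policy_prefs = {s: {a: 0.0 for a in actions} for s in states}

def pi(s):
    prefs = policy_prefs[s]
    m = max(prefs.values())
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    Z = sum(exps.values())
    return {a: e / Z for a, e in exps.items()}

def step(s, a):
    # Toy dynamics: nothing here ever transitions into "counterfeit".
    if s == "start" and a == "work":
        return "did_task", 1.0
    return "start", 0.0

alpha, gamma = 0.1, 0.9
visited = set()
for _ in range(2000):
    s = "start"                       # one-step episodes from "start"
    probs = pi(s)
    a = random.choices(actions, weights=[probs[x] for x in actions])[0]
    s2, r = step(s, a)
    td_error = r + gamma * V[s2] - V[s]        # critic's feedback signal
    V[s] += alpha * td_error                   # critic update
    for a2 in actions:                         # actor update, on the visited state only
        grad = (1.0 if a2 == a else 0.0) - probs[a2]
        policy_prefs[s][a2] += alpha * td_error * grad
    visited.add(s)

print(visited)                        # {'start'}: only visited states produced actor gradients
print(pi("start"))                    # shifted toward "work"
print(policy_prefs["counterfeit"])    # untouched: {'work': 0.0, 'slack': 0.0}
```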
Comments on ThermodynamicBot
IMO, not only is “plug every possible h into U(h)” extremely computationally infeasible
To be clear, I’m not saying ThermodynamicBot does the computation the slow exponential way. I already explained how it could be done in polynomial time, at least for a world model that looks like a factor graph that’s a tree. Call this ThermodynamicBot-F. You could also imagine the role of “world model” being filled by a neural network (blob of weights) that approximates the full thermodynamic computation. We can call this ThermodynamicBot-N.
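As a sanity check on the polynomial-time claim, here is a sketch for the simplest tree, a chain: with pairwise factors (into which exp(U(h)/T)-style terms can be folded), the partition function Z computed by message passing matches the brute-force sum over all histories, but in time linear in the chain length. The specific factor is made up for illustration:

```python
import itertools

# Sanity check on the polynomial-time claim, for the simplest tree: a chain.
# With pairwise factors (exp(U/T)-style terms can be folded into them), the
# partition function Z from message passing matches the brute-force sum over
# all K**N histories, in time linear in N. The factor itself is made up.

K, N = 3, 6   # values per variable, chain length

def factor(x, y):
    return 1.0 + 0.5 * (x == y) + 0.1 * x * y

def Z_bruteforce():
    total = 0.0
    for h in itertools.product(range(K), repeat=N):
        w = 1.0
        for i in range(N - 1):
            w *= factor(h[i], h[i + 1])
        total += w
    return total

def Z_message_passing():
    msg = [1.0] * K   # weight of length-1 prefixes ending in each value
    for _ in range(N - 1):
        msg = [sum(msg[x] * factor(x, y) for x in range(K)) for y in range(K)]
    return sum(msg)

print(Z_bruteforce(), Z_message_passing())  # agree up to float rounding
```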
Yes, I understand that running a search that will kill you if it succeeds is dumb. This has been known for many years. The question is how do we actually write a program to do a sane search? You quote TurnTrout:
It would be far wiser to not consider all possible plans, and instead close off large parts of the search space. You can consider what plans to think about next, and how long to think, and so on. And then you aren’t argmaxing. You’re using resources effectively.
I don’t find this particularly helpful. If we know which plans are adversarial so we can eliminate them from the search space, we’re already halfway to solving alignment. I don’t think the plans a bounded agent is going to eliminate so that it can finish its thinking on time are automatically going to be the adversarial ones. I think this is a problem that is going to take actual effort.
In particular for ThermodynamicBot:
Case where the world model is implemented in a factor graph (ThermodynamicBot-F): This gives exactly the same result as searching across all inputs, but the computation is efficient, and not really wasteful in any sense. If we imagine trying to “improve” the belief propagation algorithm to simultaneously make it more efficient and also remove some subset of plans it’s searching over that are “adversarial”, I can’t really imagine a way to do that, and it would certainly make the algorithm more complicated and less elegant.
Case where a neural network world model is being used (ThermodynamicBot-N): In this case there are likely plans that will be missed by ThermodynamicBot-N because of the bounded nature of its world model, even though they would be found by searching across all inputs. But if we imagine training the world model to make it better, I would generally expect this to increase the world model’s ability to find adversarial plans just like it increases its ability to find good plans. In general, I don’t expect there to be any correlation where all the adversarial plans happen to be eliminated due to bounded reasoning. Why should we be so lucky that all the errors we’re making happen to cancel each other out?
I think the sorts of planning methods that try to approximate in the real world the behavior of “think about all possible plans and pick a good one” are unworkable in the limit, not just from an alignment standpoint but also from a practical capability standpoint, so I don’t expect us to build competent agents that use them, so I don’t worry about them or their attendant need for adversarial robustness.
I agree if we’re literally talking about brute force search here. If we’re talking about the more realistic ThermodynamicBot designs I’ve mentioned, then I’m not sure I agree. In some sense, all methods an agent could use to plan are “picking plans from plan-space that are better than most other plans”. Even ActorCriticBot is “trying” to approximate argmax. If we could train it to minimal loss, it would be an ArgMaxBot. Is there some particular approximation or heuristic that we can adopt, where if we do adopt it we go from dangerously approaching ArgMaxBot to safely searching through only good plans? An approximation used by ActorCriticBot, but not by ThermodynamicBot-N? If so, I have no idea what the crucial approximation is that you could be thinking of.
I also don’t think it’s at all obvious that ThermodynamicBot designs are necessarily capability-limited. It makes a lot of sense to integrate planning very closely with the world model. Might be worth betting on the direction of future RL research here if we can set sufficiently objective resolution criteria? In any case, I do think this counts as some progress in this discussion, since we’ve found an example of an agent that we both agree your argument doesn’t apply to.
Comments on PolicyGradientBot vs ActorCriticBot
In my view, there’s kind of a huge gulf between PolicyGradientBot and ActorCriticBot, where the gradients flowing backwards into ActorCriticBot’s actor end up carrying a lot of information. This allows for much better performance, and in particular much better sample efficiency, at the cost that some of the information is about weaknesses in ActorCriticBot’s critic.
To take a particular example, if the critic overvalues blue diamonds, then gradients flowing into the actor are going to be steeper for actions that obtain blue diamonds. Then in a new environment where there’s a bucket of blue paint sitting in the corner, it seems reasonable to expect that the actor might try to use that bucket to paint diamonds blue, at least assuming it’s sufficiently intelligent and flexible.
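As a made-up numerical version of this: in an advantage-weighted actor update, the size of the push toward an action scales with how highly the critic values its outcome, so an inflated critic score for blue-diamond outcomes translates directly into steeper gradients toward the actions that obtain them:

```python
# Made-up numbers: in an advantage-weighted actor update, the push toward an
# action scales with how highly the critic values its outcome, so an inflated
# critic score for blue-diamond outcomes means steeper gradients toward the
# actions that obtain them.

probs = {"get_diamond": 0.5, "get_blue_diamond": 0.5}
critic_value = {"get_diamond": 1.0, "get_blue_diamond": 3.0}  # critic overvalues blue

baseline = sum(p * critic_value[a] for a, p in probs.items())
for a, p in probs.items():
    advantage = critic_value[a] - baseline
    push = advantage * (1.0 - p)   # softmax log-prob gradient magnitude toward a
    print(a, push)
# -> get_diamond -0.5, get_blue_diamond 0.5: the actor gets pushed toward blue.
```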
For PolicyGradientBot on the other hand, while it could still result in alignment failures, it seems much more like we’re just directly training a policy. But PolicyGradientBot has very poor sample efficiency.
WRT other algorithms like temporal difference learning that lie kind of in between PolicyGradientBot and ActorCriticBot, I think the question of what happens for ActorCriticBot is already a crux in this discussion, but feel free to add more bot types if you think it would be useful.
Is ActorCriticBot robust?
the actor in actor-critic doesn’t make its decisions by running an internal CriticEstimator(plan) and doing whatever evaluates best
Again, I’m not saying a brute force search over plans is being done here, but I’d generally expect that what the actor is doing is very strongly linked to what the critic values, and I’d say it’s very likely that the actor has lots of components inside of it roughly related to the question “what is the critic going to think about this situation?” For example, if the critic consistently overvalues blue, then I’d predict that the actor has lots of circuits inside of it related to blueness. Do you disagree with this?
Obviously the actor’s ideas of what’s good aren’t going to be perfectly faithful to the critic: There will exist some adversarial plans that the actor just isn’t going to generate, but again the question is: Why should we be so lucky that the errors we’re making exactly cancel out? I don’t see any reason to expect that the actor’s imperfect approximation of the critic and critic’s imperfect approximation of our true desires should cancel out so well that the actor never generates any adversarial plans at all.