Suppose we design the LCDT agent with the “prior” that “After this decision right now, I’m just going to do nothing at all ever again, instead I’m just going to NOOP until the end of time.” And we design it to never update away from that prior. In that case, then the LCDT agent will not try to execute multi-step plans.
Whereas if the LCDT agent has the “prior” that it’s going to make future decisions using a similar algorithm as what it’s using now, then it would do the first step of a multi-step plan, secure in the knowledge that it will later proceed to the next step.
Thanks for the comment!
Your explanation of the paperclip factory is spot on. That being said, it is important to specify that the causal path to building the factory must have no agent in it, or the LCDT agent would think its actions don’t change anything.
The weird part (that I don’t personally know how to address) is deciding where the prior comes from. Most of the post argues that it doesn’t matter for our problems, but in this example (and other weird multi-step plans), it does.
If so, I’m concerned about capabilities here because I normally think that, for capabilities reasons, we’ll need reasoning to be a multi-step sequential process, involving thinking about different aspects in different ways. So if we do the first “prior”, where LCDT assumes that it’s going to NOOP forever starting 0.1 seconds from now, it won’t try to “think things through”, gather background knowledge etc. But if we do the more human-like “prior” where LCDT assumes that it’s going to make future decisions in a similar way as present decisions, then we’re back to long-term planning.
That’s a fair concern. Our point in the post is that LCDT can think things through when simulating other systems (like HCH) in order to imitate them, and so it should have strong capabilities there. But you’re right that it’s an issue for long-term planning if we expect an LCDT agent to directly solve problems.
Different topic: If the human’s “space of possible actions” at t=1 depends on the LCDT agent’s action at t=0, then I’m confused about how the LCDT agent is supposed to pretend that the human’s decision is independent of its current choice.
The technical answer is that the LCDT agent computes its distribution over action spaces for the human by marginalizing the human’s current distribution with the LCDT agent’s distribution over its own action. The intuition is something like: “I believe that the human already has some model of which action I will take, and nothing I can do will change that.”
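To make that concrete, here’s a minimal numeric sketch of the marginalization (the action names and all probabilities are invented for illustration):

```python
# The LCDT agent's prior over its own action induces a fixed distribution over
# the human's response; at decision time, no chosen action can shift it.
agent_prior = {"act_A": 0.9, "act_B": 0.1}  # hypothetical prior over own actions

# Hypothetical model of the human's action given each possible agent action.
human_given_agent = {
    "act_A": {"left": 1 / 3, "right": 1 / 3, "wave": 1 / 3},
    "act_B": {"left": 0.0, "right": 0.0, "wave": 1.0},
}

# Marginalize: P(human action) = sum over a of P(own action = a) * P(human | a).
human_marginal = {}
for a, p_a in agent_prior.items():
    for h, p_h in human_given_agent[a].items():
        human_marginal[h] = human_marginal.get(h, 0.0) + p_a * p_h

print(human_marginal)  # ≈ {'left': 0.3, 'right': 0.3, 'wave': 0.4}
# The LCDT agent treats this fixed distribution as the human's behaviour,
# whichever action it actually picks.
```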
I’m with Steve in being confused how this works in practice.
Let’s say I’m an LCDT agent, and you’re a human flying a kite.
My action set: [Say “lovely day, isn’t it?”] [Burn your kite]
Your action set: [Move kite left] [Move kite right] [Angrily gesticulate]
Let’s say I initially model you as having p = 1⁄3 of each option, based on your expectation of my actions. Now I decide to burn your kite. What should I imagine will happen? If I burn it, your kite pointers are dangling. Do the [Move kite left] and [Move kite right] actions become NOOPs? Do I assume that my [burn kite] action fails?
I’m clear on ways you could technically say I didn’t influence the decision—but if I can predict I’ll have a huge influence on the output of that decision, I’m not sure what that buys us. (and if I’m not permitted to infer any such influence, I think I just become a pure nihilist with no preference for any action over any other)
In your example (and Steve’s example), you believe that the human’s action (and action space) will depend only on your prior over your own decision (which you can’t control). So yes, in this situation you are actually indifferent, because you don’t think anything you do will change the result.
This basically points at the issue with approval-direction (or even asking a human to open a door); our counterargument is to use LCDT agents as simulators of agents, where the myopia mostly guarantees that they will not alter what they’re simulating.
(A subtlety I just noticed is that to make an LCDT agent change its model of an agent, you must create a task where its evaluation isn’t through influencing the actions of the agent, but through some other “measure” of whether the model is better. Not insurmountable, but a point to keep in mind.)
Ok, so if I understand you correctly (and hopefully I don’t!), you’re saying that as an LCDT agent I believe my prior determines my prediction of:
1) The distribution over action spaces of the human.
2) The distribution over actions the human would take given any particular action space.
So in my kite example, let’s say my prior has me burn your kite with 10% probability. So I believe that you start out with:
0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]
0.1 chance of the action set [Angrily gesticulate]
In considering my [burn kite] option, I must believe that taking the action doesn’t change your distribution over action sets—i.e. that after I do [burn kite] you still have a 0.9 chance of the action set [Move kite left] [Move kite right] [Angrily gesticulate]. So I must believe that [burn kite] does nothing.
Is that right so far, or am I missing something?
Similarly, I must believe that any action I can take that would change the distribution over action sets of any agent at any time in the future must also do nothing. That doesn’t seem to leave much (or rather it seems to leave nothing in most worlds).
To put it another way, I don’t think the intuition works for action-set changes the way it does for decision-given-action-set changes. I can coherently assume that an agent ignores the consequences of my actions in its decision-given-an-action-set, since that only requires I assume something strange about its thinking. I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model.
It’s not clear to me how the simulator-of-agents approach helps, but I may just be confused. Currently the only coherent LCDT agent I can make sense of is trivial.
I’m confused, because while your description is correct (except for your conclusion at the end), I already say that in discussing the approval-direction problem: LCDT agents cannot believe in ANY influence of their actions on other agents.
For the world-model, it’s not actually incoherent because we cut the link and update the distribution of the subsequent agent.
And for usefulness/triviality when simulating or being overseen, LCDT doesn’t need to influence an agent, and so it will do its job while not being deceptive.
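For concreteness, here’s a toy rendering of this cut-and-marginalize step on the kite example (the graph structure, numbers, and names are all invented; this is just one way the mechanics could look):

```python
# Chain: D (LCDT decision) -> K (kite exists) -> H (human's action).
P_BURN_PRIOR = 0.1                       # hypothetical prior over the decision

def p_kite(burn: bool) -> float:
    return 0.01 if burn else 0.99        # K is a physical node, kept causal

def p_human_flies(p_kite_exists: float) -> float:
    return 0.9 * p_kite_exists           # H: human keeps flying if kite exists

def lcdt_evaluate(chosen_burn: bool):
    # Non-agent node K: computed from the *chosen* action, CDT-style.
    k = p_kite(chosen_burn)
    # Agent node H: the link from D is cut, so H is computed by marginalizing
    # D under the prior rather than conditioning on the chosen action.
    k_under_prior = (P_BURN_PRIOR * p_kite(True)
                     + (1 - P_BURN_PRIOR) * p_kite(False))
    h = p_human_flies(k_under_prior)
    return k, h

print(lcdt_evaluate(chosen_burn=True))   # ≈ (0.01, 0.80): the kite node updates
# with the chosen action, while the human node is held at its prior-derived value.
```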
LCDT agents cannot believe in ANY influence of their actions on other agents.
And my point is simply that once this is true, they cannot (coherently) believe in any influence of their actions on the world (in most worlds).
In (any plausible model of) the real world, any action taken that has any consequences will influence the distribution over future action sets of other agents.
I.e. I’m saying that [plausible causal world model] & [influences no agents] ⇒ [influences nothing]
So the only ways I can see it ‘working’ are:
1) To agree it always influences nothing (I must believe that any action I take as an LCDT agent does precisely nothing).
or
2) To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you’re still flying that probably-non-existent kite.
So I don’t see how an LCDT agent makes any reliable predictions.
[EDIT: if you still think this isn’t a problem, and that I’m confused somewhere (which I may be), then I think it’d be helpful if you could give an LCDT example where:
The LCDT agent has an action x which alters the action set of a human.
The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human’s action.
(of course I’m not saying the conclusions should be rational—just that they shouldn’t be nonsense)]
To have an incoherent world model: one in which I can believe with 99% certainty that a kite no longer exists, and with 80% certainty that you’re still flying that probably-non-existent kite.
I feel pretty willing to bite the bullet on this—what sorts of bad things do you think LCDT agents would do given such a world model (at decision time)? Such an LCDT agent should still be perfectly capable of tasks like simulating HCH without being deceptive—and should still be perfectly capable of learning and improving its world model, since the incoherence only shows up at decision-time and learning is done independently.
Ah ok. Weird, but ok. Thanks.
Perhaps I’m now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action. More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?
I need to think more about the rest.
[EDIT and see rambling reply to Adam re ways to avoid the incoherence. TLDR: I think placing a [predicted agent action set alterations] node directly after the LCDT decision node in the original causal diagram, deducing what can be deduced from that node, and treating it as an agent at decision-time might work. It leaves the LCDT agent predicting that many of its actions don’t do much, but it does get rid of the incoherence (I think). Currently unclear whether this throws the baby out with the bathwater; I don’t think it does anything about negative side-effects]
Perhaps I’m now understanding correctly(??). An undesirable action that springs to mind: delete itself to free up disk space. Its future self is assumed to give the same output regardless of this action.
More generally, actions with arbitrarily bad side-effects on agents, to gain marginal utility. Does that make sense?
Sure—that’s totally fine. The point of LCDT isn’t to produce an aligned agent, but to produce an agent that’s never deceptive. That way, if your AI is going to delete itself to free up disk space, it’ll do it in training and you can see that it’s going to do that and correct that behavior.
With an LCDT agent, the idea is that if it does the right thing in training, you know it’s not just doing that because it’s trying to trick you and it’ll actually do the wrong thing later in deployment. The point of LCDT, in my head, is to give you (defects in deployment) iff (defects in training), that way you can just take your LCDT agent and tweak it until it does the right thing in training, and then not be surprised by it actually having been deceiving you and defecting when you deploy it.
Right, as far as I can see, it achieves the won’t-be-deceptive aim. My issue is in seeing how we find a model that will consistently do the right thing in training (given that it’s using LCDT).
As I understand it, under LCDT an agent is going to accept an arbitrarily bad outcome on agent-influencing paths in exchange for an epsilon utility gain on non-agent-influencing paths (since by design it doesn’t care about the former). So it seems that it’s going to behave unacceptably for almost all goals in almost all environments in which there can be negative side-effects on agents we care about.
We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.
Quite possibly I’m still missing something, but I don’t currently see how the LCDT decisions do much useful work here (Am I wrong? Do you see LCDT decisions doing significant optimisation?). I can picture its being a useful wrapper around a simulation, but it’s not clear to me in what ways finding a non-deceptive (/benign) simulation is an easier problem than finding a non-deceptive (/benign) agent. (maybe side-channel attacks are harder??)
My issue is in seeing how we find a model that will consistently do the right thing in training (given that it’s using LCDT).
How about an LCDT agent with the objective of imitating HCH? Such an agent should be aligned and competitive, assuming the same is true of HCH. Such an agent certainly shouldn’t delete itself to free up disk space, since HCH wouldn’t do that—nor should it fall prey to the general argument you’re making about taking epsilon utility in a non-agent path, since there’s only one utility node it can influence without going through other agents, which is the delta between its next action and HCH’s action.
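As a sketch of what that single utility node might look like (the distance measure and all names here are mine, not from the post):

```python
def imitation_utility(agent_action_dist, hch_action_dist):
    """Utility = negative delta between the agent's action distribution and the
    (fixed, prior-derived) prediction of the HCH model's action distribution."""
    actions = agent_action_dist.keys() | hch_action_dist.keys()
    delta = sum(abs(agent_action_dist.get(a, 0.0) - hch_action_dist.get(a, 0.0))
                for a in actions)
    return -delta

# The HCH nodes are agents, so at decision time their predicted output is fixed
# by the prior; the LCDT agent can still pick its own action to match it.
hch_pred = {"answer_A": 0.7, "answer_B": 0.3}   # hypothetical HCH prediction
my_action = {"answer_A": 1.0}
print(imitation_utility(my_action, hch_pred))   # ≈ -0.6
```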
We can use it to run simulations, but it seems to me that most problems (deception in particular) get moved to the simulation rather than solved.
I claim that, for a reasonably accurate HCH model that’s within some broad basin of attraction, an LCDT agent attempting to imitate that HCH model will end up aligned—and that the same is not true for any other decision theory/agent model that I know of. And LCDT can do this while being able to manage things like how to simulate most efficiently and how to allocate resources between different methods of simulation. The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way.
Ok thanks, I think I see a little more clearly where you’re coming from now. (it still feels potentially dangerous during training, but I’m not clear on that)
A further thought:
The core idea is that LCDT solves the hard problem of being able to put optimization power into simulating something efficiently in a safe way
Ok, so suppose for the moment that HCH is aligned, and that we’re able to specify a sufficiently accurate HCH model. The hard part of the problem seems to be safe-and-efficient simulation of the output of that HCH model. I’m not clear on how this part works: for most priors, it seems that the LCDT agent is going to assign significant probability to its creating agentic elements within its simulation. But by assumption, it doesn’t think it can influence anything downstream of those (or the probability that they exist, I assume).
That seems to be the place where LCDT needs to do real work, and I don’t currently see how it can do so efficiently. If there are agentic elements contributing to the simulation’s output, then it won’t think it can influence the output. Avoiding agentic elements seems impossible almost by definition: if you can create an arbitrarily accurate HCH simulation without its qualifying as agentic, then your test-for-agents can’t be sufficiently inclusive.
But by assumption, it doesn’t think it can influence anything downstream of those (or the probability that they exist, I assume).
This is not true—LCDT is happy to influence nodes downstream of agent nodes, it just doesn’t believe it can influence them through those agent nodes. So LCDT (at decision time) doesn’t believe it can change what HCH does, but it’s happy to change what it does to make it agree with what it thinks HCH will do, even though that utility node is downstream of the HCH agent nodes.
Ah yes, you’re right there—my mistake.
However, I still don’t see how LCDT can make good decisions over adjustments to its simulation. That simulation must presumably eventually contain elements classed as agentic. Then given any adjustment X which influences the simulation outcome both through agentic paths and non-agentic paths, the LCDT agent will ignore the influence [relative to the prior] through the agentic paths. Therefore it will usually be incorrect about what X is likely to accomplish.
It seems to me that you’ll have incoherence issues here too: X can change things so that p(Y = 0) is 0.99 through a non-agentic path, whereas the agent assumes the equivalent of [p(Y = 0) is 0.5] through an agentic path.
I don’t see how an LCDT agent can make efficient adjustments to its simulation when it won’t be able to decide rationally on those judgements in the presence of agentic elements (which again, I assume must exist to simulate HCH).
That’s a really interesting thought—I definitely think you’re pointing at a real concern with LCDT now. Some thoughts:
Note that this problem is only with actually running agents internally, not with simply having the objective of imitating/simulating an agent—it’s just that LCDT will try to simulate that agent exclusively via non-agentic means.
That might actually be a good thing, though! If it’s possible to simulate an agent via non-agentic means, that certainly seems a lot safer than internally instantiating agents—though it might just be impossible to efficiently simulate an agent without instantiating any agents internally, in which case it would be a problem.
In some sense, the core problem here is just that the LCDT agent needs to understand how to decompose its own decision nodes into individual computations so it can efficiently compute things internally and then know when and when not to label its internal computations as agents. How to decompose nodes into subnodes to properly work with multiple layers is a problem with all CDT-based decision theories, though—and it’s hopefully the sort of problem that finite factored sets will help with.
Ok, that mostly makes sense to me. I do think that there are still serious issues (but these may be due to my remaining confusions about the setup: I’m still largely reasoning about it “from outside”, since it feels like it’s trying to do the impossible).
For instance:
I agree that the objective of simulating an agent isn’t a problem. I’m just not seeing how that objective can be achieved without the simulation taken as a whole qualifying as an agent. Am I missing some obvious distinction here?
If for all x in X, sim_A(x) = A(x), then if A is behaviourally an agent over X, sim_A seems to be also. (Replacing equality with approximate equality doesn’t seem to change the situation much in principle.)
[Pre-edit: Or is the idea that we’re usually only concerned with simulating some subset of the agent’s input->output mapping, and that a restriction of some function may have different properties from the original function? (agenthood being such a property)]
I can see that it may be possible to represent such a simulation as a group of nodes none of which is individually agentic—but presumably the same could be done with a human. It can’t be ok for LCDT to influence agents based on having represented them as collections of individually non-agentic components.
Even if sim_A is constructed as a Chinese room (w.r.t. agenthood), it’s behaving collectively as an agent.
“it’s just that LCDT will try to simulate that agent exclusively via non-agentic means”—mostly agreed, and agreed that this would be a good thing (to the extent possible). However, I do think there’s a significant difference between e.g.:
[LCDT will not aim to instantiate agents] (true)
vs
[LCDT will not instantiate agents] (potentially false: they may be side-effects)
Side-effect-agents seem plausible if e.g.:
a) The LCDT agent applies adjustments over collections within its simulation.
b) An adjustment taking [useful non-agent] to [more useful non-agent] also sometimes takes [useful non-agent] to [agent].
Here it seems important that LCDT may reason poorly if it believes that it might create an agent. I agree that pre-decision-time processing should conclude that LCDT won’t aim to create an agent. I don’t think it will conclude that it won’t create an agent.
Agreed that finite factored sets seem promising to address any issues that are essentially artefacts of representations. However, the above seem more fundamental, unless I’m missing something.
Assuming this is actually a problem, it struck me that it may be worth thinking about a condition vaguely like:
An LCDT_n agent cuts links at decision time to every agent other than [LCDT_m agents where m > n].
The idea being to specify a weaker condition that does enough forwarding-the-guarantee to allow safe instantiation of particular types of agent while still avoiding deception.
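As a sketch, the cut rule might be encoded like this (the node attributes are invented; this just restates the condition above):

```python
from dataclasses import dataclass

@dataclass
class Node:
    is_agent: bool
    is_lcdt: bool = False
    level: int = 0  # the m in LCDT_m, where applicable

def cuts_link_to(n: int, node: Node) -> bool:
    # An LCDT_n agent cuts the link to every agent except LCDT_m agents, m > n.
    if not node.is_agent:
        return False
    return not (node.is_lcdt and node.level > n)

print(cuts_link_to(1, Node(is_agent=True)))                         # True: ordinary agent
print(cuts_link_to(1, Node(is_agent=True, is_lcdt=True, level=2)))  # False: LCDT_2 is spared
print(cuts_link_to(1, Node(is_agent=True, is_lcdt=True, level=1)))  # True: m must exceed n
```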
I’m far from clear that anything along these lines would help: it probably doesn’t work, and it doesn’t seem to solve the side-effect-agent problem anyway: [complete indifference to influence on X] and [robustly avoiding creation of X] seem fundamentally incompatible.
[EDIT: if you still think this isn’t a problem, and that I’m confused somewhere (which I may be), then I think it’d be helpful if you could give an LCDT example where:
The LCDT agent has an action x which alters the action set of a human.
The LCDT agent draws coherent conclusions about the combined impact of x and its prediction of the human’s action.
(of course I’m not saying the conclusions should be rational—just that they shouldn’t be nonsense)]
There is no such example. The confusion I feel you have is not about what LCDT does in such cases, but about whether such cases need to be solved for LCDT to be competitive and valuable. As Evan points out in his comment, simulating HCH or anything really doesn’t require altering the action set of a human/agent. And if some actions can do that, LCDT ends up having no incentive to do anything to change the human/agent, which is exactly what we want. That’s really the crux here, IMO.
Also, I feel part of the misunderstanding hinges on what I mention in this comment answering Steve.
[Pre-emptive apologies for the stream-of-consciousness: I made the mistake of thinking while I wrote. Hopefully I ended up somewhere reasonable, but I make no promises]
simulating HCH or anything really doesn’t require altering the action set of a human/agent
My point there wasn’t that it requires it, but that it entails it. After any action by the LCDT agent, the distribution over future action sets of some agents will differ from those same distributions based on the prior (perhaps very slightly).
E.g. if I burn your kite, your actual action set doesn’t involve kite-flying; your prior action set does. After I take the [burn kite] action, my prediction of [kite exists] doesn’t have a reliable answer.
If I’m understanding correctly (and, as ever, I may not be), this is just to say that it’d come out differently based on the way you set up the pre-link-cutting causal diagram. If the original diagram effectively had [kite exists iff Adam could fly kite], then I’d think it’d still exist after [burn kite]; if the original had [kite exists iff Joe didn’t burn kite] then I’d think that it wouldn’t.
In the real world, those two setups should be logically equivalent. The link-cutting breaks the equivalence. Each version of the final diagram functions in its own terms, but the answer to [kite exists] becomes an artefact of the way we draw the initial diagram. (I think!)
In this sense, it’s incoherent (so Evan’s not claiming there’s no bullet, but that he’s biting it); it’s just less clear that it matters that it’s incoherent.
I still tend to think that it does matter—but I’m not yet sure whether it’s just offending my delicate logical sensibilities, or if there’s a real problem.
For instance, in my reply to Evan, I think the [delete yourself to free up memory] action probably looks good if there’s e.g. an [available memory] node directly downstream of the [delete yourself...] action. If instead the path goes [delete yourself...] --> [memory footprint of future self] --> [available memory], then deleting yourself isn’t going to look useful, since [memory footprint...] shouldn’t change.
Perhaps it’d work in general to construct the initial causal diagrams in this way: You route maximal causality through agents, when there’s any choice. So you then tend to get [LCDT action] --> [Agent action-set-alteration] --> [Whatever can be deduced from action-set-alteration].
You couldn’t do precisely this in general, since you’d need backwards-in-time causality—but I think you could do some equivalent. I.e. you’d put an [expected agent action set distribution] node immediately after the LCDT decision, treat that like an agent at decision time, and deduce values of intermediate nodes from that.
So in my kite example, let’s say you’ll only get to fly your kite (if it exists) two months from my decision, and there’s a load of intermediate nodes. But directly downstream of my [burn kite] action we put a [prediction of Adam’s future action set] node. All of the causal implications of [burn kite] get routed through the action set prediction node.
Then at decision time the action-set prediction node gets treated as part of an agent, and there’s no incoherence. (but I predict that my [burn kite] fails to burn your kite)
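A toy version of that construction, with everything invented for illustration:

```python
# Route every consequence of the decision through a "predicted action-set" node
# P that is treated as an agent, so the LCDT cut lands right after the decision.
P_BURN_PRIOR = 0.1

def predicted_action_set(burn: bool) -> dict:
    # P: maps the decision to Adam's predicted future action set.
    if burn:
        return {"gesticulate": 1.0}
    return {"move_left": 1 / 3, "move_right": 1 / 3, "gesticulate": 1 / 3}

def lcdt_evaluate(chosen_burn: bool) -> dict:
    # P is classed as an agent, so its value is marginalized under the prior,
    # ignoring chosen_burn entirely; downstream nodes are deduced from P.
    weighted = [(P_BURN_PRIOR, predicted_action_set(True)),
                (1 - P_BURN_PRIOR, predicted_action_set(False))]
    marginal = {}
    for w, dist in weighted:
        for a, p in dist.items():
            marginal[a] = marginal.get(a, 0.0) + w * p
    return marginal

print(lcdt_evaluate(True) == lcdt_evaluate(False))  # True: no incoherence, but
# the decision now looks like it does almost nothing, as noted below.
```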
Anyway, quite possibly doing things this way would have a load of downsides (or perhaps it doesn’t even work??), but it seems plausible to me.
My remaining worry is whether getting rid of the incoherence in this way is too limiting—since the LCDT agent gets left thinking its actions do almost nothing (given that many/most actions would be followed by nodes which negate their consequences relative to the prior).
[I’ll think more about whether I’m claiming much/any of this impacts the simulation setup (beyond any self-deletion issues)]
For the world-model, it’s not actually incoherent because we cut the link and update the distribution of the subsequent agent.
I’m gonna see if I can explain this in more detail—you can correct me if I’m wrong.
In common sense, I would say “Suppose I burn the kite. What happens in the future? Is it good or bad? OK, suppose I don’t burn the kite. What happens in the future? Is it good or bad?” And then decide on that basis.
But that’s EDT.
CDT is different.
In CDT I can have future expectations that follow logically from burning the kite, but they don’t factor in as considerations, because they don’t causally flow from the decision according to the causal diagram in my head.
The classic example is the smoking lesion. Smoking lesion is a pretty intuitive example for us to think about, because it involves a plausible causal diagram of the world.
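For concreteness, here’s a toy numeric version of the lesion setup (all numbers invented) showing how conditioning (EDT) and intervening (CDT) come apart:

```python
# A lesion causes both smoking and cancer; smoking itself is harmless here.
P_LESION = 0.5
P_SMOKE_GIVEN_LESION, P_SMOKE_GIVEN_NO_LESION = 0.8, 0.2
P_CANCER_GIVEN_LESION, P_CANCER_GIVEN_NO_LESION = 0.9, 0.0
U_SMOKE, U_CANCER = 1.0, -10.0   # smoking is mildly nice; cancer is very bad

def edt_value(smoke: bool) -> float:
    # Condition on the action: infer P(lesion | smoke) by Bayes.
    p_s = P_LESION * P_SMOKE_GIVEN_LESION + (1 - P_LESION) * P_SMOKE_GIVEN_NO_LESION
    p_act = p_s if smoke else 1 - p_s
    p_lesion = P_LESION * (P_SMOKE_GIVEN_LESION if smoke
                           else 1 - P_SMOKE_GIVEN_LESION) / p_act
    p_cancer = (p_lesion * P_CANCER_GIVEN_LESION
                + (1 - p_lesion) * P_CANCER_GIVEN_NO_LESION)
    return (U_SMOKE if smoke else 0.0) + p_cancer * U_CANCER

def cdt_value(smoke: bool) -> float:
    # Intervene: do(smoke) leaves P(lesion) at its prior.
    p_cancer = P_LESION * P_CANCER_GIVEN_LESION
    return (U_SMOKE if smoke else 0.0) + p_cancer * U_CANCER

print(edt_value(True) - edt_value(False))  # ≈ -4.4: EDT refuses to smoke
print(cdt_value(True) - cdt_value(False))  # 1.0: CDT smokes (the standard verdict)
```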
Here we’re taking the same idea, but I (=the LCDT agent) have a wildly implausible causal diagram of the world. “If I burn the kite, then the person won’t move the kite, but c’mon, that’s not because I burned the kite!”
Just like the smoking lesion, I have the idea that the kite might or might not be there, but that’s a fact about the world that’s somehow predetermined before decision time, not because of my decision, and therefore doesn’t factor into my decision.
…Maybe. Did I get that right?
Anyway, I usually think of a world-model as having causality in it, as opposed to causal diagrams being a separate layer that exists on top of a world model. So I would disagree with “not actually incoherent”. Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent.
Specifically, I think if an agent can do the kind of reasoning that would allow it to create a causal world-model in the first place, then the same kind of reasoning would lead it to realize that there is in fact supposed to be a link at each of the places where we manually cut it—i.e., that the causal world-model is incoherent.
An LCDT agent should certainly be aware of the fact that those causal chains actually exist—it just shouldn’t care about that. If you want to argue that it’ll change to not using LCDT to make decisions anymore, you have to argue that, under the decision rules of LCDT, it will choose to self-modify in some particular situation—but LCDT should rule out its ability to ever believe that any self-modification will do anything, thus ensuring that, once an agent starts making decisions using LCDT, it shouldn’t stop.
In addition to Evan’s answer (with which I agree), I want to make explicit an assumption I realized after reading your last paragraph: we assume that the causal graph is the final result of the LCDT agent consulting its world model to get a “model” of the task at hand. After that point (which includes drawing the causal links and how the distributions impact each other, as well as the sources’ distributions), the LCDT agent decides based only on this causal graph. In this case it cuts the causal links to agents and then decides CDT-style.
None of this results in an incoherent world model, because the additional knowledge that could be used to realize that the cuts are not “real” is not available in the truncated causal model, and thus cannot be accessed while making the decision.
I honestly feel this is the crux of our talking past each other (same with Joe) in the last few comments. Do you think that’s right?