Joe asked me in this comment about the incoherence he sees in LCDT's world models. To illustrate his point on incoherence, Joe gives a kite example:
Let’s say I initially model you as having p = 1⁄3 of each option, based on your expectation of my actions.
Now I decide to burn your kite.
What should I imagine will happen? If I burn it, your kite pointers are dangling.
Do the [Move kite left] and [Move kite right] actions become NOOPs?
Do I assume that my [burn kite] action fails?
My take is that there is indeed a problem that ‘your kite pointers are
dangling’ in the projection that the LCDT world model will compute, so
the projected world will be somewhat weird.
In my mental picture of the most obvious way to implement LCDT and the
structural functions attached to the LCDT model, the projection will be weird
in the following way. After [burn kite], the action [Move kite left],
when applied to the world state produced by [burn kite], will produce a world state where the
human is miming that they are flying a kite. They will make the right
gestures to move an invisible kite left, they might even be holding a
kite rope when making the gestures, but the rope will not be connected
to an actual kite.
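To make this concrete, here is a minimal toy sketch of the projection I have in mind. The node structure, mechanisms, and names below are entirely my own illustrative assumptions, not code or definitions from the LCDT post:

```python
import random

# Toy structural causal model of the kite scenario.
PRIOR_ACTIONS = ["burn kite", "move kite left", "move kite right"]  # p = 1/3 each

def kite_state(my_action):
    # Physical mechanism: downstream of my action via a non-agent path,
    # so LCDT leaves this causal link intact.
    return "burned" if my_action == "burn kite" else "intact"

def human_action(kite_as_seen):
    # The human's decision node: they fly the kite if they see one.
    return "move kite left" if kite_as_seen == "intact" else "stand around"

def project_lcdt(my_action):
    # LCDT cuts the link from my decision to the human's decision node,
    # so the human's node is evaluated against the prior over my actions,
    # not against the action actually being scored.
    physical_kite = kite_state(my_action)
    kite_in_humans_input = kite_state(random.choice(PRIOR_ACTIONS))
    return physical_kite, human_action(kite_in_humans_input)

random.seed(0)
print(project_lcdt("burn kite"))
# With probability 2/3 this prints ('burned', 'move kite left'): the human
# makes the gestures for moving a kite that my action has already destroyed.
```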
So this is weird. However, I would not call it ‘incoherent’ or
‘requiring a contradiction’ as Joe does:
I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model.
The phrasing ‘contradiction in the world model’ evokes the concern
that the LCDT-constructed world model might crash or not be solvable
when we use it to score the action [burn kite]. But a nice feature of
causal models, even counterfactual ones as generated by LCDT, is that
they will never crash: they will always compute a future reward score
for any possible candidate action or policy. The score may, however, be
weird. There is a potential GIGO problem here.
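A self-contained illustration of the ‘never crashes’ point, again with made-up mechanisms and a made-up reward function of my own invention: evaluating the projection and scoring it is just ordinary function evaluation, so some number always comes out.

```python
def project(my_action):
    # Compressed version of the projection sketched earlier: the physical
    # kite reflects my action, the human's decision does not (link cut).
    kite = "burned" if my_action == "burn kite" else "intact"
    human = "move kite left"  # the human acts on the prior (kite intact)
    return kite, human

def score(world):
    # Illustrative reward (my assumption): the human happily flying a kite.
    return 1.0 if world == ("intact", "move kite left") else 0.0

for action in ["burn kite", "do nothing"]:
    print(action, "->", score(project(action)))
# Every candidate action gets a well-defined number; nothing crashes.
# The numbers can still be garbage if the projected worlds are garbage.
```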
The word ‘incoherent’ evokes the concern that the model will be so
twisted that we should expect weird scores to be computed more
often than not. If so, the agent actions computed may be ineffective,
strangely inappropriate, or even dangerous when applied to the
real world.
In other words: garbage world model in, garbage agent decision out.
One specific worry discussed
here
is that a counterfactual model may output potentially dangerous
garbage because it pushes the inputs of the structural functions being
used way out of training distribution.
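A quick sketch of that failure mode, with a fitted polynomial as a stand-in for a learned structural function (the setup is my own, purely for illustration):

```python
import numpy as np

# A structural function learned from data can behave arbitrarily badly when
# a counterfactual edit pushes its input far outside the training range.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)          # training inputs all lie in [0, 1]
y = np.sin(2 * np.pi * x)           # true mechanism on-distribution
coeffs = np.polyfit(x, y, deg=5)    # 'learned' structural function

print(np.polyval(coeffs, 0.5))      # in-distribution: a sensible value near 0
print(np.polyval(coeffs, 10.0))     # way out of distribution: explodes
```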
That being said, there can be advantages to imperfection too. If we
design just the right kind of ‘garbage’ into the agent’s world model,
we may be able to suppress certain dangerous agent incentives, while
still having an agent that is otherwise fairly good at doing the job we intend
it to do. This is what LCDT is doing, for certain agent jobs, and it
is also what my counterfactual planning agent designs
here are doing,
for certain other agent jobs.
Still, it is clear (from the comments, and I think also from the
original post) that most feel that applying LCDT does not produce useful outcomes
for all possible jobs we would want agents to do. Notably, when
applied to a decision making problem where the agent has to come up
with a multi-step reward-maximizing policy/plan, i.e. a typical MDP or
RL benchmark problem, LCDT will produce an agent with hugely
impaired planning ability. How hugely impaired will depend in part on the prior
used.
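The impairment is easy to see in miniature. In the toy two-step problem below (my own construction, not an example from the post), the agent is rewarded only if both of its own decisions pick the same option; LCDT cuts the link from the current decision to the agent's own future decision node and substitutes the prior, so it cannot fully coordinate with itself:

```python
import itertools

# Reward 1 only if both steps pick "right": coordinating with your own
# future decision is essential here.
def reward(a1, a2):
    return 1.0 if (a1, a2) == ("right", "right") else 0.0

actions = ["left", "right"]
prior = {a: 0.5 for a in actions}   # prior over the future decision node

# A standard planner optimizes over the joint plan and finds the value-1 plan.
best_plan = max(itertools.product(actions, actions),
                key=lambda plan: reward(*plan))

# An LCDT-style planner cuts the link to its own future decision and
# evaluates the current action against the prior instead.
def lcdt_value(a1):
    return sum(prior[a2] * reward(a1, a2) for a2 in actions)

print(best_plan)                              # ('right', 'right'): value 1.0
print(max(actions, key=lcdt_value),
      max(lcdt_value(a) for a in actions))    # 'right', but valued at only 0.5
# A prior that puts less mass on "right" would make the plan look even worse,
# which is why the size of the impairment depends on the prior used.
```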
Evan’s take is that he is not too concerned with this, as he has other
agent applications in mind:
an LCDT agent should still be perfectly capable of tasks like simulating HCH
i.e. we can apply LCDT when building an imitation learner, which is
different from a reinforcement learner. In the argmax HCH examples
above, the agent furthermore is not imitating a human mentor who is
present in the real agent environment, but a simulated mentor built up
out of simulated humans consulting simulated humans.
On a philosophical thought-experiment level, this combination of LCDT
and HCH works for me; it is even elegant. But in applied safety
engineering terms, I see several risks with using HCH. For example,
if the learned model of humans that the agent uses in HCH calculations
is not perfect, then the recursive nature of HCH might amplify these
imperfections rather than dampen them, producing outcomes that are
very much unaligned. Also, on a more moral-philosophical point: might
all these simulated humans become aware that they live in a
simulation, and if so, will they then seek to take sweet revenge on the
people who put them there?
Back to the topic of incoherence. Joe also asks:
Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I’m just wrong about the incoherence??)
I see LCDT agents as a subset of all possible counterfactual planning
agent architectures, so in that sense there is no difference.
However, in my
sequence and
paper on counterfactual planning, I construct planning worlds by using
quite different world model editing steps than those considered in
LCDT. These different steps produce different results in terms of the
weirdness or garbage-ness of the planning world model.
The editing step I use in the main examples of counterfactual
planning is to edit the real world model into a planning world model that
has a different agent compute core in it, while leaving the physical world outside
of the compute core unchanged. Specifically, the planning world
models I considered do not accurately depict the software running inside the agent
compute core; instead, they depict a compute core running different software.
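Here is a toy rendering of that editing step (again my own illustration, not code from the sequence or paper): every physical mechanism is shared between the two models, and only the function at the compute core node differs.

```python
def physics(state, action):
    # Physical mechanism, identical in the real-world and planning-world models.
    return state + (1 if action == "add" else -1)

def real_core(state):
    # Models the software actually running in the deployed compute core.
    return "add"

def candidate_core(state):
    # Different software in the same core: the counterfactual edit.
    return "add" if state < 3 else "subtract"

def rollout(core, state=0, steps=5):
    # The same physics is rolled forward in both models; only the core differs.
    for _ in range(steps):
        state = physics(state, core(state))
    return state

print(rollout(real_core))       # real-world projection: 5
print(rollout(candidate_core))  # planning-world projection: 3
```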
In terms of plausibility and internal consistency, a compute core
running different software is more plausible/coherent than what can
happen in the models constructed by LCDT.
As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
You are right in both cases, at least if we picture coherence as a sliding scale,
not as a binary property. It also depends on the world model you
start out with, of course.
Thanks, that’s interesting. [I did mean to reply sooner, but got distracted]
A few quick points:
Yes, by “incoherent causal model” I only mean something like “causal model that has no clear mapping back to a distribution over real worlds” (e.g. where different parts of the model assume that [kite exists] has different probabilities). Agreed that the models LCDT would use are coherent in their own terms. My worry is, as you say, along garbage-in-garbage-out lines.
Having LCDT simulate HCH seems more plausible than its taking useful action in the world—but I’m still not clear how we’d avoid the LCDT agent creating agential components (or reasoning based on its prediction that it might create such agential components) [more on this here: point (1) there seems ok for prediction-of-HCH-doing-narrow-task (since all we need is some non-agential solution to exist); point (2) seems like a general problem unless the LCDT agent has further restrictions].
Agreed on HCH practical difficulties—I think Evan and Adam are a bit more optimistic on HCH than I am, but no-one’s saying it’s a non-problem. From the LCDT side, it seems we’re ok so long as it can simulate [something capable and aligned]; HCH seems like a promising candidate.
On HCH-simulation practical specifics, I think a lot depends on how you’re generating data / any model of H, and the particular way any [system that limits to HCH] would actually limit to HCH. E.g. in an IDA setup, the human(s) in any training step will know that their subquestions are answered by an approximate model.
I think we may be ok on error-compounding, so long as the learned model of humans is not overconfident of its own accuracy (as a model of humans). You’d hope to get compounding uncertainty rather than compounding errors.