Joe asked me in this comment about the incoherence he sees in LCDT's world models. To illustrate his point on incoherence, Joe gives a kite example:
Let’s say I initially model you as having p = 1⁄3 of each option, based on your expectation of my actions.
Now I decide to burn your kite.
What should I imagine will happen? If I burn it, your kite pointers are dangling.
Do the [Move kite left] and [Move kite right] actions become NOOPs?
Do I assume that my [burn kite] action fails?
My take is that there is indeed a problem that ‘your kite pointers are
dangling’ in the projection that the LCDT world model will compute, so
the projected world will be somewhat weird.
In my mental picture of the most obvious way to implement LCDT and the
structural functions attached to the LCDT model, the projection will be weird
in the following way. After [burn kite], the action [Move kite left],
when applied to the world state produced by [burn kite], will produce a world state where the
human is miming that they are flying a kite. They will make the right
gestures to move an invisible kite left, they might even be holding a
kite rope when making the gestures, but the rope will not be connected
to an actual kite.
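To make this concrete, here is a minimal toy sketch of the projection I have in mind. The node structure, mechanisms, and names below are entirely my own illustrative assumptions, not code or definitions from the LCDT post:

```python
import random

# Toy structural causal model of the kite scenario.
PRIOR_ACTIONS = ["burn kite", "move kite left", "move kite right"]  # p = 1/3 each

def kite_state(my_action):
    # Physical mechanism: downstream of my action via a non-agent path,
    # so LCDT leaves this causal link intact.
    return "burned" if my_action == "burn kite" else "intact"

def human_action(kite_as_seen):
    # The human's decision node: they fly the kite if they see one.
    return "move kite left" if kite_as_seen == "intact" else "stand around"

def project_lcdt(my_action):
    # LCDT cuts the link from my decision to the human's decision node,
    # so the human's node is evaluated against the prior over my actions,
    # not against the action actually being scored.
    physical_kite = kite_state(my_action)
    kite_in_humans_input = kite_state(random.choice(PRIOR_ACTIONS))
    return physical_kite, human_action(kite_in_humans_input)

random.seed(0)
print(project_lcdt("burn kite"))
# With probability 2/3 this prints ('burned', 'move kite left'): the human
# makes the gestures for moving a kite that my action has already destroyed.
```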
So this is weird. However, I would not call it ‘incoherent’ or
‘requiring a contradiction’ as Joe does:
I cannot coherently assume that the agent has a distribution over action sets that it does not have: this requires a contradiction in my world model.
The phrasing ‘contradiction in the world model’ evokes the concern
that the LCDT-constructed world model might crash or not be solvable
when we use it to score the action [burn kite]. But a nice feature of
causal models, even counterfactual ones as generated by LCDT, is that
they will never crash: they will always compute a future reward score
for any possible candidate action or policy. The score may, however, be
weird. There is a potential GIGO problem here.
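A self-contained illustration of the ‘never crashes’ point, again with made-up mechanisms and a made-up reward function of my own invention: evaluating the projection and scoring it is just ordinary function evaluation, so some number always comes out.

```python
def project(my_action):
    # Compressed version of the projection sketched earlier: the physical
    # kite reflects my action, the human's decision does not (link cut).
    kite = "burned" if my_action == "burn kite" else "intact"
    human = "move kite left"  # the human acts on the prior (kite intact)
    return kite, human

def score(world):
    # Illustrative reward (my assumption): the human happily flying a kite.
    return 1.0 if world == ("intact", "move kite left") else 0.0

for action in ["burn kite", "do nothing"]:
    print(action, "->", score(project(action)))
# Every candidate action gets a well-defined number; nothing crashes.
# The numbers can still be garbage if the projected worlds are garbage.
```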
The word ‘incoherent’ evokes the concern that the model will be so
twisted that we should expect weird scores to be computed more
often than not. If so, the agent actions computed may be ineffective,
strangely inappropriate, or even dangerous when applied to the
real world.
In other words: garbage world model in, garbage agent decision out.
One specific worry discussed
here
is that a counterfactual model may output potentially dangerous
garbage because it pushes the inputs of the structural functions being
used way out of training distribution.
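A quick sketch of that failure mode, with a fitted polynomial as a stand-in for a learned structural function (the setup is my own, purely for illustration):

```python
import numpy as np

# A structural function learned from data can behave arbitrarily badly when
# a counterfactual edit pushes its input far outside the training range.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)          # training inputs all lie in [0, 1]
y = np.sin(2 * np.pi * x)           # true mechanism on-distribution
coeffs = np.polyfit(x, y, deg=5)    # 'learned' structural function

print(np.polyval(coeffs, 0.5))      # in-distribution: a sensible value near 0
print(np.polyval(coeffs, 10.0))     # way out of distribution: explodes
```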
That being said, there can be advantages to imperfection too. If we
design just the right kind of ‘garbage’ into the agent’s world model,
we may be able to suppress certain dangerous agent incentives, while
still having an agent that is otherwise fairly good at doing the job we intend
it to do. This is what LCDT is doing, for certain agent jobs, and it
is also what my counterfactual planning agent designs
here are doing,
for certain other agent jobs.
Still, it is clear (from the comments, and I think also from the
original post) that most feel that applying LCDT does not produce useful outcomes
for all possible jobs we would want agents to do. Notably, when
applied to a decision making problem where the agent has to come up
with a multi-step reward-maximizing policy/plan, i.e. a typical MDP or
RL benchmark problem, LCDT will produce an agent with hugely
impaired planning ability. How hugely impaired will depend in part on the prior
used.
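The impairment is easy to see in miniature. In the toy two-step problem below (my own construction, not an example from the post), the agent is rewarded only if both of its own decisions pick the same option; LCDT cuts the link from the current decision to the agent's own future decision node and substitutes the prior, so it cannot fully coordinate with itself:

```python
import itertools

# Reward 1 only if both steps pick "right": coordinating with your own
# future decision is essential here.
def reward(a1, a2):
    return 1.0 if (a1, a2) == ("right", "right") else 0.0

actions = ["left", "right"]
prior = {a: 0.5 for a in actions}   # prior over the future decision node

# A standard planner optimizes over the joint plan and finds the value-1 plan.
best_plan = max(itertools.product(actions, actions),
                key=lambda plan: reward(*plan))

# An LCDT-style planner cuts the link to its own future decision and
# evaluates the current action against the prior instead.
def lcdt_value(a1):
    return sum(prior[a2] * reward(a1, a2) for a2 in actions)

print(best_plan)                              # ('right', 'right'): value 1.0
print(max(actions, key=lcdt_value),
      max(lcdt_value(a) for a in actions))    # 'right', but valued at only 0.5
# A prior that puts less mass on "right" would make the plan look even worse,
# which is why the size of the impairment depends on the prior used.
```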
Evan’s take is that he is not too concerned with this, as he has other
agent applications in mind:
an LCDT agent should still be perfectly capable of tasks like simulating HCH
i.e. we can apply LCDT when building an imitation learner, which is
different from a reinforcement learner. In the argmax HCH examples
above, the agent furthermore is not imitating a human mentor who is
present in the real agent environment, but a simulated mentor built up
out of simulated humans consulting simulated humans.
On a philosophical thought-experiment level, this combination of LCDT
and HCH works for me; it is even elegant. But in applied safety
engineering terms, I see several risks with using HCH. For example,
if the learned model of humans that the agent uses in HCH calculations
is not perfect, then the recursive nature of HCH might amplify these
imperfections rather than dampen them, producing outcomes that are
very much unaligned. Also, on a more moral-philosophical point: might
all these simulated humans become aware that they live in a
simulation, and if so, will they then seek to take sweet revenge on the
people who put them there?
Back to the topic of incoherence. Joe also asks:
Specifically, do you think the issue I’m pointing at is a difference between LCDT and counterfactual planners? (or perhaps that I’m just wrong about the incoherence??)
I see LCDT agents as a subset of all possible counterfactual planning
agent architectures, so in that sense there is no difference.
However, in my
sequence and
paper on counterfactual planning, I construct planning worlds by using
quite different world model editing steps than those considered in
LCDT. These different steps produce different results in terms of the
weirdness or garbage-ness of the planning world model.
The editing step I use in the main examples of counterfactual
planning is to edit the real world model into a planning world model that
has a different agent compute core in it, while leaving the physical world outside
of the compute core unchanged. Specifically, the planning world
models I considered do not accurately depict the software running inside the agent
compute core; instead, they depict a compute core running different software.
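Here is a toy rendering of that editing step (again my own illustration, not code from the sequence or paper): every physical mechanism is shared between the two models, and only the function at the compute core node differs.

```python
def physics(state, action):
    # Physical mechanism, identical in the real-world and planning-world models.
    return state + (1 if action == "add" else -1)

def real_core(state):
    # Models the software actually running in the deployed compute core.
    return "add"

def candidate_core(state):
    # Different software in the same core: the counterfactual edit.
    return "add" if state < 3 else "subtract"

def rollout(core, state=0, steps=5):
    # The same physics is rolled forward in both models; only the core differs.
    for _ in range(steps):
        state = physics(state, core(state))
    return state

print(rollout(real_core))       # real-world projection: 5
print(rollout(candidate_core))  # planning-world projection: 3
```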
In terms of plausibility and internal consistency, a compute core
running different software is more plausible/coherent than what can
happen in the models constructed by LCDT.
As I currently understand things, I believe that CPs are doing planning in a counterfactual-but-coherent world, whereas LCDT is planning in an (intentionally) incoherent world—but I might be wrong in either case.
You are right in both cases, at least if we picture coherence as a sliding scale,
not as a binary property. It also depends on the world model you
start out with, of course.
Thanks, that’s interesting. [I did mean to reply sooner, but got distracted]
A few quick points:
Yes, by “incoherent causal model” I only mean something like “causal model that has no clear mapping back to a distribution over real worlds” (e.g. where different parts of the model assume that [kite exists] has different probabilities). Agreed that the models LCDT would use are coherent in their own terms. My worry is, as you say, along garbage-in-garbage-out lines.
Having LCDT simulate HCH seems more plausible than its taking useful action in the world—but I’m still not clear how we’d avoid the LCDT agent creating agential components (or reasoning based on its prediction that it might create such agential components) [more on this here: point (1) there seems ok for prediction-of-HCH-doing-narrow-task (since all we need is some non-agential solution to exist); point (2) seems like a general problem unless the LCDT agent has further restrictions].
Agreed on HCH practical difficulties—I think Evan and Adam are a bit more optimistic on HCH than I am, but no-one’s saying it’s a non-problem. From the LCDT side, it seems we’re ok so long as it can simulate [something capable and aligned]; HCH seems like a promising candidate.
On HCH-simulation practical specifics, I think a lot depends on how you’re generating data / any model of H, and the particular way any [system that limits to HCH] would actually limit to HCH. E.g. in an IDA setup, the human(s) in any training step will know that their subquestions are answered by an approximate model.
I think we may be ok on error-compounding, so long as the learned model of humans is not overconfident of its own accuracy (as a model of humans). You’d hope to get compounding uncertainty rather than compounding errors.