Here’s an attempt at condensing an issue I’m currently hung up on with ELK. It also serves as a high-level summary, which I’d welcome poking at in case I’m getting important parts wrong.
The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels—what actually gets observed subsequently—whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)
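To make the two labelling channels concrete, here’s a minimal sketch of that dataset structure. The field names, the `Triple` class, and the trivially permissive rater are all my own illustrative choices, not anything from the report:

```python
from dataclasses import dataclass

# Toy sketch of the dataset being labelled (all names are hypothetical).
@dataclass
class Triple:
    observation: str            # what was observed before acting
    action: str                 # the proposed action
    predicted_observation: str  # what the AI predicts will be observed next

def accuracy_label(triple: Triple, actually_observed: str) -> int:
    """Automated label: did the prediction match what was actually observed?"""
    return 1 if triple.predicted_observation == actually_observed else 0

def goodness_label(triple: Triple, human_rater) -> int:
    """Judgement label: a rater decides whether the action is good,
    based only on the observations in the triple."""
    return human_rater(triple)

# Purely illustrative usage, with a rater that approves everything.
t = Triple("vault looks fine", "do nothing", "vault looks fine")
print(accuracy_label(t, "vault looks fine"), goodness_label(t, lambda tr: 1))
```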
The basic problem is partial observability: the observations don’t encapsulate “everything that’s going on”, so the labeller can’t distinguish good states from bad states that look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may therefore end up preferring bad states that look good over good states, for two reasons: controlling the observation is easier than controlling the rest of the state, and directly predicting which observations will get positive labels is easier than what we’d want instead, namely inferring which states the positive labels are being attributed to and trying to produce those states.
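A minimal toy sketch of the partial-observability failure, using the diamond-in-a-vault example and a made-up state space of my own (the specific states and sensors are assumptions, not from the report):

```python
# Toy sketch: states are hidden, the labeller only sees observations, so
# labels attach to what the observations *look like*, not to the state.

# Hypothetical state space: the labeller cares whether the diamond is really
# in the vault, but only ever sees the camera feed.
STATES = {
    "diamond_in_vault":          {"camera_shows_diamond": True,  "good": True},
    "diamond_stolen_screen_up":  {"camera_shows_diamond": True,  "good": False},
    "diamond_stolen":            {"camera_shows_diamond": False, "good": False},
}

def observe(state_name: str) -> bool:
    """Partial observability: the observation is only the camera feed."""
    return STATES[state_name]["camera_shows_diamond"]

def human_label(observation: bool) -> int:
    """The labeller can only condition on the observation, not the state."""
    return 1 if observation else 0

# An AI optimising actions for positive labels is indifferent between the
# first two states: both earn label 1, even though only one is good.
for name in STATES:
    print(name, "label:", human_label(observe(name)),
          "actually good:", STATES[name]["good"])
```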
The issue I’m hung up on currently is what seems like a conflation of two problems that may be worth distinguishing.
Problem 1 is that the observations might be misleading evidence. There’s some good state that produces the same observations as some bad state. If the labeller knew they were in the bad state they’d give a negative label, but they can’t tell. Maybe their prior favours the good state, so they assume that’s what they’re seeing and give a positive label.
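A toy numerical version of Problem 1, treating the labeller as a Bayesian with a made-up prior (all numbers are illustrative assumptions):

```python
# Problem 1 as misleading evidence: the good and bad states produce the same
# observation, so the posterior just follows the prior, and the labeller's
# prior happens to favour the good state.

# Hypothetical prior over states (made-up numbers).
prior = {"good_state": 0.95, "bad_state_that_looks_good": 0.05}

# Both states produce the "looks good" observation with certainty.
likelihood_looks_good = {"good_state": 1.0, "bad_state_that_looks_good": 1.0}

evidence = sum(prior[s] * likelihood_looks_good[s] for s in prior)
posterior = {s: prior[s] * likelihood_looks_good[s] / evidence for s in prior}

# The observation carries no information distinguishing the two states, so
# the labeller assumes the good state and gives a positive label.
label = 1 if posterior["good_state"] > 0.5 else 0
print(posterior, "label:", label)
```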
Problem 2 is that the labeller doesn’t understand the state that produced the observations. In this case I have to be a bit more careful about what I mean by “states”. For now, I’m talking about ways the world could be that the labeller understands well enough to answer questions about what’s important to them, e.g., a state resolves a question like “is the diamond still present?” for the labeller. Problem 2 is that there are ways the world can be that do not resolve such questions for the labeller. Further judgement, deliberation, and understanding are required to determine what the answer should be in these strange worlds. In this case, the labeller will probably produce a label for the state they understand that’s most compatible with the observations, or they’ll be too confused by the observations and conservatively give a negative label. The AI may then optimise for worlds that are deeply confusing, where the only thing we can grasp about what’s going on is that the observations look great.
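A sketch of Problem 2 under the same toy assumptions as above (the specific fallback behaviours are my guesses at how a labeller might respond, not claims from the report):

```python
# Problem 2 as a sketch: some ways the world can be don't resolve the
# labeller's questions at all. "None" below stands for "the question
# 'is the diamond still present?' has no answer the labeller could give
# without further deliberation".

UNDERSTOOD_STATES = {
    "diamond_in_vault": True,   # resolves the question: yes
    "diamond_stolen":   False,  # resolves the question: no
}

def resolve_question(state_name: str):
    """True/False for states the labeller understands, None otherwise."""
    return UNDERSTOOD_STATES.get(state_name)  # None for strange worlds

def label(state_name: str, observation_looks_good: bool, conservative: bool) -> int:
    answer = resolve_question(state_name)
    if answer is not None:
        return 1 if answer else 0
    # Strange world: either fall back on whatever understood state is most
    # compatible with the observations, or conservatively label it bad.
    if conservative:
        return 0
    return 1 if observation_looks_good else 0

# A deeply confusing state with great-looking observations gets a positive
# label under the non-conservative fallback.
print(label("diamond_replaced_by_convincing_replica?",
            observation_looks_good=True, conservative=False))
```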
I think the focus on narrow elicitation in the report is about restricting attention to Problem 1 and setting Problem 2 aside. Is that right? Either way, if we restrict to Problem 1, I claim there’s hope in the fact that the labeller can in principle understand what’s actually going on, and it’s just a matter of showing them some additional observations to expose it. That’s what I’d try to figure out how to incentivise. But I’d want to do so without having to worry about the confusing things coming out of Problem 2, and hope to deal with that problem separately.
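To illustrate that hope for Problem 1 with the same toy states as earlier (the extra sensor is a hypothetical example of an “additional observation”, not a proposal for what it would actually be):

```python
# The hope for Problem 1: because the labeller can in principle understand
# what's going on, there exists some additional observation that separates
# the good state from the bad state that merely looks good.

# Hypothetical extra sensor, e.g. a second camera angle behind the screen.
EXTRA_OBS = {
    "diamond_in_vault":         True,   # extra view also shows the diamond
    "diamond_stolen_screen_up": False,  # extra view reveals the screen
}

def label_with_extra(state_name: str) -> int:
    """Labeller shown the additional, disambiguating observation."""
    return 1 if EXTRA_OBS[state_name] else 0

# With the extra observation, the bad state no longer gets a positive label.
for name in EXTRA_OBS:
    print(name, "label:", label_with_extra(name))
```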
(If it might help I think I could give more of a formalisation of these problems. I think the natural language description above is probably clearer for now though.)