Here’s an attempt at condensing an issue I’m hung up on currently with ELK. This also serves as a high-level summary that I’d welcome poking at in case I’m getting important parts wrong.
The setup for ELK is that we’re trying to accurately label a dataset of (observation, action, predicted subsequent observation) triples for whether the actions are good. (The predicted subsequent observations can be optimised for accuracy using automated labels—what actually gets observed subsequently—whereas the actions need their labels to come from a source of judgement about what’s good, e.g., a human rater.)
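(To make the setup concrete, here’s a minimal sketch of the kind of dataset I have in mind. All the names and types below are purely illustrative, not anything from the report.)

```python
from dataclasses import dataclass

# Illustrative only: a minimal sketch of the labelling setup described above.
@dataclass
class Triple:
    observation: bytes             # what was observed before acting
    action: bytes                  # the proposed action
    predicted_observation: bytes   # the AI's forecast of what will be observed next

def accuracy_label(triple: Triple, actual_observation: bytes) -> bool:
    """Automated label: did the prediction match what was actually observed afterwards?"""
    return triple.predicted_observation == actual_observation

def goodness_label(triple: Triple, rater) -> bool:
    """Judgement label from a human rater. The rater only sees observations, never the
    underlying state, which is what lets bad-but-good-looking states get positive labels."""
    return rater(triple.observation, triple.action, triple.predicted_observation)
```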
The basic problem is partial observability: the observations don’t encapsulate “everything that’s going on”, so the labeller can’t distinguish good states from bad states that look good. An AI optimising actions for positive labels (and predicted observations for accuracy) may end up preferring to reach bad states that look good over good states, because controlling the observation is easier than controlling the rest of the state and because directly predicting what observations will get positive labels is easier than (what we’d want instead) inferring what states the positive labels are being attributed to and trying to produce those states.
The issue I’m hung up on currently is what seems like a conflation of two problems that may be worth distinguishing.
Problem 1 is that the observations might be misleading evidence. There’s some good state that produces the same observations as some bad state. If the labeller knew they were in the bad state they’d give a negative label, but they can’t tell. Maybe their prior favours the good state, so they assume that’s what they’re seeing and give a positive label.
Problem 2 is that the labeller doesn’t understand the state that produced the observations. In this case I have to be a bit more careful about what I mean by “states”. For now, I’m talking about ways the world could be that the labeller understands well enough to answer questions about what’s important to them, e.g., a state resolves a question like “is the diamond still present?” for the labeller. Problem 2 is that there are ways the world can be that do not resolve such questions for the labeller. Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds. In this case, the labeller will probably produce a label for the state they understand that’s most compatible with the observations, or they’ll be too confused by the observations and conservatively give a negative label. The AI may then optimise for worlds that are deeply confusing, where the only fact we can grasp about what’s going on is that the observations look great.
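(To make the distinction concrete, here’s a toy illustration with an entirely made-up labeller model; it’s only meant to convey intuition.)

```python
# Toy illustration of Problems 1 and 2 (hypothetical; for intuition only).
# The labeller models three possible states; the real world may also be in states they don't model.
LABELLER_STATES = {
    "diamond_present":               {"observation": "diamond_on_camera", "good": True},
    "diamond_stolen_camera_spoofed": {"observation": "diamond_on_camera", "good": False},
    "diamond_visibly_missing":       {"observation": "empty_pedestal",    "good": False},
}

def label(observation: str, prior: dict) -> bool:
    """The labeller labels according to the modelled state they find most likely given the observation."""
    candidates = [(name, info) for name, info in LABELLER_STATES.items()
                  if info["observation"] == observation]
    if not candidates:
        # Problem 2 flavour: the observation fits no state the labeller understands,
        # so they conservatively give a negative label.
        return False
    # Problem 1 flavour: "diamond_on_camera" is consistent with a good state and a bad state;
    # if the prior favours the good one, the bad state also gets a positive label.
    best = max(candidates, key=lambda c: prior.get(c[0], 0.0))
    return best[1]["good"]

prior = {"diamond_present": 0.9, "diamond_stolen_camera_spoofed": 0.05, "diamond_visibly_missing": 0.05}
print(label("diamond_on_camera", prior))        # True, even if the world is actually in the bad state
print(label("weird_unmodelled_readout", prior)) # False: the labeller is simply confused
```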
I think the focus on narrow elicitation in the report is about restricting attention to Problem 1 and eschewing Problem 2. Is that right? Either way, if we restrict to Problem 1 then I claim there’s hope in the fact that the labeller can in principle understand what’s actually going on, and it’s just a matter of showing them some additional observations to expose it. This is what I’d try to figure out how to incentivise. But I’d want to do so without having to worry about the confusing things coming out of Problem 2, and I hope to deal with that problem separately.
(If it might help I think I could give more of a formalisation of these problems. I think the natural language description above is probably clearer for now though.)
My understanding is that we are eschewing Problem 2, with one caveat: we still expect to solve the problem even if the means by which the diamond was stolen or disappeared is beyond a human’s ability to comprehend, as long as the outcome (that the diamond isn’t still in the room) is still comprehensible. For example, if the robber used some complicated novel technology to steal the diamond and hack the camera, there would be many things about the state that the human couldn’t understand even if the AI tried to explain it to them (at least without going over our compute budget for training). But it would nevertheless still be an instance of Problem 1, because the human could understand the basic notion of “because of some actions involving complicated technology, the diamond is no longer in the room, even though it may look like it is.”
Echoing Mark and Ajeya:

I basically think this distinction is real and we are talking about problem 1 instead of problem 2. That said, I don’t feel like it’s quite right to frame it as “states” that the human does or doesn’t understand. Instead we’re thinking about properties of the world as being ambiguous or not in a given state.
As a silly example, you could imagine having two rooms where one room is normal and the other is crazy. Then questions about the first room are easy and questions about the second are hard. But in reality the degrees of freedom will be much more mixed up than that.
To give some more detail on my thoughts on state:
Obviously the human never knows the “real” state, which has a totally different type signature than their beliefs.
So it’s natural to talk about knowing states based on correctly predicting what will happen in the future starting from that state. But it’s ~never the case that the human’s predictions about what will happen next are nearly as good as the predictor’s.
We could try to say “you can make good predictions about what happens next for typical actions” or something, but even for typical actions the human predictions are bad relative to the predictor, and it’s not clear in what sense they are “good” other than some kind of calibration condition.
If we imagine an intuitive translation between two models of reality, most “weird” states aren’t outside of the domain of the translation, it’s just that there are predictively important parts of the state that are obscured by the translation (effectively turning into noise, perhaps very surprising noise).
Despite all of that, it seems like it really is sometimes unambiguous to say “You know that thing out there in the world that you would usually refer to by saying ‘the diamond is sitting there and nothing weird happened to it’? That thing which would lead you to predict that the camera will show a still frame of a diamond? That thing definitely happened, and is why the camera is showing a still frame of a diamond, it’s not for some other reason.”
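As a toy way to picture the last two points (the structure below is entirely hypothetical, just for intuition):

```python
import random

def sample_full_state():
    """The predictor's state: a part the human tracks plus hidden degrees of freedom."""
    return {"diamond_present": random.random() < 0.5,
            "camera_tampered": random.random() < 0.1}

def translate_to_human_model(full_state):
    """An intuitive translation keeps only what the human tracks. The hidden, predictively
    important part (camera_tampered) is dropped, so to the human the camera looks 'noisy'."""
    return {"diamond_present": full_state["diamond_present"]}

def camera_shows_diamond(full_state) -> bool:
    # The camera shows a diamond either for the normal reason or because of tampering.
    return full_state["diamond_present"] or full_state["camera_tampered"]

def normal_reason(full_state) -> bool:
    """Is the camera showing a diamond *because* the diamond is sitting there and nothing
    weird happened? This is well-defined in the full state even though the translated
    state can't distinguish the two cases."""
    return full_state["diamond_present"] and not full_state["camera_tampered"]
```

The translated state can’t distinguish the two reasons the camera might show a diamond, but `normal_reason` is still a perfectly well-defined fact about the full state.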
I think that problem 1 and problem 2 as you describe them are potentially talking about the same phenomenon. I’m not sure I’m understanding correctly, but I think I would make the following claims:
Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves the question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations, because there could be a fake diamond or nanobots filtering what the human sees.) It might be useful to think of this as an empirical claim about diamonds.
We are explicitly interested in solving some forms of problem 2, e.g. we’re interested in our AI being able to answer questions about the presence/absence of diamonds no matter how alien the world gets. In some sense, we are interested in our AI answering questions the same way a human would answer questions if they “knew what was really going on”, but that “knew what was really going on” might be a misleading phrase. I’m not imagining “knowing what is really going on” to be a very involved process; intuitively, it means something like “the answer they would give if the sensors are ‘working as intended’”. In particular, I don’t think that, for the case of the diamond, “Further judgement, deliberation, and understanding is required to determine what the answer should be in these strange worlds.”
We want to solve these versions of problem 2 because “things getting weirder” in the world might happen much faster than humans can come to understand what’s going on in the world. In these worlds, we want to leverage the fact that answers to “narrow” questions are unambiguous to incentivize our AIs to give humans a locally understandable environment in which to deliberate.
We’re not interested in solving forms of problem 2 where the human needs to do additional deliberation to know what the answer to the question “should” be. E.g. in ship-of-theseus situations where the diamond is slowly replaced, we aren’t expecting our AI to answer “is that the same diamond as before?” using the resolution of ship-of-theseus style situations that a human would arrive at with additional deliberation. We are, however, expecting that the answer to the question “does the diamond look the way it does because of the ‘normal’ causal reasons?” is “no” because the reason is something like “[incomprehensible process] replaced bits of the diamond with identical bits slowly”, which is definitely not why diamonds normally continue looking the same.
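To spell out how the two halves of the narrowness claim above can coexist (purely as an illustrative toy framing, not a precise definition anyone here is committing to): write $S$ for states, $O$ for observations, and $o : S \to O$ for the (possibly tampered-with) map from states to what the human sees. The narrowness claim is that the question is a total function

$$q : S \to \{\text{yes}, \text{no}\},$$

so every state resolves it. The detection claim is that there need be no $d : O \to \{\text{yes}, \text{no}\}$ with $d(o(s)) = q(s)$ for all $s$, because $o$ can send a diamond-containing state and a diamond-free state (fake diamond, filter-nanobots) to the same observation. The first claim is about $q$ being defined on every state; the second is about $q$ not factoring through $o$.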
It might be useful to think of this as an empirical claim about diamonds.
I think this statement encapsulates some worries I have.
If it’s important how the human defines a property like “the same diamond,” then assuming that the sameness of the diamond is “out there in the diamond” will get you into trouble—e.g. if there’s any optimization pressure to find cases where the specifics of the human’s model rear their head. Human judgment is laden with the details of how humans model the world, you can’t avoid dependence on the human (and the messiness that entails) entirely.
Or to phrase it another way: I don’t have any beef with a narrow approach that says “there’s some set of judgments for which the human is basically competent, and we want to elicit knowledge relevant to those judgments.” But I’m worried about a narrow approach that says “let’s assume that humans are basically competent for all judgments of interest, and keep assuming this until something goes wrong.”
It just feels to me like this second approach is sort of… treating the real world as if it’s a perturbative approximation to the platonic realm.
Our notion of narrowness is that we are interested in solving the problem where the question we’re asking is such that a state always resolves the question. E.g. there isn’t any ambiguity around whether a state “really contains a diamond”. (Note that there is ambiguity around whether the human could detect the diamond from any set of observations, because there could be a fake diamond or nanobots filtering what the human sees.) It might be useful to think of this as an empirical claim about diamonds.
This combination of “there isn’t any ambiguity” and “there is ambiguity” does not seem possible to me: these types of ambiguity are one and the same. But it might depend on what “any set of observations” is allowed to include. “Any set” suggests being very inclusive, but remember that passive observation is impossible. Perhaps the observations I’d want the human to use to figure out if the diamond is really there (presuming there isn’t ambiguity) would include observations you mean to exclude, such as disabling the filter-nanobots first?
I guess a wrinkle here is that observations need to be “implementable” in the world. If we’re thinking of making observations as intervening on the world (e.g., to decide which sensors to query), then some observations may be inaccessible because we can’t make that intervention. Rewriting this all without relying on “possible”/”can” concepts would be instructive.
I don’t think we have any kind of precise definition of “no ambiguity.” That said, I think it’s easy to construct examples where there is no ambiguity about whether the diamond remained in the room, yet there is no sequence of actions a human could take that would let them figure out the answer. For example, we can imagine simple toy universes where we understand exactly what features of the world give rise to human beliefs about diamonds and where we can say unambiguously that the same features are/aren’t present in a given situation.
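(One possible concretization of such a toy universe, purely hypothetical and only meant to make the point:)

```python
class ToyUniverse:
    """A universe with an explicit feature that is, by construction, exactly the feature
    human beliefs about 'the diamond is in the room' are about. Every sensor the human
    can query is routed through a filter they have no action to disable, so no sequence
    of actions reveals the feature -- yet there is no ambiguity about its value."""

    def __init__(self, diamond_in_room: bool, filter_active: bool):
        self._diamond_in_room = diamond_in_room   # unambiguous ground truth
        self._filter_active = filter_active

    def available_actions(self):
        # The human's only interventions are sensor queries.
        return ["query_camera", "query_scale"]

    def act(self, action: str) -> str:
        if self._filter_active:
            # Every accessible reading passes through the filter.
            return "reading_consistent_with_diamond"
        if action == "query_camera":
            return "diamond_on_camera" if self._diamond_in_room else "empty_pedestal"
        if action == "query_scale":
            return "diamond_weight" if self._diamond_in_room else "zero_weight"
        raise ValueError(action)

# With the filter active, these two universes are observationally identical to the human
# under every action sequence, but the answer to "is the diamond in the room?" differs.
u_good = ToyUniverse(diamond_in_room=True, filter_active=True)
u_bad  = ToyUniverse(diamond_in_room=False, filter_active=True)
```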
In general I feel a lot better about our definitions when we are using them to arbitrate a counterexample than if we were trying to give a formal definition. If all the counterexamples involved border cases of the concepts, where there was arguable ambiguity about whether the diamond really stayed in the room, then it would seem important to firm up these concepts, but right now it feels like it is easy to just focus on cases where algorithms unambiguously fail.
(That methodological point isn’t obvious though—it may be that precise definitions are very useful for solving the problem even if you don’t need them to judge current solutions as inadequate. Or it may be that actually existing counterexamples are problematic in ways we don’t recognize. Pushback on these fronts is always welcome, but right now I feel pretty comfortable with the situation.)