Suppose there are two worlds, world W1 and world W2.
In world W1, the question Q=“Is there a diamond in the room?” is commonly understood to mean Q1=“Is there actually a diamond in the room?”
In world W2, the question Q=“Is there a diamond in the room?” is commonly understood to mean Q2=“Do I believe there is a diamond in the room?”
Neither world knows how to construct a situation where these two questions come apart, so both produce identical training sets for ELK. But the simulator is also trained on a corpus of science fiction novels containing descriptions of impossible situations where Q1 and Q2 differ, and those novels are different in the two worlds.
Is ELK required to answer appropriately in both worlds (i.e., answer Q1 when given Q in W1, and Q2 when given Q in W2)? If so, it seems we need some term in the loss outside of the training set to make this happen.
Alternatively, would it be satisfactory to find a solution that doesn’t discriminate which world it is in, and instead returns “yes” to Q if and only if the answer to Q1 is “yes” AND the answer to Q2 is “yes”? This means that in world W1 there will be some situations where the answer to Q is “no” even though the diamond is present, but no situations where the answer is “yes” while the diamond is absent.
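The asymmetry of this conservative rule can be illustrated with a toy sketch (the function name and boolean encoding are illustrative, not anything from the ELK setup itself):

```python
# Toy sketch of the conservative answering rule: answer "yes" to Q
# only when both interpretations would be answered "yes".
# q1: is the diamond actually in the room?
# q2: does the human believe the diamond is in the room?

def conservative_answer(q1: bool, q2: bool) -> bool:
    """Answer Q affirmatively iff both Q1 and Q2 are affirmative."""
    return q1 and q2

# Normal case: diamond present and believed present.
assert conservative_answer(True, True) is True

# Sensor tampering: the human believes the diamond is there, but it
# isn't. The rule never says "yes" here, so no false positives in W1.
assert conservative_answer(False, True) is False

# Diamond present but the human doubts it: a false negative in W1.
assert conservative_answer(True, False) is False
```

This makes the trade-off explicit: the rule sacrifices some correct “yes” answers in W1 in exchange for never affirming Q when the diamond is absent.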
I’d like to get different answers in those two worlds. That definitely requires having some term in the loss that is different in W1 and W2. There are three ways the kinds of proposals in the doc can handle this:
1. Consistency checks will behave differently in W1 and W2. Even if a human can never produce different answers to Q1 and Q2, they can talk about situations where Q1 and Q2 differ and describe how the answers to those questions relate to all the other facts about the world (and to the answer to Q).
2. If language is rich enough, and we are precise enough with the formulation of questions, then you may hope that lots of other questions have different interpretations in W1 and W2, i.e. such that the simplest way of answering other questions will generalize correctly to Q.
3. In the case of amplification/debate, Q2 = “Does a human with AI assistants believe a diamond is in the room?” and so we can hope that in fact Q1 and Q2 have the same answers in all situations. (Though we aren’t optimistic about this.)
We generally assume that we can construct questions sufficiently well that there’s only one unambiguous interpretation. We also generally assume that the predictor “knows” which world it’s in, because it can predict how humans would respond to hypothetical questions about various situations involving diamonds and sensors, and because humans would say that, in theory, Q1 and Q2 could be different.
More concretely, our standard for judging proposals is exhibiting an unambiguous failure. If it were plausible that you had asked the wrong question, or that the AI didn’t know what you meant by the question, then the exhibited failure would be ambiguous. Likewise, if humans are unable to distinguish between two possible interpretations of their question, the failure would be ambiguous.
>We also generally assume that the predictor “knows” which world it’s in because it can predict how humans would respond to hypothetical questions about various situations
This seems like it doesn’t disambiguate between the conditions assumed in a question being true and the human merely believing them. E.g., the predictor could predict that when asked “The camera is hacked so it looks like this [camera feed making it seem like the diamond is still there], and the diamond is in the robber’s pocket; is the diamond really in the room?”, the human will answer “No!”, not by understanding that by “diamond really in the room” the human means that the diamond is really in the room, but just by modeling the human as believing the premise of the question (that the diamond is in the pocket).
Edit:
To elaborate, this condition on counterexamples is given in the ELK document:
“The model understands the question. One sufficient condition is that the model can predict human answers to essentially arbitrary hypothetical questions in order to clarify the meaning of terms.”
I basically don’t see how this condition constrains anything about the predictor. It seems like all it really says is that the predictor knows how humans talk. I don’t see how it can be specifying that the AI’s beliefs about how humans answer questions are related to reality, other than on the training set, where we assume that the human talk matches reality.

I also don’t see how it makes sense to think of this as the model “understanding the question”. Normally I’d think of “understanding the question” as meaning “can have the same question”. To have a question, you need a role that an answer could fulfill. But if the predictor is organized as, e.g., a giant low-level Bayes net, then there’s no role that could be filled by an answer to “where’s the diamond”. There might be such a role induced by how the rest of the AI makes use of the predictor, but that seems contingent, and in any case it’s not about the predictor (I think ELK is supposed to make sense with just the predictor?).