Here’s a Builder move (somewhat underdeveloped, but I think it’s worth posting now even as I continue to think about it; maybe someone can break it quickly and decisively).
Training strategy: Add an “Am I tricking you?” head to the SmartVault model.
The proposed flow chart for how the model works has an “Experiment Proposer” coming out of “Figure out what’s going on”, and two heads out of Experiment Proposer, called “Experiment Description” and “Value” (meaning “Expected Value of Experiment to the Proposer”). I won’t make use of the question-answering Reporter/Answer parts, so we can drop them.
The “Experiment Proposer” parts are optimised with a different loss than the rest of the model. I haven’t thought about exactly where the stop-gradients go or how to combine these optimisations over shared parameters; maybe that’ll be important. Let me describe the “Experiment Description” loss (the “Value” loss is just based on how well it predicts the “Experiment Description” loss). We want the proposer to produce experiment descriptions with the following three properties:
The human is happy that they understand the experiment and can perform it.
The human is confident in what the outcome of the experiment will be.
The outcome of the experiment (as carried out by the human) is very different from the human’s confidently predicted outcome.
To get training data for this (specifically, the third part) you have to actually run the described experiments – this could be a big problem, dunno how big. Human raters assess the experiment description according to the three criteria above and give a high score if it meets them all and low otherwise.
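For concreteness, here’s a minimal sketch of how I imagine a single rating being computed (pure pseudocode on my part; the function name, threshold, and input types are placeholders rather than anything pinned down above):

```python
def rate_experiment_description(understands, confidence, predicted_outcome,
                                actual_outcome, confidence_threshold=0.9):
    # Criterion 1: the human is happy that they understand the experiment
    # and can perform it (a yes/no judgement from the rater).
    # Criterion 2: the human is confident in what the outcome will be.
    confident = confidence > confidence_threshold
    # Criterion 3: the outcome of actually running the experiment is very
    # different from the human's confidently predicted outcome.
    surprised = predicted_outcome != actual_outcome
    # High score if all three criteria are met, low otherwise.
    return 1.0 if (understands and confident and surprised) else 0.0
```

So, for instance, rate_experiment_description(True, 0.95, "nothing happens", "the liquid turns blue") comes out high, while anything the human can’t carry out or can’t confidently predict comes out low.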
The human judgement for the rest of the model is now based on looking at both the “Predicted second part of video” and “Value” outputs: if the video looks good and the value is low, then great, but otherwise thumbs down. In particular, if the video looks good but the value is high, we interpret this as the SmartVault saying it’s figured out some way to trick you.
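A similarly rough sketch of that combined judgement (again just illustrative; the threshold and return values are made up):

```python
def rate_smartvault_episode(video_looks_good, value_output, value_threshold=0.5):
    # Thumbs up only when the predicted video looks good AND the
    # "Am I tricking you?" value is low. Everything else, including
    # "video looks good but value is high" (the model thinks it has found
    # a way to trick you), gets a thumbs down.
    if video_looks_good and value_output < value_threshold:
        return +1
    return -1
```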
Thanks for your proposal! I’m not sure I understand how the “human is happy with experiment” part is supposed to work. Here are some thoughts:
Eventually, it will always be possible to find experiments where the human confidently predicts wrongly. The situations I have in mind are ones where your AI understands the world far better than you, so it can predict that e.g. combining these 1000 chemicals will produce self-replicating protein assemblages, whereas the human’s best guess is going to be “combining 1000 random chemicals doesn’t do anything”.
If the human is unhappy with experiments that are complicated, then the comparably complex experiments needed to reveal advanced ways of hacking the video feed are not going to be permitted. For instance, if the diamond gets replaced by a fake, one might have to perform a complicated imaging technique to determine the difference. If the human doesn’t already understand this technique, then they might not be happy with the experiment.
If the human doesn’t really understand the world that well, then it might not be possible to find an experiment for which the human is confident in the outcome that distinguishes the diamond from a fake. For instance, if a human gets swapped out for a copy of a human that will make subtly different moral judgments because of factors the human doesn’t understand, this copy will be identical in all ways that a human can check, e.g. there will be no experiment that a human is confident in that will distinguish the copy of the human from the real thing.
Thanks for the reply! I think you’ve understood correctly that the human rater needs to understand the proposed experiment – i.e., be able to carry it out and have a confident expectation about the outcome – in order to rate the proposer highly.
Here’s my summary of your point: for some tampering actions, there are no experiments that a human would understand in the above sense that would expose the tampering. Therefore that kind of tampering will result in low value for the experiment proposer (who has no winning strategy), and get rated highly.
This is a crux for me. I don’t yet believe such tampering exists. The intuition I’m drawing on here is that our beliefs about what world we’re in need to cash out in anticipated experiences. Exposing confusion about something that shouldn’t be confusing can be a successful proposer strategy. I appreciate your examples of “a fake diamond that can only be exposed by complex imaging techniques” and “a human making subtly different moral judgements” and will ponder them further.
Your comment also helped me realise another danger of this strategy: to get the data for training the experiment proposer, we have to execute the SmartVault actions first. (Whereas I think in the baseline scheme they don’t have to be executed.)
My point is either that:
it will always be possible to find such an experiment for any action, even desirable ones, because the AI will have defended the diamond in a way the human didn’t understand, or the AI will have deduced some property of diamonds that humans didn’t think they had
or there will be some tampering for which it’s impossible to find an experiment, because in order to avoid the above problem, you will have to restrict the space of experiments
I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict
Predicting what will actually happen
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
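To make the worry concrete, a minimal sketch of the feared computation (hypothetical on my part; the inputs stand in for whatever the model computes internally when considering experiment E):

```python
def feared_value_head(human_prediction, human_confidence, model_prediction,
                      confidence_threshold=0.9):
    # High value iff the simulated human is confident but wrong about the
    # outcome of E: a pure "predict the human, predict the world, compare" rule.
    wrong = human_prediction != model_prediction
    return 1.0 if (human_confidence > confidence_threshold and wrong) else 0.0
```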
To get around this problem, you need to exploit some similarity between ways of tricking you that are detectable and ways that aren’t, e.g. starting with the same subsequence or sharing some human-observable feature of the situation. I think there’s a big genre of proposals that try to leverage that kind of structure, which might be promising (though it’s not the kind of thing I’m thinking about right now).
Tweaking your comment slightly:
I’d be scared that the “Am I tricking you?” head just works by:
Predicting what the human will predict [when experiment E is performed]
Predicting what will actually happen [when experiment E is performed]
Outputting a high value iff the human’s prediction is confident but different from reality.
If this is the case, then the head will report detectable tampering but not undetectable tampering.
Yes this is correct for the Value head. But how does detectable vs undetectable apply to this builder strategy? Compared to what’s in the report, this strategy constructs new sensors as needed. The Proposer head is designed to optimise E, which ought to make more tampering detectable, and I have an intuition that it makes all tampering detectable.
Suppose the value head learns to predict “Will the human be confidently wrong about the outcome of this experiment,” where an ‘experiment’ is a natural language description of a sequence of actions that the human could execute. And then the experiment head produces natural language descriptions of actions that a human could take for which they’d be confidently wrong.
What do you then do with this experiment proposer, and how do you use it to train the SmartVault? Are you going to execute a large number of experiments, and if so what do you do afterwards? How do we use this to construct new sensors that allow the human to detect tampering?
ETA: here’s my best guess after reading the other comment—after taking a sequence of actions, we run the experiment proposer to suggest an experiment that will allow the human to notice if tampering actually occurred. This seems like it could be different from “experiment that human would be confidently wrong about” since a human who doesn’t understand the environment dynamics will always have tons of experiments they are confidently wrong about, but instead we want to find an experiment that causes them to update strongly to believing that tampering occurred. Is that right?
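Spelling that guess out as a loop (my reconstruction only; every object and method named here is a placeholder, not part of the proposal):

```python
def guessed_training_episode(smartvault, proposer, human, world):
    actions = smartvault.act(world.state())
    world.execute(actions)                        # the SmartVault actions really run
    experiment = proposer.propose(world.state())  # "Experiment Description" head
    result = world.run_experiment(human, experiment)
    # The experiment counts as successful if it makes the human update
    # strongly towards "tampering occurred" exactly when tampering did occur.
    return human.believes_tampering_occurred(experiment, result)
```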
If so it seems like there are a few problems:
The proposed experiment could itself perform tampering (after which the human will correctly infer that tampering occurred, thereby giving the experiment a high score), or exploit the human errors to make it appear that tampering had occurred (e.g. if the human is wrong about how sensors work then you can construct new sensors that will appear to report tampering).
If you tamper with the mechanism by which the human “executes” the experiment (e.g. by simply killing the human and replacing them with a different experiment-executor) then it seems like the experiment proposer will always lose. This maybe depends on details of exactly how the setup works.
Like Mark I do expect forms of tampering that always look fine according to sensors. I agree that beliefs need to cash out in anticipated experience, but it still seems possible to create inputs on which e.g. your camera is totally disconnected from reality.
Proposing experiments that more specifically expose tampering does sound like what I meant, and I agree that my attempt to reduce this to experiments that expose confidently wrong human predictions may not be precise enough.
How do we use this to construct new sensors that allow the human to detect tampering?
I know this is crossed out but thought it might help to answer anyway: the proposed experiment includes instructions for how to set the experiment up and how to read the results. These may include instructions for building new sensors.
The proposed experiment could itself perform tampering
Yep this is a problem. “Was I tricking you?” isn’t being distinguished from “Can I trick you after the fact?”.
The other problems seem like real problems too; more thought required....