I have a few questions on a video about the paper "Ctrl-Z: Controlling agents via resampling".
I would love it if someone experienced could educate me on some thoughts I have on the issue.
Is resampling as a safety design component safe? If we ask a potentially adversarial AI to "try again", are we not giving it information, and at risk of actually training it to become a better malicious actor?
Is the paper's mentioned risk of collusion between the trusted and untrusted AI via secret signals an indication of a flawed safety-mechanism architecture? The risk presupposes—unless I am mistaken—that both AIs have some level of awareness of each other. Should such a scenario not be avoided in the first place?
Building on the premise of the paper: Would it be an improvement to give the resampling task to a different instance of the untrusted AI? My hope would be that this could open up possibilities to better understand the origin and nature of the behaviors.
I apologize if these thoughts are basic and have been answered before. I am just very interested in this area of AI safety. If the quality of my thoughts isn't what this platform is meant for, I would appreciate it if someone could recommend some reading or resources so I can learn more. I am specifically interested in the architecture of AI safety mechanisms: the best methods we have today, plus open questions and problems.
Thanks for the questions; these are all very reasonable. You might also find this video informative. I agree with Mis-Understandings's answers. Re 3, you might enjoy this post.
For 1: resampling is not asking the AI to "try again"; it is forcibly rewriting its history so that it made a different one of its plausible decisions. Every resampled continuation could plausibly have been the original continuation given a different RNG seed, and there is no non-shared place to put a scratchpad that would pass along information about the resampling event. Resampling probably does increase the risk that you resample into a less suspicious bad path, but that is outweighed by the fact that the AI has to mostly bide its time (in order to not get caught in the fraction of early requests that get thorough monitoring; this is a previously established important result), so on net it is a gain.
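To make the mechanics concrete, here is a minimal sketch of the resampling loop described above. Everything here (`sample_action`, `suspicion`, the threshold, the toy action names) is an illustrative assumption of mine, not the paper's implementation; the point is only that each branch is drawn from the same policy conditioned on the same shared history, so no branch can tell it was a resample.

```python
import random

def sample_action(history, seed):
    """Stand-in for drawing one action from the untrusted policy.

    A real policy would condition on `history`; the key property is
    that every seed yields a plausible continuation of that same
    history, and nothing in the output records that a resample occurred.
    """
    rng = random.Random(seed)
    return rng.choice(["safe_edit", "benign_refactor", "odd_network_call"])

def suspicion(action):
    """Stand-in for the trusted monitor's suspicion score in [0, 1]."""
    return 0.9 if action == "odd_network_call" else 0.1

def resample_step(history, threshold=0.5, max_resamples=3):
    """If a sampled action looks suspicious, rewrite history: draw a
    fresh sample from the *same* policy on the *same* pre-resample
    history. Branches share only that history and the weights."""
    for seed in range(max_resamples + 1):
        action = sample_action(history, seed)
        if suspicion(action) < threshold:
            return action
    # Every resample still looked suspicious: escalate instead of executing.
    return "defer_to_human"

print(resample_step(["task: update config"]))
```

The `defer_to_human` fallback reflects the broader control-protocol idea that repeated suspicious samples are themselves evidence, worth escalating rather than silently retrying forever.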
For 2: the threat model in this paper assumes the untrusted model behaves optimally given its actual ability to do sneaky bad stuff. Computing that optimum requires knowing the monitoring protocol, so the untrusted AI must know about the monitor. There is no version of this model in which the untrusted AI is unaware of the monitoring setup, because if the model genuinely believes there is no monitoring, it just attacks immediately and loses. And if it thinks there is monitoring, it is really hard to trick it about what the monitoring strategy actually is.
For 3: no, you want to use the same model, since you can already prevent the resampled branches from knowing about each other (basically they share only the history from before the resampling point, and the weights). That isolation is essentially what you would want from another "instance". The modeling assumption is that there is exactly one untrusted AI, which wins if any path breaks out, so there is no meaningfully different untrusted AI to ask to try from that starting point.