Chr Cz

Karma: 4

Chr Cz’s Shortform

Chr Cz20 Dec 2025 17:32 UTC

1 point

3 comments1 min readLW link

Chr Cz 20 Dec 2025 10:50 UTC
5 points
0
on: Chr Cz’s Shortform
I have 2 questions on a video about the paper „CTRL Z: Controlling agents via resampling“.
I would love if someone experienced could educate me on some thoughts I have on the issue.
1. Is resampling as a safety design component safe? If we ask potentially adversarial AI to „try again“, are we not giving it information and at risk of actually training them to become a better malicious actor?
2. Is the papers mentioned risk of collusion between trusted and untrusted AI via secret signals an indication for bad safety mechanism architecture? The risk presupposes—unless I am mistaken—that both AIs have some level of awareness of each other. Should such a scenario not be avoided in the first place?
3. Building on the premise of the paper: Would it be an improvement to give the resampling task to a different instance of the untrusted AI? My hope would be that this could open up possibilities to better understand the origin and nature of the behaviors.
I apologize if these thoughts are basic and have been answered before. I am just very interested in this area of AI safety. If the quality of my thoughts isn’t what this platform is meant for, I would appreciate if someone could recommend me some reading or resources so I can learn more, I am specifically interested in the architecture of AI safety mechanisms and what the best methods are that we have today plus open questions or problems.