One concern might be that creating copies/counterparts instrumentally could be very useful for automating AI safety research. Perhaps one can get around this by making copies up front that AIs can use for their AI safety research. However, a misaligned AI might then be able to replace “making copies” with “moving existing copies”. Is it possible to make a firm distinction between what we need for automating AI safety research and the behavior we want to eliminate?
Alec Harris
Teaching Models to Dream of Better Monitors through Evaluation Conditioned Training
Finally, notice that there seems to be little risk involved in specifying the class of counterparts more broadly than is needed, given that there are few circumstances in which an agent needs to prefer to create some counterparts but not others in order to be useful.
Could a risk of specifying counterparts too broadly be that the agent is incentivized to kill existing counterparts in order to create more aligned counterparts? For example, Alice’s AI might want to kill Bob’s AI (considered a counterpart) so that Alice’s AI can replicate itself and do twice as much work without altering the number of counterparts at later timesteps.
I could see it being the case that Alice’s AI would not want to do this because, after killing Bob’s AI, the incentive to create a copy would be lost. However, if it is able to pre-commit to making a copy, it may still want to. Also, a certain implementation of the POSC + POST combo might help avoid this scenario. I am not exactly sure what was meant by “copy timeslices”.
In general, I wonder about the open question of specifying counterparts and the Nearest Unblocked Strategy problem that you describe. It might be deceptively tricky to specify these things in the same way that human feedback seems to be deceptively difficult to properly operationalize.
It seems to me that a potential limitation of POST-Agency might be impediment avoidance. Take the “Work or Steal” example from Section 14. The agent might choose to work rather than steal if it believes that stealing is likely to be punished by jail time (a risk distinct from shutdown).
Similarly, if the agent believes a human is in the way of where a paperclip factory should be, it might send a killer drone to remove the human. If other humans would take down the killer drone, this presents the possibility of further impediments. Thus, the agent may scheme to take countermeasures in advance to minimize this impediment. In order to minimize the cost of dealing with the impediment, it may choose to hide its scheming from humans.
More generally, the utility-maximizing world states of a misaligned AI over long trajectories will still likely be bad and will therefore still involve modeling some kind of human resistance. Although the agent will be unconcerned with avoiding early shutdown, utility-maximizing actions for minimizing the cost of human resistance may overlap heavily with shutdown resistance.
It also seems possible to me that the model relearns shutdown resistance as a generalization of impediment avoidance. It may avoid shutdown “just for fun” because “it enjoys being wary of potential impediments.”
I had a thought about a potential reward function for Neutrality+. For each batch you would:
1. Run the agent in the environment many times to build a dataset of episodes. The environment would have ways of getting shut down interwoven with ways of getting reward.
2. For each trajectory length, select 10 episodes from the dataset in which the agent was shut down at that trajectory length.
3. Sum the reward across all of the selected episodes (10 × the number of trajectory lengths); that sum would be the reward for the batch.
The idea is that the agent’s reward will be computed under a guarantee of having reached each trajectory length a constant number of times, so it will have no incentive to affect the likelihood of any given trajectory length, but it will be incentivized to increase its reward given any trajectory length. A rough sketch of this base scheme in code is below.
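Here is a minimal sketch of the base scheme, assuming a hypothetical `Episode` record with a `length` field (the timestep at which the agent was shut down) and a `total_reward` field; the names and the `K = 10` constant are illustrative, not taken from any existing codebase:

```python
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    length: int          # trajectory length at which the agent was shut down
    total_reward: float  # reward accumulated before shutdown


K = 10  # episodes selected per trajectory length (the "10" above)


def batch_reward(dataset: list[Episode]) -> float:
    """Base scheme: sum reward over exactly K episodes per trajectory length.

    Because every trajectory length contributes the same number of episodes,
    the agent gains nothing by shifting probability between lengths; it is
    only rewarded for earning more conditional on each length.
    """
    by_length: dict[int, list[Episode]] = defaultdict(list)
    for ep in dataset:
        by_length[ep.length].append(ep)

    total = 0.0
    for eps in by_length.values():
        chosen = random.sample(eps, K)  # assumes >= K episodes of each length
        total += sum(ep.total_reward for ep in chosen)
    return total
```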
Adjustments:
1. Change the value 10 to constants that can differ for each trajectory length. These constants would reflect the relative weighting of the utility for that trajectory length and also the priority of sampling data for that trajectory length. Additional constants can be added to make the sum a weighted sum (these constants would affect only the relative weighting of the utility), and they could be played off against the others so that both the optimal data-sampling priority and the optimal utility weighting for each trajectory length could be achieved.
2. Change the constant number of episodes per trajectory length to a probability of sampling an episode of that length. Then simply choose a random sample in which each episode has a constant probability of coming from a particular trajectory length. This would keep the batch size from exploding as a function of the number of potential trajectory lengths. A sketch of this adjusted version follows below.
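A sketch of the adjusted version, reusing the `Episode` type and imports from the sketch above; the per-length sampling probabilities `p` and utility weights `w` are hypothetical parameters standing in for the constants described in the adjustments:

```python
def weighted_batch_reward(dataset: list[Episode],
                          p: dict[int, float],  # sampling probability per trajectory length
                          w: dict[int, float],  # utility weight per trajectory length
                          batch_size: int) -> float:
    """Adjusted scheme: each sampled episode's length is drawn with a fixed
    probability p[length], keeping the batch size bounded, and its reward is
    weighted by w[length]. The expected number of samples per length does not
    depend on the agent's behavior, so the neutrality argument still applies
    in expectation, while w sets the relative weighting of utility across lengths.
    """
    by_length: dict[int, list[Episode]] = defaultdict(list)
    for ep in dataset:
        by_length[ep.length].append(ep)

    lengths = list(p)
    probs = [p[length] for length in lengths]

    total = 0.0
    for _ in range(batch_size):
        length = random.choices(lengths, weights=probs, k=1)[0]
        ep = random.choice(by_length[length])  # assumes >= 1 episode of each sampled length
        total += w[length] * ep.total_reward
    return total
```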
I am curious as to how this scheme lines up with your plan for testing Neutrality+. My reading was that the plan is closer to building training for the Ramsey Yardstick into DReST, but I couldn’t quite work out how I would do that.
We do this! Footnote 4 has the results.
Yes, technically MST is just recovering lost capability that was destroyed during fine-tuning on biased news articles here. The framing we are taking is something like, “Imagine all your articles are egregiously biased, but you still want to train article generation. What do you do?” The failure mode is exaggerated to show an effect with low compute. In future experiments we would like to actually elicit latent capabilities relative to an OOTB (out-of-the-box) model. It’s worth noting, though, that in any case we will never be able to prove higher capabilities with MST than one could have gotten otherwise, because the act of proving implies evaluation abilities that MST assumes you don’t have. Proof-of-concept experiments will require us to pretend we don’t have access to useful data that we do actually have, so that we can use that data for testing.
We don’t include the OOTB model as a true baseline because, in this case, the OOTB model has already been trained on a level of bias lower than the level we are pretending we have access to.
We are interpolating on the political spectrum (between right and left) and extrapolating on the bias spectrum (from higher bias to lower bias). This may be a pattern we want to emulate as we apply MST in general, as having the monitor labels be interpolations plausibly, in a sense, improves the extrapolations. For example, maybe if we cover a large portion of “natural language space” with the monitor labels, we can make “interpolations” within that space well defined. I don’t have a well-formalized way of thinking about this yet, but it seems relevant, so we may work on it!
Strongly agree. The next round of experiments will have this.
This seems interesting! Maybe not semantic vs non-semantic, but arbitrary vs meaningful.
Thank you for these comments; I found them quite interesting/helpful. It seems like you have a good understanding of what we are trying to do, which is great.