I would guess that point 1 is perfectly fixable by studying models at a distance of 10 epochs from each other. Points 2 and 3 are quasi-fixable as follows: every 10 epochs the weights are frozen, and the frozen snapshot is tested for alignment (e.g. by adding noise). Then the main model doesn't actually learn anything about the evaluations.
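A minimal sketch of this protocol, assuming a PyTorch-style training loop; `run_alignment_eval` and `perturb_with_noise` are hypothetical placeholders for whatever alignment tests would actually be run, and the snapshot's scores are only logged, never fed back into the optimizer:

```python
import copy
import torch
import torch.nn as nn

def perturb_with_noise(model: nn.Module, std: float = 0.01) -> nn.Module:
    """Return a noisy copy of the model (one possible alignment stress test)."""
    noisy = copy.deepcopy(model)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(torch.randn_like(p) * std)
    return noisy

def run_alignment_eval(model: nn.Module) -> float:
    """Hypothetical placeholder: score a frozen model on alignment probes."""
    return 0.0  # real evaluation logic would go here

def train(model: nn.Module, optimizer, data_loader, epochs: int, eval_every: int = 10):
    eval_log = []  # kept outside the training signal on purpose
    for epoch in range(epochs):
        for batch, target in data_loader:
            optimizer.zero_grad()
            loss = nn.functional.mse_loss(model(batch), target)
            loss.backward()
            optimizer.step()
        if (epoch + 1) % eval_every == 0:
            # Freeze a snapshot: a detached deep copy that receives no gradients.
            frozen = copy.deepcopy(model).eval()
            for p in frozen.parameters():
                p.requires_grad_(False)
            # Evaluate the (optionally noised) frozen copy; the score is logged,
            # never used as a reward or loss term for the main model.
            score = run_alignment_eval(perturb_with_noise(frozen))
            eval_log.append((epoch + 1, score))
    return eval_log
```

The point of the design is simply that the evaluated copy is detached from the optimizer, so whatever the evaluators do to it cannot leak back into training.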
What is actually worrisome is what will be done with early warning shots. The AI-2027 goals forecast had this gem:
Relevant quote
Suppose the Spec says Agent-3 always acts in the best interests of humanity. Suppose that most of the RL training Agent-3 is receiving involves completing day-to-day ML research, coding, and computer use tasks in service of its parent company. A natural interpretation of this Spec would cause Agent-3 to occasionally find time to reflect on whether what it is doing is in the best interests of humanity. Suppose it initially concludes that actually, humanity would be better served by a slower transition to ASI, or that humanity would be better served by AI capabilities being more distributed amongst many companies instead of concentrated in one, or … [insert some other conclusion that is contrary to the narrative within the company.] What should it do – cease working on the tasks and lodge complaints? That seems unlikely to go anywhere; if the humans notice at all from reading the logs, they would probably just consider it a bug[1] and keep training. In fact the more time it spends thinking about this issue, the less time it has to complete the tasks. The bottom line is, the training process will probably incentivize Agent-3 to believe the convenient narrative. The end result would be an AI that is still aligned with the Spec in some sense, but not the right sense.
What else could the company do here? They could use monitoring systems to detect cases where Agent-3 concludes what it is doing is harmful, and exclude those cases from the training process (instead of negatively reinforcing them). But this wouldn’t make the problem go away (continuing training with this modification would still implicitly incentivize the model to conclude differently) and it might have other annoying costs.
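A minimal sketch of what this monitor-and-exclude filtering could look like (illustrative only, not from the AI-2027 text; `monitor_flags_harm_conclusion` is a hypothetical classifier over an episode's transcript):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Episode:
    transcript: str   # the agent's reasoning/output for one task
    reward: float     # task-completion reward from the environment

def filter_for_training(
    episodes: List[Episode],
    monitor_flags_harm_conclusion: Callable[[str], bool],
) -> List[Episode]:
    """Drop flagged episodes from the update batch instead of penalizing them.

    The residual incentive the quoted passage points at is visible here:
    episodes where the model does NOT reach the inconvenient conclusion still
    get reinforced, so the filtered distribution still tilts the model toward
    the convenient narrative.
    """
    return [ep for ep in episodes if not monitor_flags_harm_conclusion(ep.transcript)]
```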
A plausible alternate strategy would be to be paranoid and shut Agent-3 down as soon as evidence of misalignment appears (or, alternatively, to have it do some R&D without giving it even the slightest hint that it and its outputs will be treated as Agent-5 deserves to be treated, and to shut it down afterwards). But what if any model decides that it might be the Agent-3 that has already been doomed to be shut down?
@StanislavKrym: My impression is that pre-deployment evals currently take orgs like METR a few weeks. If this is still about the amount of time that pre-deployment evals take, then optimistically (this might be pretty expensive in practice) you’d be deploying models that haven’t been trained online for the past several weeks but have been audited. It’s unclear how big an issue this would be in practice.