The AI-2027-like failure mode is the following.
Agent-3 is misaligned before developing any goals aside from optimizing reward. Since Agent-3 isn't rebelling, monitoring Agent-3 is useless, and monitoring Agent-4 is entrusted to Agent-3, which is embarrassed to admit Agent-4's misalignment and/or has too little compute, is overconfident in its own research, etc.
Returning to compute, the share of compute spent on running the internal agents is 6%, while the entire share of compute spent on alignment is 3% of 63%, i.e. roughly a third as much.
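A quick check of that ratio, taking the 6% and the 3%-of-63% figures at face value:

$$0.03 \times 0.63 \approx 0.019 = 1.9\%, \qquad \frac{6\%}{1.9\%} \approx 3.2,$$

so alignment gets roughly a third of the compute spent on running the internal agents.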
However, there might be a workaround: e.g., training Agent-3 and Agent-4 to be good teachers to their predecessors (which I proposed here), so that the predecessors, and not Agent-4, understand the techniques used by Agent-4 and apply them to something known to be aligned.