Big update toward the principal serving a coordinating function only (alwayshasbeen.png).
Subagents will unavoidably operate under their own agency; any design where their agenda is fully set from above would goodhart by definition. The only scenario where there’s non-goodhart coherence seems to be where there’s some sort of alignment between the principal’s agenda and the agency of the subagents.
ETA: The subagent receives the edict of the principal and fills in the details using its own agency. The resulting actions make sense to the extent the subagent has (and uses) “common sense”.
Big update toward the principal serving a coordinating function only (alwayshasbeen.png).
Subagents will unavoidably operate under their own agency; any design where their agenda is fully set from above would goodhart by definition. The only scenario where there’s non-goodhart coherence seems to be where there’s some sort of alignment between the principal’s agenda and the agency of the subagents.
ETA: The subagent receives the edict of the principal and fills in the details using its own agency. The resulting actions make sense to the extent the subagent has (and uses) “common sense”.