unclear whether outcome-based training “leaking” into the chain of thought (because of parameter sharing / short-to-long generalization) is good or bad for safety:
on the one hand, it makes scheming less likely
on the other hand, it can make CoT less monitorable