Though, for the threat model of ‘hazardous knowledge that can be synthesized from many datapoints which are individually innocuous’, this could still be a win: if you remove the ‘hazardous knowledge [that] can be pinpointed to individual training datapoints’, the model may be forced to perform more explicit reasoning, e.g. through CoT, which could be easier to monitor (also see these theoretical papers on the need for CoT for increased expressivity/certain types of problems).
+1