Hmm, I’m sure smarter people than me have thought about this. But I worry about the pros & cons of publicizing the specific rate at which model outputs are monitored. Although it seems models lack the ability to control their CoT and (I have preliminary evidence) struggle to evade probes, there’s a world where models (a) are aware they are being monitored, (b) increase in ability to pursue long-term strategic objectives and (c) wish to evade the monitors. This might lead to a catastrophic failure. To that end, it’s not clear to me how much making posts like this public contributes to (a). I guess a sufficiently capable model would assume that everything’s being monitored anyway and work around that? In that case, publishing things like this seems to do little harm. Nonetheless, I’m curious how you weighed the balance here.
Afaik other labs are doing worse on monitoring in terms of resourcing, coverage and capability. On balance it seems more important to try to make this an industry norm.