An early draft of a paper I’m writing went like this:
> In the absence of sufficient sanity, it is highly likely that at least one AI developer will deploy an untrusted model: the developers do not know whether the model will take strategic, harmful actions if deployed. In the presence of a smaller amount of sanity, they might deploy it within a control protocol which attempts to prevent it from causing harm.
I had to edit it slightly. But I kept the spirit.
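For concreteness, one control protocol from the AI control literature is trusted monitoring: every output of the untrusted model is scored for suspiciousness by a weaker model you do trust, and the most suspicious outputs are escalated to a human before anything executes. Here's a minimal sketch, where `untrusted_model`, `trusted_monitor`, `human_review`, and `SUSPICION_THRESHOLD` are all hypothetical placeholders rather than anything from the paper:

```python
# Minimal sketch of a trusted-monitoring control protocol.
# All names here are hypothetical placeholders, not an API from any paper.

SUSPICION_THRESHOLD = 0.9  # route only the most suspicious outputs to a human


def run_with_control(task, untrusted_model, trusted_monitor, human_review):
    """Use the untrusted model, but gate its output behind a trusted monitor."""
    proposal = untrusted_model(task)             # capable, but possibly scheming
    suspicion = trusted_monitor(task, proposal)  # weaker trusted model scores risk in [0, 1]
    if suspicion >= SUSPICION_THRESHOLD:
        return human_review(task, proposal)      # escalate: audit before acting on it
    return proposal                              # low suspicion: use the output as-is
```

The threshold matters because human auditing is scarce; a protocol like this spends it only on the tail of outputs the trusted model finds most alarming.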