Nit: don’t you also need to require that the predicted (and actual) outputs are (apparently, at least) safe? Interpreted literally as written, developers would be allowed to deploy a model if they can reliably predict that it will cause harm.
That’s right. This test isn’t meant to cover “and it’s safe”; it’s meant to cover “and it’s predictable.” I meant to cover this in:
Ah, I see. I still think it’s worth thinking about how developers might try to pass this test (and others) adversarially, in order to deploy a model that they themselves are confident is harmful in at least some ways.
I think such intentional deployments are much less likely to be totally catastrophic compared to accident risk, but still risky, and not totally implausible (they don’t necessarily require extreme malice or misanthropy on the part of developers, just a misguided or negligent view of safety concerns, perhaps driven by motivated reasoning about career or profit potential).