Excited to see this! I’d be most excited about case studies of standards in fields where people didn’t already have clear ideas about how to verify safety.
In some areas, it’s pretty clear what you’re supposed to do to verify safety. Everyone (more-or-less) agrees on what counts as safe.
One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.
Are there examples of standards in other industries where people were quite confused about what “safety” would require? Are there examples of standards that are specific enough to be useful but flexible enough to deal with unexpected failure modes or threats? Are there examples where the standards-setters acknowledged that they wouldn’t be able to make a simple checklist, so they requested that companies provide proactive evidence of safety?
One of the biggest challenges with AI safety standards will be the fact that no one really knows how to verify that a (sufficiently-powerful) system is safe. And a lot of experts disagree on the type of evidence that would be sufficient.
While overcoming expert disagreement is a challenge, it is not as big a challenge as you might think. TL;DR: deciding not to agree is always an option.
To expand on this: in a safety standards creation process, for standards that aim to define a certain level of safe-enough, the fallback option is as follows. If the experts involved cannot agree on any evidence-based method for verifying that a system X is safe enough according to the level of safety required by the standard, then the standard being created will simply, and usually implicitly, declare that there is no route by which system X can comply with it. If you are required by law, say by EU law, to comply with the safety standard before shipping a system into the EU market, then your only legal option is to never ship that system X into the EU market.
For AI systems you interact with over the Internet, this ‘never ship’ translates to ‘never allow it to interact over the Internet with EU residents’.
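To make that fallback concrete, here is a minimal toy sketch of the decision rule such a safe-enough standard implicitly encodes (my own illustration; the names and structure are invented, not taken from any actual standard):

```python
from enum import Enum

class Verdict(Enum):
    CAN_COMPLY = "an agreed evidence-based verification route exists"
    NO_ROUTE = "no route to compliance; the system cannot legally ship into the EU"

def compliance_route(agreed_verification_methods: list) -> Verdict:
    """Toy model of the fallback: if the experts cannot agree on any
    evidence-based method for verifying the required level of safety,
    the standard implicitly offers no route by which the system can comply."""
    if not agreed_verification_methods:
        return Verdict.NO_ROUTE
    return Verdict.CAN_COMPLY

# A system for which the experts could not agree on any verification method:
print(compliance_route([]))                         # Verdict.NO_ROUTE
# A system with at least one agreed verification method:
print(compliance_route(["agreed test protocol"]))   # Verdict.CAN_COMPLY
```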
I am currently in the JTC21 committee, which is running the above standards creation process to write the AI safety standards in support of the EU AI Act, the Act that will regulate certain parts of the AI industry if they want to ship legally into the EU market. (Legal detail: if you cannot comply with the standards, the Act gives you several other options that may still allow you to ship legally, but I won't explain all of those here. These other options do not give you a loophole to evade all expert scrutiny.)
Back to the mechanics of a standards committee: if a certain AI technology, when applied in a system X, is well known to make that system radioactively unpredictable, it will not usually take long for the technical experts in a standards committee to agree that there is no way they can define any method in the standard for verifying that X will be safe according to the standard. The radioactively unsafe cases are the easiest ones to handle.
That being said, in all but the most trivial safety engineering fields, there are complicated epistemics involved in deciding when something is safe enough to ship, whether you use standards or not. I have written about this topic, in the context of AGI, in section 14 of this paper.
I agree that, at least for the more serious risks, there doesn’t seem to be consensus on what the mitigations should be.
For example, I'd be interested to know what proportion of alignment researchers would consider an AGI that is a value learner (and of course has some initial model of human values, created by humans, to start that value-learning process from) to have better outer-alignment safety properties than an AGI with a fixed utility function created by humans.
For me it is very clear that the former is better, as it incentivizes the AGI to converge from its initial model of human values towards true human values, allowing it to fix problems when the initial model, say, goes out of distribution or lacks sufficient detail. But I have no idea how much consensus there is on this, and I see a lot of alignment researchers working on approaches that don't appear to assume that the AI system is a value learner.
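To illustrate the distinction I have in mind, here is a toy sketch (purely illustrative; the classes and numbers are invented, not a claim about how any real system is built): a fixed-utility agent scores outcomes with a frozen human-written function, while a value learner treats the same human-written model as an initial estimate and keeps revising it from human feedback, so it can correct the model when it turns out to be missing cases.

```python
# Toy, hand-written initial model of 'human values': outcome -> score.
initial_value_model = {"helpful": 1.0, "harmful": -1.0}

class FixedUtilityAgent:
    """Scores outcomes with a frozen, human-written utility function."""
    def __init__(self, utility):
        self.utility = dict(utility)  # never updated again

    def score(self, outcome):
        # An outcome missing from the fixed model gets a default score forever.
        return self.utility.get(outcome, 0.0)

class ValueLearnerAgent:
    """Starts from the same initial model, but revises it from human feedback."""
    def __init__(self, initial_model, learning_rate=0.5):
        self.model = dict(initial_model)
        self.learning_rate = learning_rate

    def score(self, outcome):
        return self.model.get(outcome, 0.0)

    def observe_feedback(self, outcome, human_judgement):
        # Move the stored estimate toward the observed human judgement.
        current = self.model.get(outcome, 0.0)
        self.model[outcome] = current + self.learning_rate * (human_judgement - current)

# 'deceptive' is missing from the initial model (out-of-distribution for it).
fixed = FixedUtilityAgent(initial_value_model)
learner = ValueLearnerAgent(initial_value_model)
learner.observe_feedback("deceptive", -1.0)
print(fixed.score("deceptive"), learner.score("deceptive"))  # 0.0 vs -0.5
```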
My suspicion is that the most instructive cases to look at (modern AI really is too new a field to have much to go on in terms of mature safety standards) are found in how the regulation of nuclear and radiation safety has evolved over time. Early research suggested some serious x-risks that thankfully didn't pan out, either for scientific reasons (igniting the atmosphere) or for logistical/political ones (cobalt bombs, Tsar Bomba-scale H-bombs), but some risks arising more from the political domain (having a big gnarly nuclear war anyway) still exist that could certainly make this a less fun planet to live on. I suspect the successes and failures of the nuclear treaty system could be instructive here, given the push to integrate big AI into military hierarchies: regulating nukes is something almost everyone agrees is a very good idea, yet it has had a less than stellar history of compliance.
They are likely out of scope for whatever your goal is here, but I do think they need serious study, because without it, our attempts at regulation will just push unsafe AI to less savory jurisdictions.
One additional example I know of, which I don't have personal experience with but know that a lot of people do, is compliance with PCI DSS (for credit card processing), which deals with safety in an adversarial setting where the threat model isn't super clear.
(my interactions with it look like “yeah that looks like a lot and we can outsource the risky bits to another company to deal with? great!”)
A high-level theme that would be interesting to explore here is rules-based vs. principles-based regulation. For example, the UK financial regulators are more principles-based (broad principles of good conduct, flexible and open to interpretation). In contrast, the US is more rules-based (detailed and specific instructions). https://www.cfauk.org/pi-listing/rules-versus-principles-based-regulation
[Edit—on further investigation this seems to be a more UK-specific point; US regulations are much less ambiguous as they take a rules-based approach unlike the UK’s principles-based approach]
It’s interesting to note that financial regulations sometimes possess a degree of ambiguity and are subject to varying interpretations. It’s frequently the case that whichever institution interprets them most stringently or conservatively effectively establishes the benchmark for how the regulation is understood. Regulators often use these stringent interpretations as a basis for future clarifications or refinements. This phenomenon is especially observable in newly introduced regulations pertaining to emerging forms of fraud or novel technologies.
This seems great!
I submitted a proposal but did not receive a confirmation that it was received. Perhaps I should submit again?
We got it! You should get an update within a week.