I like your effort to think holistically about the sociotechnical systems we are embedded in and the impacts we should expect AI to have on those systems.
I have a couple of minor critiques of the way you are breaking things down that I think could be improved.
First, a meta point: there is a general pattern here of being a bit too black & white in describing very complicated sets of things. This is nice because it makes it easier to reason about complicated situations, but it risks oversimplifying and leading to seemingly strong conclusions which don't actually follow from the true reality. The devil is in the details, as they say.
Efforts to-date have largely gravitated into the two camps of value alignment and governance.
I don’t think this fully describes the set of camps. I think that these are two of the camps, yes, but there are others.
My breakdown would be:
Governance—Using regulation to set up patterns of behavior so that AI will be used and developed in safe rather than harmful ways. Forcing companies to internalize their externalities (e.g. risks to society). Preventing, and enforcing against, human-misuse-of-AI scenarios. Attempting to regulate novel technologies which arise from accelerated R&D as a result of AI. Setting up preventative measures to detect and halt rogue AI, or human-misused AI, in the act of doing bad things before the worst consequences can come to pass. Preventing acceleration spirals of recursive self-improvement from proceeding so rapidly that humanity becomes intellectually eclipsed and loses control over its destiny.
Value alignment—getting the AIs to behave as much as possible in accordance with the values of humanity generally. Getting the AI to be moral / ethical / cautious about harming people or making irreversible changes with potentially large negative consequences. Ideally, if an AI were 'given free rein' to act in the world, we'd want it to act in ways which were win-win for itself and humanity, and, no matter what, to err on the side of not harming humanity.
Operator alignment—technical methods to get the AI to be obedient to the instructions of the operators. To make the AI behave in accordance with the intuitive spirit of their instructions (‘do what I mean’) rather than like an evil genie which follows only the letter of the law. Making the AI safe and intuitive to use. Avoiding unintended negative consequences.
Control—finding ways to ensure that the operators of AI can maintain control over the AIs they create, even if a given AI gets built wrong such that it tries to behave in harmful, undesirable ways (out of alignment with its operators). This involves things like technical methods of sandboxing new AIs and thoroughly safety-testing them within the sandbox before deploying them. Once deployed, it involves making sure you retain the ability to shut them off if something goes wrong, and making sure the model's weights don't get exfiltrated by outside actors or by the model itself. It also involves having good cybersecurity, employee screening, and internal infosec practices so that hackers/spies can't steal your model weights, design docs, and code.
A minor nitpick:
Sociotechnical system/s (STS) A system in which agents (traditionally, people) interact with objects (including technologies) to achieve aims and fulfill purposes
Not sure if objects is the right word here, or rather, not sure if that word alone is sufficient. Maybe objects and information/ideas/concepts? Much of the work I've been doing recently involves examining what potential risks might arise from AI systems capable of rapidly integrating technical information from a large set of sources. This is not exactly making new discoveries, but rather putting disparate pieces of information together in such a way as to create a novel recipe for a technology. In general, this is a wonderful ability. In the specific case of weapons of mass destruction, it's a dangerous ability.
Nested STS / Human STS
Yes, historically, all STS have been human STS. But novel AI agents could, in addition to integrating into and interacting with human STS, form their own entirely independent STS. A sufficiently powerful amoral AGI would see human STS as useless if it could make its own that served its needs better. Such a scenario would likely turn out quite badly for humans. This is the concept of “the AI doesn’t hate you, it’s just that humans and their ecosphere are made of atoms that the AI has preferred uses for.”
This doesn't contradict your ideas, it just suggests an expansion of the possible avenues of risk which should be guarded against. Self-replicating AI systems in outer space, burrowed into the crust of the earth, or roaming the ocean seabeds will likely be quite dangerous to humanity sooner or later, even if they have no interaction with our STS in the short term.
Brilliant! Thanks for these insightful comments, Nathan. I’ll endeavour to address them in a follow-up post.