“AI alignment theory” is meant as an overarching term to cover the whole research field associated with this problem, including, e.g., the much-debated attempt to estimate how rapidly an AI might gain in capability once it goes over various particular thresholds.
Other terms that have been used to describe this research problem include “robust and beneficial AI” and “Friendly AI”. The term “value alignment problem” was coined by Stuart Russell to refer to the primary subproblem of aligning AI preferences with (potentially idealized) human preferences.
Some alternative terms for this general field of study, such as ‘control problem’, can sound adversarial, as if the rocket is already pointed in a bad direction and you need to wrestle with it. Other terms, like ‘AI safety’, understate the advocated degree to which alignment ought to be an intrinsic part of building advanced agents. E.g., there isn’t a separate theory of “bridge safety” for how to build bridges that don’t fall down. Pointing the agent in a particular direction ought to be seen as part of the standard problem of building an advanced machine agent. The problem does not divide into “building an advanced AI” and then separately “somehow causing that AI to produce good outcomes”; the problem is “getting good outcomes via building a cognitive agent that brings about those good outcomes”.
My personal view is that given all of this history and the fact that this forum is named the “AI Alignment Forum”, we should not redefine “AI Alignment” to mean the same thing as “Intent Alignment”. I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul’s (probably unintentional) overloading of “AI alignment” with the new and narrower meaning (in Clarifying “AI Alignment”), and we should fix that error by collectively going back to the original definition, or in some circumstances where the risk of confusion is too great, avoiding “AI alignment” and using some other term like “AI x-safety”. (Although there’s an issue with “existential risk/safety” as well, because “existential risk/safety” covers problems that aren’t literally existential, e.g., where humanity survives but its future potential is greatly curtailed. Man, coordination is hard.)
I feel like to the extent there is confusion/conflation over the terminology, it was mainly due to Paul’s (probably unintentional) overloading of “AI alignment” with the new and narrower meaning (in Clarifying “AI Alignment”)
I don’t think this is the main or only source of confusion:
MIRI folks also frequently used the term in the narrower sense. I think the first time I saw “aligned” was in Aligning Superintelligence with Human Interests from 2014 (scraped by the Wayback Machine on January 3, 2015), which says: “We call a smarter-than-human system that reliably pursues beneficial goals ‘aligned with human interests’ or simply ‘aligned.’”
Virtually every problem people discussed as part of AI alignment was also part of intent alignment. The name was deliberately chosen to evoke “pointing” your AI in a direction. Even in the linked post Eliezer uses “pointing the AI in the right direction” as a synonym for alignment.
It was proposed to me as a replacement for the narrower term AI control, which quite obviously doesn’t include all the broader stuff. In the email thread where Rob suggested I adopt it, he said it referred to what Nick Bostrom called the “second principal-agent problem” between AI developers and the AI they build.
the overarching research topic of how to develop sufficiently advanced machine intelligences such that running them produces good outcomes in the real world
I want to emphasize again that this definition seems extremely bad. A lot of people think their work helps AI actually produce good outcomes in the world when run, so pretty much everyone would think their work counts as alignment.
It includes all work in AI ethics, if in fact that research is helpful for ensuring that future AI has a good outcome. It also includes everything people work on in AI capabilities, if in fact capability increases improve the probability that a future AI system produces good outcomes when run. It’s not even restricted to safety, since it includes realizing more upside from your AI. It includes changing the way you build AI to help address distributional issues, if the speaker (very reasonably!) thinks those are important to the value of the future. I didn’t take this seriously as a definition and didn’t really realize anyone was taking it seriously; I thought it was just an instance of speaking loosely.
But if people are going to use the term this way, I think at a minimum they cannot complain about linguistic drift when “alignment” means anything at all. Obviously people are going to disagree about what AI features lead to “producing good outcomes.” Almost every time I see a definitional argument, people (including Eliezer) are objecting that “alignment” includes too much stuff and should be narrower, but this is obviously not going to be improved by adopting an absurdly broad definition.
Other relevant paragraphs from the Arbital post: