W/r/t 2, young, unsophisticated AIs with mostly human-readable source code require only small amounts of concern to detect “being trained to lie”. Albeit this is only a small amount of concern by serious-FAI-work standards; outside the local cluster, anyone who tries to build this sort of AI in the first place might very well wave their hands and say, “Oh, but there’s no difference between trying to lie with your actions to us and really being friendly, that’s just some anthropomorphic interpretation of this code here” when the utility function has nothing about being nice and being nice is just being done as an instrumental act to get the humans to go along with you while you increase your reward counter. But in terms of serious FAI proposals, that’s just being stupid. I’m willing to believe Paul Christiano when he tells me that his smart International Mathematical Olympiad friends are smarter than this, regardless of my past bad experiences with would-be AGI makers. In any case, it shouldn’t take a large amount of “actual concern and actual willingness to admit problems”, to detect this class of problem in a young AGI; so this alone would not rule out “raise the FAI like a kid” as a serious FAI proposal. Being able to tell the difference between a ‘lying’ young AGI and a young AGI that actually has some particular utility function—albeit not so much / just-only by inspection of actions, as by inspection of code which not only has that utility function but was human-designed to transparently explicitly encode it—is an explicit part of serious FAI proposals.
3 and 4 are the actually difficult parts because they don’t follow from mere inspection of readable source code.
On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it’s seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. “Reflective decision theory”-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the ‘abstract desirable properties’ are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.
On 4: Since humans don’t have introspective access to their own categories and generalizations, figuring out the degree of overlap by staring at their direct representations will fail (you would not know your brain’s spiking pattern for blueness ify ou saw it), and trying to check examples is subject to a 3-related problem wherein you only check a narrow slice of samples (you never checked any cryonics patients or Terry Schiavo when you were checking that the AI knew what a ‘sentient being’ was). I.e., your training set turns out to unfortunately have been a dimensionally impoverished subset of the test set. “Indirect normativity” (CEV-style) proposals try to get at this by teaching the AI to idealize values as being stored in humans, such that observation about human judgments or observation of human brain states will ‘correctly’ (from our standpoint) refine its moral theory; as opposed to trying to get the utility function correct outright.
The anthropomorphic appeal of “raising AIs as kids” doesn’t address 3 or 4, so it falls into the class of proposals that will appear to work while the AI is young, then kill you after it becomes smarter than you. Similarly, due to the problems with 3 and 4, any AGI project claiming to rely solely on 2 is probably unserious about FAI and probably will treat “learning how to get humans to press your reward button” as “our niceness training is working” a la the original AIXI paper, since you can make a plausible-sounding argument for it (or, heck, just a raw appeal to “I know my architecture!”) and it avoids a lot of inconvenient work you’d have to do if you publicly admitted otherwise. Ahem.
It should also be noted that Reversed Stupidity Is Not Intelligence; there’s a lot of stupid F-proposals for raising AIs like children, but that doesn’t mean a serious FAI project tries to build and run an FAI in one-shot without anything analogous to gradient developmental stages. Indirect normativity is complex enough to require learning (requiring an inductive DWIM architecture below, with that architecture simpler and more transparent). It’s >50% probable in my estimate that there’s a stage where you’re slowly teaching running code about things like vision, analogous to a baby stage of a human. It’s just that the presence of such a stage does not solve, and in fact does not even constitute significant progress toward, problems 3 and 4, the burden of which need be addressed by other proposals.
“learning how to get humans to press your reward button” as “our niceness training is working” a la the original AIXI paper,
Quote needed, wasn’t this contested by the author?
On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it’s seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. “Reflective decision theory”-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the ‘abstract desirable properties’ are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.
Humans are able to learn basic human moral concepts with reasonable quantities of data. What is the relevant context change?
3 and 4 seem like the most fatal.
W/r/t 2, young, unsophisticated AIs with mostly human-readable source code require only small amounts of concern to detect “being trained to lie”. Albeit this is only a small amount of concern by serious-FAI-work standards; outside the local cluster, anyone who tries to build this sort of AI in the first place might very well wave their hands and say, “Oh, but there’s no difference between trying to lie with your actions to us and really being friendly, that’s just some anthropomorphic interpretation of this code here” when the utility function has nothing about being nice and being nice is just being done as an instrumental act to get the humans to go along with you while you increase your reward counter. But in terms of serious FAI proposals, that’s just being stupid. I’m willing to believe Paul Christiano when he tells me that his smart International Mathematical Olympiad friends are smarter than this, regardless of my past bad experiences with would-be AGI makers. In any case, it shouldn’t take a large amount of “actual concern and actual willingness to admit problems”, to detect this class of problem in a young AGI; so this alone would not rule out “raise the FAI like a kid” as a serious FAI proposal. Being able to tell the difference between a ‘lying’ young AGI and a young AGI that actually has some particular utility function—albeit not so much / just-only by inspection of actions, as by inspection of code which not only has that utility function but was human-designed to transparently explicitly encode it—is an explicit part of serious FAI proposals.
3 and 4 are the actually difficult parts because they don’t follow from mere inspection of readable source code.
On 3: Knowing that the current execution path of the code seems to be working okay today is very different from strongly constraining future execution paths across hugely different contexts to have desirable properties; this requires abstract thinking on a much higher level than staring at what your AGI is doing right now. The tank-detector works so long as it’s seeing pictures from the training sets in which all tanks are present on cloudy days, but fails when it wanders out into the real world, etc. “Reflective decision theory”-style FAI proposals try to address this by being able to state the desirable properties of the AI in an abstraction which can be checked against abstractions over code execution pathways and even over permitted future self-modifications, although the ‘abstract desirable properties’ are very hard (require very difficult and serious FAI efforts) to specify for reasons related to 4.
On 4: Since humans don’t have introspective access to their own categories and generalizations, figuring out the degree of overlap by staring at their direct representations will fail (you would not know your brain’s spiking pattern for blueness ify ou saw it), and trying to check examples is subject to a 3-related problem wherein you only check a narrow slice of samples (you never checked any cryonics patients or Terry Schiavo when you were checking that the AI knew what a ‘sentient being’ was). I.e., your training set turns out to unfortunately have been a dimensionally impoverished subset of the test set. “Indirect normativity” (CEV-style) proposals try to get at this by teaching the AI to idealize values as being stored in humans, such that observation about human judgments or observation of human brain states will ‘correctly’ (from our standpoint) refine its moral theory; as opposed to trying to get the utility function correct outright.
The anthropomorphic appeal of “raising AIs as kids” doesn’t address 3 or 4, so it falls into the class of proposals that will appear to work while the AI is young, then kill you after it becomes smarter than you. Similarly, due to the problems with 3 and 4, any AGI project claiming to rely solely on 2 is probably unserious about FAI and probably will treat “learning how to get humans to press your reward button” as “our niceness training is working” a la the original AIXI paper, since you can make a plausible-sounding argument for it (or, heck, just a raw appeal to “I know my architecture!”) and it avoids a lot of inconvenient work you’d have to do if you publicly admitted otherwise. Ahem.
It should also be noted that Reversed Stupidity Is Not Intelligence; there’s a lot of stupid F-proposals for raising AIs like children, but that doesn’t mean a serious FAI project tries to build and run an FAI in one-shot without anything analogous to gradient developmental stages. Indirect normativity is complex enough to require learning (requiring an inductive DWIM architecture below, with that architecture simpler and more transparent). It’s >50% probable in my estimate that there’s a stage where you’re slowly teaching running code about things like vision, analogous to a baby stage of a human. It’s just that the presence of such a stage does not solve, and in fact does not even constitute significant progress toward, problems 3 and 4, the burden of which need be addressed by other proposals.
Quote needed, wasn’t this contested by the author?
Humans are able to learn basic human moral concepts with reasonable quantities of data. What is the relevant context change?
Eh? Do you want a more detailed answer than the question might suggest? I thought nigerweiss et al had good responses.
I also don’t see any human culture getting Friendliness-through-AI-training right without doing something horrible elsewhere.