far more plausible than a word meeting the definition without any changing values,
I did not say or deliberately imply that nobody’s values would be changed by hearing an infallibly factual description of future events presented by a transcendent entity. In fact, that kind of experience is so powerful that unverified third-hand reports of it happening thousands of years ago retain enough impact to act as a recruiting tactic for several major religions.
which would require all apparent value disagreements to be illusions and the world not to work in the way it appears to.
Maybe not all, but certainly a lot of apparent value differences really are illusory. In third-world countries, genocide tends to flare up only after a drought leads to crop failures, suggesting that the real motivation is economic and racism is only used as an excuse, or a guide for who to kill without disrupting the social order more than absolutely necessary.
I think this is a lot less impossible than you’re trying to make it sound.
The stuff that people tend to get really passionate about, unwilling to compromise on, isn’t, in my experience, the global stuff. When someone says “I want less A” or “more A” they seem to mean “within range of my senses,” “in the environment where I’m likely to encounter it in the future” or “in my tribe’s territory or the territory of those we communicate with.” An arachnophobe wouldn’t panic upon hearing about a camel-spider three thousand miles away; if anything, the idea that none were on the same continent would be reassuring. An AI capable of terraforming galaxies might satisfy conflicting preferences by simply constructing an ideal environment for each, and somehow ensuring that everyone finds what they’re looking for.
The accurate description of such seemingly-impossible perfection would, in a sense, constitute a ‘fully general mind hack,’ in that it would convince anyone who can be convinced by the truth and satisfy anyone who can be satisfied within the laws of physics. If you know of a better standard, I’d like to hear it.
I’m not sure there is any point in continuing this. Once you allow the AI to optimize the human values it’s supposed to be tested against for test compatibility, it’s over.
If, as you assert, pleasing everyone is impossible, and persuading anyone to accept something they wouldn’t otherwise be pleased by (even through a method as benign as giving them unlimited, factual knowledge of the consequences and allowing them to decide for themselves) is unFriendly, do you categorically reject the possibility of friendly AI?
If you think friendly AI is possible, but I’m going about it all wrong, what evidence would convince you that a given proposal was not equivalently flawed?
I’m not sure there is any point in continuing this.
I’m having some doubts, too. If you decide not to reply, I won’t press the issue.
and persuading anyone to accept something they wouldn’t otherwise be pleased by (even through a method as benign as giving them unlimited, factual knowledge of the consequences and allowing them to decide for themselves) is unFriendly,
No. Only if you allow acceptance to define friendliness. Leaving open the option of changing the definition of friendliness as a way to fulfill the goal defined as friendliness will almost certainly result in unfriendliness. Persuasion is not inherently unfriendly, provided it’s not used to short-circuit friendliness.
If you think friendly AI is possible, but I’m going about it all wrong, what evidence would convince you that a given proposal was not equivalently flawed?
As an absolute minimum it would need to be possible and not obviously exploitable. It should also not look like a hack. Ideally it should be understandable, give me an idea what an implementation might look like, be simple and elegant in design, and seem rigorous enough to make me confident that the lack of visible holes is not merely a fact about the creativity of the looker.
Well, I’ll certainly concede that my suggestion fails the feasibility criterion, since a literal implementation might involve compiling a multiple-choice opinion poll with a countably infinite number of questions, translating it into every language and numbering system in history, and then presenting it to a number of subjects equal to the number of people who’ve ever lived multiplied by the average pre-singularity human lifespan in Planck-seconds multiplied by the number of possible orders in which those questions could be presented multiplied by the number of AI proposals under consideration.
I don’t mind. I was thinking about some more traditional flawed proposals, like the smile-maximizer, how they cast the net broadly enough to catch deeply Unfriendly outcomes, and decided to deliberately err in the other direction: design a test that would be too strict, that even a genuinely Friendly AI might not be able to pass, but that would definitely exclude any Unfriendly outcome.
Please taboo the word ‘hack.’