There may be questions in moral philosophy that we need to answer in order to build a Friendly AI, but most MIRI-associated people don’t think that the bulk of the difficulty of Friendly AI (over generic AGI) is in generating a sufficiently long or sufficiently basic list of intuitively moral English-language sentences. Eliezer thinks the hard part of Friendly AI is stability under self-modification; I’ve heard other suggestions to the effect that the hard part is logical uncertainty, or identifying how preference and motivation are implemented in human brains.
The problems you need to solve in order to convince a hostile human being to become a better person, or to organize a society, or to motivate yourself to do the right thing, aren’t necessarily the same as the problems you need to solve to build the brain of a value-conducive agent from scratch.
Stability under self-modification is a core problem of AGI generally, isn’t it? So isn’t that an effort to solve AGI rather than safety/friendliness (which would be fairly depressing given MIRI’s stated goals)? Does MIRI have a way to define safety/friendliness that isn’t derivative of moral philosophy?
Additionally, many human preferences are almost certainly not moral… surely a key part of the project would be to find some way to separate the two. Preference satisfaction seems like a potentially very unfriendly goal...
If you want to build an unfriendly AI, you probably don’t need to solve the stability problem. If you have a consistently self-improving agent with unstable goals, it should eventually (a) reach an intelligence level where it could solve the stability problem if it wanted to, then (b) randomly arrive at goals that entail their own preservation, then (c) implement the stability solution before the self-preserving goals can get overwritten. You can delegate the stability problem to the AI itself. The reason this doesn’t generalize to friendly AI is that this process doesn’t provide any obvious way for humans to determine which goals the agent has at step (b).
Cheers, thanks for the informative reply.