You want to align an AGI. You just started your first programming course in the Computer Science faculty and you have a bright idea: let’s decompose the problem into simpler subproblems!
What you want is to reach a world state that is not a dystopia. Write D for a dystopia predicate on world states: D(S) is true if the state S is dystopian, and false otherwise.
You would like to implement the following algorithm:
Start at a random state S.
While D(S):
Change S a little in a non-dystopian direction.
You aligned AI!
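To make the shape of the pseudo-algorithm concrete, here is a toy sketch. Everything in it is a stand-in: a "world state" is just a list of floats, and `dystopia_score` is an arbitrary made-up function; specifying the real D is exactly the open problem the questions below are about.

```python
import random

def dystopia_score(state):
    # Hypothetical stand-in: lower is "less dystopian".
    # A real D would have to capture everything we mean by dystopia.
    return sum(x * x for x in state)

def D(state, threshold=1.0):
    # The dystopia predicate from the post: true iff the state is dystopian.
    return dystopia_score(state) > threshold

def align(state, step=0.1, max_iters=10_000):
    # The loop above as local search: while the state is dystopian,
    # try a small random perturbation and keep it only if it moves
    # the state in a non-dystopian direction.
    for _ in range(max_iters):
        if not D(state):
            return state  # "You aligned AI!"
        candidate = [x + random.uniform(-step, step) for x in state]
        if dystopia_score(candidate) < dystopia_score(state):
            state = candidate
    return state  # gave up: may still be dystopian

random.seed(0)
S = [random.uniform(-2, 2) for _ in range(3)]
S = align(S)
print(D(S))  # False for this toy setup
```

Note that even this toy version smuggles in the hard parts: it assumes D is cheap to evaluate and that "a little change in a non-dystopian direction" can be found by random perturbation, which is precisely what the questions below interrogate.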
Of course, everything here is vaguely defined. Still, I wonder whether it makes sense to pose my questions before doing that defining work. If so, please interpret them under the most reasonable (?) formalizations of the concepts above, assuming such formalizations are possible and not themselves alignment-complete.
So, we have multiple questions:
1. Is specifying anything of the above (e.g., D) alignment-complete?
2. Is evaluating D(S) alignment-complete?
3. Is changing S a little in a non-dystopian direction alignment-complete?
4. How would you actually use that pseudo-algorithm though?
Now, there’s a concept slightly different from alignment-completeness (I’d guess): “constructing a state U such that D(U) is false”. As far as I know, no one has done this yet, and every proposal tends to fail horribly.
So the questions above are worth repeating in this form:
1. Is specifying anything of the above (e.g., D) easier than constructing U?
2. Is evaluating D(S) easier than constructing U?
3. Is changing S a little in a non-dystopian direction easier than constructing U?
4. How would you actually use that pseudo-algorithm though?
Is checking that a state of the world is not dystopian easier than constructing a non-dystopian state?