I suggest that, as a first step, we just aim for an uncompetitive aligned AI: one that might use a lot more training data, or many more invocations of Opt, than the benchmark. (If we can't solve that, it seems fairly strong evidence that a competitive aligned AI is impossible or beyond our abilities. Or if someone proposes a candidate and we can't decide whether it's actually aligned or not, that would also be very useful strategic information that doesn't require the candidate to be competitive.)
Do you already have a solution to the uncompetitive aligned AI problem that you can sketch out? It sounds like you think iterated amplification or debate can be implemented using Opt (in an uncompetitive way), so maybe you can give enough details about that to either show that it is aligned or provide people a chance to find flaws in it?
If dropping competitiveness, what counts as a solution? Is “imitate a human, but run it fast” fair game? We could try to hash out the details in something along those lines, and I think that’s worthwhile, but I don’t think it’s a top priority and I don’t think the difficulties will end up being that similar. I think it may be productive to relax the competitiveness requirement (e.g. to allow solutions that definitely have at most a polynomial slowdown), but probably not a good idea to eliminate it altogether.
If dropping competitiveness, what counts as a solution?
I’m not sure, but mainly because I’m not sure what counts as a solution to your problem. If we had a specification of that, couldn’t we just remove the parts that deal with competitiveness?
Is “imitate a human, but run it fast” fair game?
I guess not, because a human imitation might have selfish goals and not be intent aligned to the user?
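For concreteness, here is a toy sketch of what "imitate a human, but run it fast" might mean with an optimization oracle. Everything here is invented for illustration: the `Opt` stub, the tiny hypothesis class, and the "demonstration" data are stand-ins, and the sketch deliberately says nothing about whether the policy `Opt` returns is intent-aligned; it only has to match the data.

```python
# Toy sketch: use a (stub) optimization oracle to find a policy that
# imitates human demonstrations, then query it as fast as we like.
# The Opt interface and hypothesis class are hypothetical illustrations.
from itertools import product

# "Human demonstrations": input -> action pairs on a tiny boolean task.
demonstrations = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def imitation_loss(policy, data):
    """Fraction of demonstrations the policy gets wrong."""
    return sum(policy(x) != y for x, y in data) / len(data)

def Opt(loss, hypothesis_class):
    """Stub oracle: return the hypothesis minimizing the loss.
    A real Opt would search an astronomically larger space, and nothing
    here rules out a zero-loss policy with arbitrary off-distribution
    (or 'selfish') behavior."""
    return min(hypothesis_class, key=loss)

# Hypothesis class: all lookup tables over the four possible inputs.
inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]

def make_policy(table):
    mapping = dict(zip(inputs, table))
    return lambda x: mapping[x]

hypotheses = [make_policy(t) for t in product([0, 1], repeat=4)]

imitator = Opt(lambda p: imitation_loss(p, demonstrations), hypotheses)

# "Run it fast": query the learned imitator far faster than the human could answer.
assert imitation_loss(imitator, demonstrations) == 0.0
print([imitator(x) for x in inputs])  # -> [0, 1, 1, 0], matching the demonstrations
```

The sketch makes the worry above concrete: `Opt` only sees the imitation loss, so any policy that reproduces the demonstrations is an acceptable answer, including one whose behavior elsewhere is unaligned.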
We could try to hash out the details in something along those lines, and I think that’s worthwhile, but I don’t think it’s a top priority and I don’t think the difficulties will end up being that similar.
What about my suggestion of hashing out the details of how to implement IDA/debate using Opt and then seeing if we can decide whether or not it's aligned?
The point of working in this setting is mostly to constrain the search space or make it easier to construct an impossibility argument.