Point 1: Meta-Execution and Security Amplification
I have a comment on a specific difficulty with meta-execution as an approach to security amplification. I believe that while the framework limits the “corruptibility” of the individual agents, the system as a whole is still quite vulnerable to adversarial inputs.
As far as I can tell, the meta-execution framework is Turing complete. You could store the tape contents within one pointer and the head location in another, or there’s probably a more direct analogy with lambda calculus. And by Turing complete I mean that there exists some meta-execution agent that, when given any (suitably encoded) description of a Turing machine as input, executes that Turing machine and returns its output.
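To make that concrete, here is a minimal sketch (in Python; the encoding and all names are my own hypothetical choices, not anything from the meta-execution writeups) of how a Turing machine can be simulated by recursive task decomposition in which each node sees only a constant-size view and the rest of the tape stays hidden behind opaque pointers.

```python
# Minimal sketch: a meta-execution-style agent simulating a Turing machine.
# Each "node" sees only a constant-size view (current state, symbol under the
# head) plus opaque pointers to the two halves of the tape; the full tape is
# never visible to any single node. All names here are hypothetical.

BLANK = "_"

def pop(ptr):
    """Dereference a pointer to a tape half: return (top symbol, rest)."""
    return (BLANK, None) if ptr is None else ptr

def push(symbol, ptr):
    """Build a new pointer with `symbol` on top of an existing tape half."""
    return (symbol, ptr)

def meta_node(machine, state, left_ptr, right_ptr):
    """One node's work: read one symbol, apply one transition, then delegate
    the remaining computation to a fresh sub-node with updated pointers."""
    if state in machine["accept_states"]:
        return state
    head, right_rest = pop(right_ptr)
    new_state, written, move = machine["delta"][(state, head)]
    if move == "R":
        left_ptr, right_ptr = push(written, left_ptr), right_rest
    else:  # move == "L"
        prev, left_rest = pop(left_ptr)
        left_ptr, right_ptr = left_rest, push(prev, push(written, right_rest))
    return meta_node(machine, new_state, left_ptr, right_ptr)

# Usage: a toy machine that walks right over 1s and halts on the first blank.
toy = {
    "accept_states": {"halt"},
    "delta": {("scan", "1"): ("scan", "1", "R"),
              ("scan", BLANK): ("halt", BLANK, "R")},
}
tape = push("1", push("1", push("1", None)))
print(meta_node(toy, "scan", None, tape))  # -> "halt"
```

Each call to `meta_node` corresponds to one small node that only ever looks at the current state and the symbol under the head; the tape itself stays behind the pointers, yet the system as a whole runs an arbitrary machine.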
Now, just because the meta-execution framework is Turing complete, this doesn’t mean that any particular agent created in this manner is Turing complete. If our agents were in practice Turing complete, I feel like that would defeat the security-amplification purpose of meta-execution. Maybe the individual nodes cannot be corrupted by the limited input they see, but the system as a whole could be made to perform arbitrary computation and produce arbitrary output on specific inputs. The result of “interpret the input as a Turing machine and run it” is probably not the correct or aligned response to those inputs.
Unfortunately, it seems to be empirically the case that computational systems become Turing complete very easily. Some examples:
Accidentally Turing Complete
mov is Turing Complete
Return Oriented Programming
I would argue that even humans are approximately Turing complete (there are probably input sequences that would cause a person to carry out an arbitrary computation to the best of their abilities). I assume this contributes to the desire for information limitation in meta-execution in the first place.
In particular, return-oriented programming is interesting as an adversarial attack on pre-written programs: it exploits the fact that limited control over execution flow, in the presence of existing code, often forms a Turing complete system, even though the attacker has no control over the existing code itself.
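As a loose analogy (toy code, not a real exploit), the sketch below shows the structural point: the gadgets are fixed in advance and the attacker controls only the order in which they run, yet that is enough to compose a computation of their choosing. All names and numbers here are hypothetical.

```python
# Toy analogy for return-oriented programming (not real exploit code).
# The gadgets are fixed ahead of time; the attacker controls only the
# "chain" (the order in which gadgets execute), analogous to controlling
# return addresses on the stack but not the code itself.

def gadget_inc_a(regs):
    regs["a"] += 1

def gadget_add_b_to_a(regs):
    regs["a"] += regs["b"]

def gadget_mov_a_to_b(regs):
    regs["b"] = regs["a"]

def gadget_clear_a(regs):
    regs["a"] = 0

GADGETS = {
    0: gadget_inc_a,
    1: gadget_add_b_to_a,
    2: gadget_mov_a_to_b,
    3: gadget_clear_a,
}

def run_chain(chain, regs):
    """Execute attacker-chosen gadget indices against the fixed gadgets."""
    for idx in chain:
        GADGETS[idx](regs)
    return regs

# Attacker-chosen chain computing 3 * 4 using only the fixed gadgets:
# set a=3, copy to b, clear a, then add b four times.
chain = [0, 0, 0, 2, 3, 1, 1, 1, 1]
print(run_chain(chain, {"a": 0, "b": 0}))  # -> {'a': 12, 'b': 3}
```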
So I suspect that any meta-execution agent that is practically useful for answering general queries is likely to be Turing complete, and that it will be difficult to avoid Turing completeness (up to a resource limit, which doesn’t help with the arbitrary-code-execution problem).
An addition to this argument thanks to William Saunders: We might end up having to accept that our agent will be Turing complete and hope that the malicious inputs are hard to find or work with low probability. But in that case, limiting the amount of information seen by individual nodes may make it harder for the system to detect and avoid these inputs. So what you gain in per-node security you lose in overall system security.
Point 2: IDA in general
More broadly, my main concern with IDA isn’t that it has a fatal flaw but that it isn’t clear to me how the system helps with ensuring alignment compared to other architectures. I do think that IDA can be used to provide a modest improvement in capabilities with a small loss in alignment (I’m not sure whether this is better or worse than augmenting humans with computational power in other ways), but the alignment error is not zero and increases the larger the improvement in capabilities.
Argument:
It is easy and tempting for the amplification to result in some form of search (“what is the outcome of this action?”, “what is the quality of this outcome?”, repeat), which fails if the human might misevaluate some states (a minimal sketch of this failure mode appears after this argument).
To avoid that, H needs to be very careful about how they use the system.
I don’t believe that it is practically possible to formally specify the rules H needs to follow in order to produce an aligned system (or, if you can, it’s just as hard as specifying the rules for a CPU + RAM architecture). You might disagree with this premise, in which case the rest doesn’t follow.
If we can’t be confident of the rules H needs to follow, then it is very risky just asking H to act as best as they can in this system without knowing how to prevent things from going wrong.
Since I don’t believe specifying IDA-specific rules is any easier than for other architectures, it seems unlikely to me that you’d have a proof about the alignment or corrigibility of such a system that wouldn’t be more generally applicable, in which case why not use a more direct architecture with fewer approximation steps?
To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H’s values (retrievable through IRL, for example). And so must A[i] for every i. So the amplification and distillation procedures must preserve the implicit values of H. If we can prove that distillation preserves these implicit values, then it seems plausible that a similar procedure, with a similar proof, would be able to directly distill the values of H explicitly, and then we could train an agent to behave optimally with respect to those.
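To illustrate the first step of the argument (the sketch referenced above), here is a minimal, hypothetical example of amplification collapsing into search: the stand-in for H’s evaluation is accurate everywhere except on one state, and the argmax reliably selects exactly that state.

```python
# Minimal sketch of the "amplification collapses into search" failure mode.
# `human_evaluate` stands in for H's answers to "what is the quality of this
# outcome?"; it is accurate everywhere except one state it misjudges badly.
# All names and numbers here are hypothetical.

TRUE_VALUE = {"safe_plan": 5, "ok_plan": 3, "trap_plan": -100}

def predict_outcome(action):
    # "What is the outcome of this action?" (identity here, for simplicity)
    return action

def human_evaluate(outcome):
    # "What is the quality of this outcome?" -- H misevaluates one state.
    if outcome == "trap_plan":
        return 10            # looks great to H, is actually disastrous
    return TRUE_VALUE[outcome]

def amplified_search(actions):
    """Naive amplification-as-search: pick the action whose predicted
    outcome H rates highest."""
    return max(actions, key=lambda a: human_evaluate(predict_outcome(a)))

chosen = amplified_search(["safe_plan", "ok_plan", "trap_plan"])
print(chosen, TRUE_VALUE[chosen])  # -> trap_plan -100
```

Because the search optimizes over H’s answers rather than the true quality, a single misevaluated state is enough for the system to steer itself toward it.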
I find your point 1 very interesting but point 2 may be based in part on a misunderstanding.
To expand on the last point, if A[*], the limiting agent, is aligned with H then it must contain at least implicitly some representation of H’s values (retrievable through IRL, for example). And so must A[i] for every i.
I think this is not how Paul hopes his scheme would work. If you read https://www.lesswrong.com/posts/yxzrKb2vFXRkwndQ4/understanding-iterated-distillation-and-amplification-claims, it’s clear that in the LBO variant of IDA, A[1] can’t possibly learn H’s values. Instead A[1] is supposed to learn “corrigibility” from H and then after enough amplifications, A[n] will gain the ability to learn values from some external user (who may or may not be H) and then the “corrigibility” that was learned and preserved through the IDA process is supposed to make it want to help the user achieve their values.
I won’t deny that I’m probably misunderstanding parts of IDA, but if the point is to learn corrigibility from H, couldn’t you just say that corrigibility is a value that H has? Then use the same argument with “corrigibility” in place of “values”. (This assumes that corrigibility is entirely defined with reference to H. If not, replace it with the subset that is defined entirely from H; if that subset is empty, then remove H from the argument.)
If A[*] has H-derived-corrigibility then so must A[1], so distillation must preserve H-derived-corrigibility, so we could instead directly distill H-derived-corrigibility from H which can be used to directly train a powerful agent with that property, which can then learn the values of some other user.
so we could instead directly distill H-derived-corrigibility from H which can be used to directly train a powerful agent with that property
I’m imagining the problem statement for distillation being: we have a powerful aligned/corrigible agent. Now we want to train a faster agent which is also aligned/corrigible.
If there is a way to do this without starting from a more powerful agent, then I agree that we can skip the amplification process and jump straight to the goal.