I would like to request some clarifications.

Given an agent A as you describe, let A_O be its object-level algorithm, and A_M (for meta) be its agent-search algorithm. A itself stands for “execute A_M for 1⁄2 the time, and if that did not result in a self-modification, execute A_O for the remaining time”.
When you say “if A’ is a valid description of an agent and P is a proof that A’ does better on the object level task than A”, do you mean that P proves: 1) that A’_O is better than A_O at the task T; 2) that A’ is better than A at T; or 3) that A’ is better than A_O at T?
If you mean 2), doesn’t it follow that any transformation can yield a speedup of at most a factor of 2? After all, if A solves T by transforming into A’, that is still a case of A solving T, so the only way A’ could be better at T than A is via the penalty A incurs by searching for (A’, P); that penalty is at most 1⁄2 of the time spent by A. This measure of the difference between A and A’ also depends heavily on the search strategy and looks “brittle”.
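The factor-of-2 bound above is just arithmetic; here is a minimal sketch of it (the function name and the fixed-fraction schedule are my own illustration, not anything from the post):

```python
def max_speedup(search_fraction: float) -> float:
    """If A spends `search_fraction` of its total budget on meta-level
    search and the rest on the object task, then an agent A' that skips
    the search entirely has at most this factor more object-level time."""
    assert 0 <= search_fraction < 1
    return 1.0 / (1.0 - search_fraction)

# With the 1/2-time split described above, the bound is a factor of 2.
print(max_speedup(0.5))  # 2.0
```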
It’s not clear to me what you mean in “but suppose A’ considers many candidate modifications (A″, P1), (A″, P2), …, (A″, Pk). It is now much harder for A to show that all of these self-modifications are safe...”: are these different agents A″1, A″2, etc., or the same agent with different proofs? If the latter, it’s not clear what “these self-modifications” refers to, since there is just one modification with several candidate proofs; if the former, the later language that seems to refer to a single unique A″ is confusing.
If we look at the simplest case, where A’ runs twice as fast as A but is equivalent to A as a program (e.g., suppose they run on the same underlying hardware platform, but A’s code contains a systematic slowdown that is trivial to remove and trivially shown not to affect its results), and A is able to prove that, do you claim that this case falls under your scenario, and that A in fact should not be able to self-modify into A’ without assuming its own consistency? If so, could you, for this specific simple case, trace the reasoning and show how it carries through?
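To make the simple case concrete, here is a toy illustration of “A with a trivially removable slowdown” (the function names and the dead computation are hypothetical stand-ins of my own, not the post’s agents):

```python
def a_object(n: int) -> int:
    """Toy stand-in for A's object-level algorithm: correct answer,
    but each iteration also performs dead computation -- a systematic
    slowdown that clearly cannot affect the result."""
    total = 0
    for i in range(n):
        total += i
        _ = sum(range(100))  # dead computation: result is discarded
    return total

def a_prime_object(n: int) -> int:
    """A' is identical except the dead computation is removed,
    so it is faster but provably computes the same function."""
    total = 0
    for i in range(n):
        total += i
    return total

# Same outputs, differing only in wasted work per iteration.
assert a_object(10) == a_prime_object(10)
```

The question is whether even this transparently safe rewrite is blocked by the consistency obstacle, since proving the two programs equivalent is easy, but proving that acting on that proof is safe is where the Löbian issue would bite.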
I mean (following Schmidhuber) that A’ is better than A. Of course if A transforms into A’ then A’ and A must be about equally good, but what we should really ask is “how much better is A than some less reflective approach?” Then we could imagine large improvements: for example, A could use the first fourth of its time doubling the speed of its proof searcher, the next eighth doubling it again, the next sixteenth doubling it again, and end up finding a better object-level solver that would have taken much longer than the entire allotted time to discover using the original proof searcher. The argument I gave suggests that this can’t happen in any obvious way, so that A is actually only a factor of 2 faster than a particular non-reflective proof searcher, but that seems to be a different issue.
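The doubling schedule described above can be sketched numerically; this is a toy accounting of wall-clock time versus effective search work (the schedule and the exact doubling factors are the comment’s hypothetical, and the function is my own illustration):

```python
def effective_work(total_time: float, rounds: int) -> float:
    """Round i spends total_time / 2**(i+2) of wall-clock time
    (a fourth, then an eighth, then a sixteenth, ...) doubling the
    proof searcher's speed; the remaining time then runs the search
    at the final speed. Returns effective work in units of the
    original searcher's speed."""
    speed = 1.0
    remaining = total_time
    for i in range(rounds):
        remaining -= total_time / 2 ** (i + 2)
        speed *= 2
    return remaining * speed

for k in (0, 1, 2, 3, 10):
    print(k, effective_work(1.0, k))
```

Since the improvement rounds consume less than half the budget in total while the speed grows geometrically, the effective work grows without bound in the number of rounds, which is exactly the kind of large improvement over a non-reflective searcher the example gestures at.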
When I wrote “these self-modifications” I meant “these (self-modifications + justifications)”, as safety was a property of a justified self-modification. I’ll try and clarify.
If A can prove that A_O always does better with more time, then A can modify into A’ and simultaneously commit to spending less time on proof search. If it spends much more time on proof search, it runs into the sort of issue I highlighted (consider the additional justified self-modifications it looks at; it can’t prove they are all safe, so it can’t prove that considering them is good). I could try to make this argument clearer and more explicit.
Thanks, this whole part of your post is clearer to me now. I think the post would benefit from integrating these explanations to some degree into the text.