I haven’t read the paper yet (thanks for posting it, anyway), so maybe the answer to my question is there, but there is something about MIRI’s interest in Löb’s theorem that has always bugged me, specifically:
Unfortunately, the straightforward way of setting up such a model fails catastrophically on the innocent-sounding step “DT1 knows that DT2’s deductions are reliable”. If we try and model DT1 and DT2 as proving statements in two formal systems (one stronger than the other), then the only way that DT1 can make such a statement about DT2’s reliability is if DT1 (and thus both) are in fact unreliable! This counterintuitive roadblock is best explained by reference to Löb’s theorem, and so we turn to the background of that theorem.
Sure, DT1 can’t prove that DT2’s decisions are reliable, and in fact in general it can’t even prove that DT1 itself makes reliable decisions, but DT1 may be able to prove “If DT1’s decisions are reliable, then DT2’s decisions are reliable”. Isn’t that enough for all practical purposes?
Notice that this even makes sense in the limit case where DT2 = DT1, which isn’t necessarily just a theoretical pathological case but can happen in practice when even a non-self-modifying DT1 ponders “Why should I not kill myself?”
Am I missing something? Isn’t Löb’s theorem just essentially a formal way of showing that you can’t prove that you are not insane?
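For reference (this is the standard statement, not anything specific to the paper), writing □P for “P is provable in PA”:

```latex
% Löb's theorem:
\text{if } \mathrm{PA} \vdash (\Box P \rightarrow P), \text{ then } \mathrm{PA} \vdash P.
% Taking P = \bot (falsehood): if PA proved its own consistency,
% i.e. \Box\bot \rightarrow \bot, then PA would prove \bot itself --
% which recovers Gödel's second incompleteness theorem, the formal
% version of "you can't prove that you are not insane".
```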
Good question! Translating your question to the setting of the logical model, you’re suggesting that instead of using provability in Peano Arithmetic as the criterion for justified action, or provability in PA + Con(PA) (which would have the same difficulty), the agent uses provability under the assumption that its current formal system (which includes PA) is consistent.
Unfortunately, this turns out to be an inconsistent formal system!

Thus, you definitely do not want an agent that makes decisions on the criterion “if I assume that my own deductions are reliable, can I show that this is the best action?”, at least not until you’ve come up with a heuristic version of this that doesn’t lead to awful self-fulfilling prophecies.
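A one-line sketch of why the proposal collapses (under the usual reading of the setup, where the agent’s system extends PA and asserts its own consistency):

```latex
% Let T be the agent's system, with Con(T) among its axioms. Then:
T \vdash \mathrm{Con}(T).
% But by Gödel's second incompleteness theorem, any consistent theory
% T extending PA satisfies T \nvdash \mathrm{Con}(T). Hence T must be
% inconsistent -- and an inconsistent agent can "justify" any action.
```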
I don’t think he was talking about self-PA, but rather an altered decision criterion: rather than “if I can prove this is good, do it”, it is “if I can prove that if I am consistent then this is good, do it”. I think this doesn’t have this particular problem, though it does have others, and it still can’t /increase/ in proof strength.
I don’t think he was talking about self-PA, but rather an altered decision criterion: rather than “if I can prove this is good, do it”, it is “if I can prove that if I am consistent then this is good, do it”
Yes.
and it still can’t /increase/ in proof strength.
Mmm, I think I can see it. What about “if I can prove that if a version of me with unbounded computational resources is consistent then this is good, do it”. (*)
It seems to me that this allows increase in proof strength up to the proof strength of that particular ideal reference agent.
(* there should probably be additional constraints specifying that the current agent, and the successor if present, must provably be approximations of the unbounded agent in some conservative way)
“if I can prove that if a version of me with unbounded computational resources is consistent then this is good, do it”
In this formalism we generally assume infinite resources anyway. And even if this is not the case, consistent/inconsistent doesn’t depend on resources, only on the axioms and rules for deduction. So this still doesn’t let you increase in proof strength, although again it should help avoid losing it.
If we are already assuming infinite resources, then do we really need anything stronger than PA?
And even if this is not the case, consistent/inconsistent doesn’t depend on resources, only on the axioms and rules for deduction.
A formal system may be inconsistent, but a resource-bounded theorem prover working on it might never be able to prove any contradiction for a given resource bound. If you increase the resource bound, contradictions may become provable.
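A toy illustration of this last point (my own sketch, with a made-up propositional encoding, not anything from the paper): an inconsistent axiom set whose shortest contradiction takes two modus-ponens steps, so a prover capped at one step reports no contradiction.

```python
# Toy "prover": breadth-first closure under modus ponens, with a step bound.
# Implications are encoded as tuples (antecedent, consequent); atoms are
# strings like 'A', with '~A' as the negation of 'A'.

def close_under_modus_ponens(theorems, steps):
    """All formulas derivable from `theorems` in at most `steps` rounds
    of modus ponens."""
    known = set(theorems)
    for _ in range(steps):
        new = set()
        for f in known:
            if isinstance(f, tuple):          # f == (p, q) means p -> q
                p, q = f
                if p in known and q not in known:
                    new.add(q)
        if not new:
            break
        known |= new
    return known

def finds_contradiction(theorems):
    """A contradiction is deriving both X and ~X for some atom X."""
    atoms = {f for f in theorems if isinstance(f, str)}
    return any(('~' + a) in atoms for a in atoms if not a.startswith('~'))

# Inconsistent axioms: A, A -> B, B -> ~A.
axioms = ['A', ('A', 'B'), ('B', '~A')]

print(finds_contradiction(close_under_modus_ponens(axioms, 1)))  # False: bound too low
print(finds_contradiction(close_under_modus_ponens(axioms, 2)))  # True: ~A now derived alongside A
```

With bound 1 the prover only reaches B; only at bound 2 does it derive ~A and notice the inconsistency, which is exactly the resource-dependence described above.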