I think you are mistaken about the relative efficiency / inefficiency of scientific research. I believe that research is comparably efficient to much of industry, and that many of the things that look like inefficiencies are actually trading off small local gains for large global gains. I’ve come to this conclusion as the result of years of doing scientific research, where almost every promising idea I’ve come up with (including some that I thought quite clever) had already been explored by someone else. In fact, I was typically only able to make progress when solving the problem required a combination of tools, each of which individually was relatively rare in the field.
For instance, my paper on stochastic verification required: (i) familiarity with sum-of-squares programming; (ii) the application of supermartingale
techniques from statistics; and (iii) the ability to produce relatively non-trivial convex relaxations of a difficult optimization
problem. In robotics, most people are familiar with convex optimization, and at least some are familiar with sum-of-squares programming and supermartingales. In fact, at least one other person had already published a fairly similar paper but had not formulated the problem in a way that was useful for systems of practical levels of complexity, probably because they had (i) and (ii) but lacked (iii).
Of course, it is true that your point #3 (“new tools open up promising new angles of attack”) often does lead to low-hanging fruit. In machine learning, often when a new tool is discovered there will be a sizable contingent of the community who devotes resources to exploring possible uses of that tool. However, I think the time-scale for this is shorter than you might imagine. I would guess that, in machine learning at least, most of the “easy” applications of a new tool have been exhausted within five years of its introduction. This is not to say that new applications don’t trickle in, but
they tend to either be esoteric or else require adapting the tool in some non-trivial way.
My impression is that MIRI believes that most of what drives what they perceive to be low-hanging fruit comes from #4 (“progress is only valuable to those with unusual views”). I think this is probably true, but not, as you probably believe, due to differing views about the intelligence explosion; I believe instead that MIRI’s differing views come from a misunderstanding of the context of their research.
For instance, MIRI repeatedly brings up Paul’s probabilistic metamathematics as an important piece of research progress produced by MIRI. I’ve mostly hedged when people ask me about this, because I do think
it is an interesting result, but I feel by now that it has been oversold by MIRI. One could argue that the result is relevant to one or more of logic, philosophy, or AI, but I’ll focus on the last one because it is presumably what MIRI cares about most. The standard argument in favor of this line of research is that Lob’s theorem implies that purely logical systems will have difficulty reasoning about their future behavior, and that therefore it is worth asking whether a probabilistic language of thought can overcome this obstacle. But I believe this is due to a failure to think sufficiently concretely about the problem. To give two examples: first, humans seem perfectly capable of making predictions about their future behavior despite this issue, and I have yet to see an argument for why AIs should be different. Secondly, we manage to routinely prove facts about the behavior of programs (this is the field of program analysis) despite the fact that in theory this should be “undecidable”. This is because the undecidability issues don’t really crop up in practice. If MIRI wanted to make the case that they will crop up for a human-level AI in important ways, they should at least mention and respond to the large literature pointing in the opposite direction, and explain why those tools or their successors would be insufficient.
A full response to the misguidedness of the Lob obstacle as a problem of interest probably necessitates its own post, but hopefully this serves to at least partially illustrate my point. Another example would be decision theory of modal agents. I also won’t take the time to treat this in detail, but will simply note that this work studies a form of decision theory that MIRI itself invented, and that no one else uses or studies. It should therefore perhaps be unsurprising that it is relatively easy to prove novel results about it. It would be the equivalent of an activity I did to amuse myself in high school, which was to invent new mathematical objects and then study their properties. But I think it would be mistaken to point to this as an example of inefficiency in research. (And yes, I do think the idea of formalizing decision theory in terms of code is an interesting one. I just don’t think you get to point to this as an example of why research is inefficient.)
I’m not sure this is the best description of my objections to this post, but I need to start work now and it seems best to avoid keeping this around as a draft for too long, so I’m going to go ahead and post it and wait for everyone to tear it to shreds.
I agree that Luke here overstates the significance of my result, but I do think you miss the point a bit or are uncharitable. Regardless of whether making predictions about your own behavior is fundamentally difficult, we don’t yet understand any formal framework that can capture reasoning of the form “my decisions are good because my beliefs correspond to reality.” Assuming there is a natural formal framework capturing human reasoning (I think the record so far suggests optimism) then there is something interesting that we don’t yet understand. It seems like you are applying the argument: “We know that humans can do X, so why do you think that X is an important problem?” The comment about undecidability issues not applying in practice also seems a bit unfair; for programs that do proof search we know that we cannot prove claims of the desired type based on simple Godelian arguments, and almost all interesting frameworks for reasoning are harder to prove things about than a simple proof search. (Of course the game is that we don’t want to prove things about the algorithms in question, we are happy to form justified beliefs about them in whatever way we can, including inductive inference. But the point is that there are things we don’t understand.)
There are further questions about whether any work at MIRI is a meaningful contribution to this problem or any other. I think that the stuff I’ve worked on is plausibly but not obviously a significant contribution (basically the same status as the other work I’m doing).
Regarding the modal agents stuff, I agree that it’s a toy problem where you should expect progress to be fast (if there was a nice theorem at the end of it, then it wouldn’t be too unusual as a paper in theoretical CS, except for the unfashionable use of mathematical logic). Regarding updateless/timeless/ambient decision theory, it’s a clear step forward for a very idiosyncratic problem, but one for which I think you can make a reasonable case that it’s worthwhile.
I think you shouldn’t be too surprised to make meaningful headway on theoretically interesting questions, even those which will plausibly be important. It seems like in theoretical research today things are still developing reasonably rapidly, and the ratio between plausibly important problems and human capital is very large. I expect that given effort and success at recruiting human capital MIRI can make good headway, in the same sort of way that other theorists do. Optimistically they would be distinguished primarily by working on a class of problems which is unusually important given their values and model of the world (a judgment with which you might disagree).
Of course the game is that we don’t want to prove things about the algorithms in question, we are happy to form justified beliefs about them in whatever way we can, including inductive inference. But the point is that there are things we don’t understand.
And the question is: who cares? The mechanism by which human beings predict their future behavior is not logical inference. Similar ad-hoc Bayesian extrapolation techniques can be used in any general AI without worry about Löbian obstacles. So why is it such a pressing issue?
I don’t wish to take away from the magnitude of your accomplishment. It is an important achievement. But in the long run I don’t think it’s going to be a very useful result in the construction of superhuman AGIs, specifically. And it’s reasonable to ask why MIRI is assigning strategic importance to these issues.
I think you shouldn’t be too surprised to make meaningful headway on theoretically interesting questions, even those which will plausibly be important. It seems like in theoretical research today things are still developing reasonably rapidly, and the ratio between plausibly important problems and human capital is very large.
I agree with this. Luke seems to be making a much stronger claim than the above, though.
It seems like you are applying the argument: “We know that humans can do X, so why do you think that X is an important problem?”
I agree that that would be a bad argument. That was not the argument I intended to make, though I can see why it has been interpreted that way and I should have put more effort into explaining myself. I am rather saying that human reasoning looks so far away from even getting close to running into issues with Godel / Lob, that it seems like a rather abstruse starting point for investigation.
The rest of your comment seems most easily discussed in person, so I’ll do that and hopefully we’ll remember to update the thread with our conclusion.
Secondly, we manage to routinely prove facts about the behavior of programs (this is the field of program analysis) despite the fact that in theory this should be “undecidable”. This is because the undecidability issues don’t really crop up in practice.
As someone with a reasonable acquaintance with program analysis, synthesis, and semantics… YIKES.
Rice’s Theorem is, so to speak, the biggest, nastiest rock we semantics folks have to crawl around on a regular basis. The way we generally do it is by constructing algorithms, semantic frameworks, and even entire programming languages in which undecidability cannot happen in the first place, thus restricting ourselves to analyzing something less than the set of all possible programs.
Now, admittedly, in practice we’ve made a lot of progress this way, because in practice there are really four kinds of programs: ones that provably terminate by design, ones that provably don’t terminate by design, provably undecidable programs (usually programs that rely on the halting behavior of some other program or logic for their own halting behavior), and just plain messed-up what-the-fuck programs.
The last kind are mostly created only by mistake. The third kind come up in program analysis and semantics, but we can usually construct a proof that we’re dealing with a formally undecidable problem there and set reasonable hard bounds on length of iteration or depth of recursion (or even find decidable subclasses of these problems that are decently useful to real people). The second kind is handled by writing coterminating programs over codata. The first kind is handled by writing terminating programs over data.
Undecidability issues do come up in practice, and the current research frontier (MIRI’s Lobian paper, Goedel Machines, AIXI) certainly indicates that these issues definitely come up in the study of Universal Artificial Intelligence. However, for most problems below the level of program analysis or universal induction, undecidability issues can be handled or contained productively by research effort.
To give two examples: first, humans seem perfectly capable of making predictions about their future behavior despite this issue,
Using heuristic methods rather than formal proofs. These already often fail (humans even fail to take the effect of being in a different emotional state properly into account), and that’s without having to deal with the effects of radical self-modifications which might impact your whole reasoning and motivational system.
I don’t think this is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don’t think that drawing a distinction between heuristics and logical proofs is sufficient to do so, since many of the logical obstacles carry over to the heuristic case, and to the extent they don’t this seems important and worth grappling with.
Also note that, even if you did think it was sufficient, I gave you another example that was based purely in the realm of formal logic.
Jacob, have you seen Luke’s interview with me, where I’ve tried to reply to some arguments of the sort you’ve given in this thread and elsewhere?
I don’t think [the fact that humans’ predictions about themselves and each other often fail] is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don’t think that drawing a distinction between heuristics and logical proofs is sufficient to do so, since many of the logical obstacles carry over to the heuristic case, and to the extent they don’t this seems important and worth grappling with.
Perhaps here is a way to get a handle on where we disagree: Suppose we make a whole-brain emulation of Jacob Steinhardt, and you start modifying yourself in an attempt to achieve superintelligence while preserving your values, so that you can save the world. You try to go through billions of (mostly small) changes. In this process, you use careful but imperfect human (well, eventually transhuman) reasoning to figure out which changes are sufficiently safe to make. My expectation is that one of two things happens: Either you fail, ending up with very different values than you started with or stopping functioning completely; or you think very hard about how much confidence you need to have in each self-modification, and how much confidence you can achieve by ordinary human reasoning, and end up not doing a billion of these because you can’t achieve the necessary level of confidence. The only way I know for a human to reach the necessary level of confidence in the majority of the self-modifications would be to use formally verified proofs.
Presumably you disagree. If you could make a convincing case that a whole-brain emulation could safely go through many self-modifications using ordinary human reasoning, that would certainly change my position in the direction that the Löbian obstacle and other diagonalization issues won’t be that important in practice. If you can’t convince me that it’s probably possible and I can’t convince you that it probably isn’t, this might still help understanding where the disagreement is coming from.
Also note that, even if you did think it was sufficient, I gave you another example that was based purely in the realm of formal logic.
I thought the example was pretty terrible. Everybody with more than a passing familiarity with the halting problem, and more generally Rice’s theorem, understands that the result that you can’t decide for every program whether it’s in a given class doesn’t imply that there are no useful classes of programs for which you can do so. MIRI’s argument for the importance of Löb’s theorem is: There’s an obvious way you can try to get stable self-modification, which is to require that if the AI self-modifies, it has to prove that the successor will not destroy the world. But if the AI tries to argue “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world”, that requires the AI to understand the soundness of its own proof system, which is impossible by Löb’s theorem. This seems to me like a straight-up application of what Löb’s theorem actually says, rather than the kind of half-informed misunderstanding that would suggest that program analysis is impossible because of Rice’s theorem.
Here’s a concrete way you could try to get stable self-modification:
Suppose for concreteness that we have a C program, call it X, and that within the C program there is an array called “world_state” of length M and a floating point number called “utility”. A simple instantiation of X would look something like:
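(A minimal sketch; the helper functions observeWorld, applyAction and computeUtility here are just placeholders for whatever sensing, acting and evaluation machinery the system actually has:)

```c
#define M 100   /* length of the world-state array; the value is arbitrary here */

float world_state[M];
float utility;

/* Placeholder stubs standing in for the real sensing / acting / evaluation machinery. */
void observeWorld(float *state)        { (void)state; }
float computeUtility(const float *s)   { (void)s; return 0.0f; }
void applyAction(int action)           { (void)action; }

/* The piece we later consider replacing with chooseAction2. */
int chooseAction(const float *state)   { (void)state; return 0; }

int main(void) {
    for (;;) {
        observeWorld(world_state);                 /* update world_state from observations */
        int action = chooseAction(world_state);    /* pick an action based on the current state */
        applyAction(action);                       /* act on the world */
        utility = computeUtility(world_state);     /* score the resulting state */
    }
}
```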
We would like to consider modifications to X where we replace chooseAction with some new method chooseAction2 to get a program X2. Suppose we want to ensure some condition such as: from the current world state, if we use X2 instead of X, then after some finite period of time the sequence of utilities we get from using chooseAction2 will always be larger than the corresponding sequence if we had used chooseAction. Abusing notation a bit, this is the same as verifying the statement:
“there exists N such that for all n > N, utility2(n) > utility(n)”
[although note that utility2 and utility have fairly complicated descriptions if you actually try to write them out]. Now I agree that reasoning about this for arbitrary choices of chooseAction and chooseAction2 will be quite difficult (probably undecidable, although I haven’t proved that). But the key point is that I get to choose chooseAction2, and there are many decision procedures that can prove such a statement in special cases. For instance, I could partition the space of world states into finitely many pieces and write down a transition function that over-approximates the possible transitions (for instance, by having a transition from Piece1 to Piece2 if any element of Piece1 can transition to any element of Piece2). Then I only need to reason about finite automata, and those are trivially decidable.
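To make that concrete, here is a toy sketch of the finite-abstraction step. The partition into pieces, the transition table, and the “bad” region are all invented for the example, and I’m checking a simple reachability property rather than the utility condition above, but once you’re down to a finite abstraction the check really is just a graph search:

```c
#include <stdio.h>

#define K 4   /* number of pieces in the (hypothetical) partition of world states */

/* trans[i][j] == 1 means: some state in piece i can transition to some state in
   piece j under the proposed chooseAction2 (an over-approximation by construction). */
int trans[K][K] = {
    {0, 1, 0, 0},
    {0, 0, 1, 0},
    {0, 1, 1, 0},
    {0, 0, 0, 1},
};

int bad[K] = {0, 0, 0, 1};   /* piece 3 is the "unacceptable" region in this toy table */

/* Sound check: if no bad piece is reachable in the abstraction, then no bad
   concrete state is reachable under chooseAction2. */
int abstraction_is_safe(int start) {
    int visited[K] = {0};
    int stack[K], top = 0;
    stack[top++] = start;
    visited[start] = 1;
    while (top > 0) {
        int i = stack[--top];
        if (bad[i]) return 0;
        for (int j = 0; j < K; j++)
            if (trans[i][j] && !visited[j]) {
                visited[j] = 1;
                stack[top++] = j;
            }
    }
    return 1;
}

int main(void) {
    printf("safe from piece 0? %s\n", abstraction_is_safe(0) ? "yes" : "can't show it");
    return 0;
}
```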
You could argue that this proof system is fairly weak, but again, the AI gets to tailor its choices of chooseAction2 to be easy to reason about. You could also argue that an AI that could only create instances of chooseAction2 that could be reasoned about as above would be severely hampered, but I think this overlooks the fact that such techniques have been extraordinarily successful at reasoning about fairly complex systems.
In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn’t seem so restrictive to me.
Finally, I agree that such a naive proof strategy as “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world” does not work. I’m not proposing that. The proof system clearly has to do some actual work. But your argument seems to me to be saying “this extremely naive strategy doesn’t work, let’s develop an entire subfield of logic to try to make it work” whereas I advocate “this extremely naive strategy doesn’t work, let’s use a non-naive approach instead, and look at the relevant literature on how to solve similar problems”.
I also agree that the idea of “logical uncertainty” is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty. But it seems like a mistake to me to tackle such problems without at least one, and preferably both, of: an understanding of the surrounding literature; experience with concrete instantiations of the problems at hand.
Sorry for being curmudgeonly there—I did afterwards wish that I had tempered that. The thing is that when you write something like
I also agree that the idea of “logical uncertainty” is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty.
that sounds to me like you’re painting MIRI as working on these topics just because it’s fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you’re supporting this by arguments that miss the point of what MIRI is trying to say. That’s why I found the example of program analysis so annoying—people who think that the halting problem means that program analysis is impossible really are misinformed (really Rice’s theorem, but someone with this misconception wouldn’t be aware of that), both about the state of the field and about why these theorems say what they say. E.g., yes, of course your condition is undecidable as long as there is any choice f(s) of chooseAction2(s) that satisfies it; proof: let chooseAction2(s) be the program that checks whether chooseAction2 satisfies your criterion, and if yes returns chooseAction(s), otherwise f(s). That’s how these proofs always go, and of course that doesn’t mean that there are no programs that are able to verify the condition for an interesting subclass of chooseAction2’s; the obvious interesting example is searching for a proof of the condition in ZFC, and the obvious boring example is that there is a giant look-up table which decides the condition for all choices of chooseAction2(s) of length less than L.
One possibility is that MIRI’s arguments actually do look that terrible to you, but that this is because MIRI hasn’t managed to make them clearly enough. I’m thinking this may be the case because you write:
In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn’t seem so restrictive to me.
First, that’s precisely the “obvious” strategy that’s the starting point for MIRI’s work.
Second, yes, Eliezer’s arguing that this isn’t good enough, but the reason isn’t that there are some safe modifications which aren’t provably safe. The work around the Löbian obstacle has nothing to do with trying to work around undecidability. (I will admit that for a short period at the April workshop I thought this might be a good research direction, because I had my intuitions shaken by the existence of Paul’s system and got overly optimistic, but Paul quickly convinced me that this was an unfounded hope, and in any case the main work around the Löbian obstacle was never really related to this.) MIRI’s argument definitely isn’t that “the above algorithm can’t decide for all chooseAction2 whether they’re safe, therefore it probably can’t decide it for the kind of chooseAction2 we’re interested in, therefore it’s unacceptable”. If that’s how you’ve understood the argument, then I see why you would think that the program analysis example is relevant. (The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we’re interested in, but not because it’s unable to decide this for any generic chooseAction2.)
Third, you seem to imply that your proposal will only take safe actions. You haven’t given an argument for why we should think so, but the implication seems clear: You’re using a chooseAction that is obviously safe as long as it doesn’t rewrite itself, and it will only accept a proposed modification if it comes with a proof that it is safe, so if it does choose to rewrite itself then its successor will in fact be safe as well. Now I think this is fine reasoning, but you don’t seem to agree:
Finally, I agree that such a naive proof strategy as “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world” does not work. I’m not proposing that. The proof system clearly has to do some actual work.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don’t think the AI should be able to use the same sort of reasoning. Eliezer’s argument is that this is in fact reasoning you want to use when building a self-improving AI: Yes, you can reason in more detail about how the AI you are building works, but this AI_1 will build an AI_2 and so on, and when proving that it’s ok to build AI_1 you don’t want to reason in detail about how AI_1,000,000 is going to work (which is built using design principles you don’t understand, by AI_999,999 which is much smarter than you); rather, you want to use general principles to reason that, because of the way AI_1,000,000 came to be, it has to be safe (because AI_999,999 only builds safe AIs, because it was built by AI_999,998 which only builds safe AIs...). But it’s not only you who needs to reason like that because you don’t know and aren’t smart enough to comprehend AI_1,000,000’s exact design; AI_1, which also isn’t that smart, will need to be able to use the same sort of reasoning. Hence, the interest in the Löbian obstacle.
There are caveats to add to this and parts of your comment I haven’t replied to, but I’m running into the same problem as you with your original comment in this thread, having already spent too much time on this. I’d be happy to elaborate if useful. For my part, I’d be interested in your reply to the other part of my comment: do you think I have localized our disagreement correctly?
Oh, one last point that I shouldn’t skip over: I assume the point about MIRI lacking “an understanding of the surrounding literature” refers to the thing about being tripped up at the July workshop by not knowing Gaifman’s work on logical uncertainty well enough. If so, I agree that that was an avoidable fail, but I don’t think it’s indicative of always ignoring the relevant literature or something like that. I’ll also admit that I still haven’t familiarized myself with Gaifman’s work, but that’s because I’m not currently focusing on logical uncertainty, and I intend to do so in the future.
Sorry for being curmudgeonly there—I did afterwards wish that I had tempered that.
Don’t worry, I wasn’t offended :)
that sounds to me like you’re painting MIRI as working on these topics just because it’s fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you’re supporting this by arguments that miss the point of what MIRI is trying to say.
I don’t think that MIRI is working on these topics just because they are fun, and I apologize for implying that. I should note here that I respect the work that you and Paul have done, and as I said at the beginning I was somewhat hesitant to start this discussion at all, because I was worried that it would have a negative impact on either you / Paul’s reputation (regardless of whether my criticisms ended up being justified) or on our relationship. But in the end I decided that it was better to raise my objections in fairly raw form and deal with any damage later.
One possibility is that MIRI’s arguments actually do look that terrible to you
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe
I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use. It seems like Paul’s argument is similar to yours, though, and I’m going to talk to him in person in a few days, so perhaps the most efficient thing will be for me to talk to him and then report back.
Do you think I have localized our disagreement correctly?
I don’t think that “whole brain emulations can safely self-modify” is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: “The argument in favor of studying Lob’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. Standard epistemic hygiene calls for trying to produce concrete examples to motivate this line of work. I have not seen this done by MIRI, and all of the examples I can think of, both from having done AI and verification work myself, and from looking at what my colleagues do in program analysis, point squarely in the opposite direction.”
When I say “failure to understand the surrounding literature”, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.
Good to hear, and thanks for the reassurance :-) And yeah, I do too well know the problem of having too little time to write something polished, and I do certainly prefer having the discussion in fairly raw form to not having it at all.
One possibility is that MIRI’s arguments actually do look that terrible to you
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
I’m not really sure what you mean by a “concrete instantiation”. I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can’t reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can’t write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that’s because I can’t write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they’re trying to verify—reasoning like “this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won’t loop forever”. But heuristically, while it seems possible to reason about the program you’re creating in this way, our task is to ensure that we’re creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won’t decide to blow up the world; it seems clear to me that this is going to require reasoning of the type “the program I’m creating is going to reason correctly about the program it is creating”, which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today’s automated verification techniques.
Writing this, I’m not too confident that this will be helpful to getting the idea across. Hope the face-to-face with Paul with help, perhaps also with translating your intuitions to a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI’s arguments—e.g. there was that LW conversation you had with Eliezer where you pointed out that it’s possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that to achieve confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don’t seem to be capable of this or to extend in some obvious way to dealing with this.
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe
I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use.
I’m fairly sure that the reason your brain goes “it would be safe if we only allow self-modifications when there’s a proof that they’re safe” is that you believe that if there’s a proof that a self-modification is safe, then it is safe—I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
I don’t think that “whole brain emulations can safely self-modify” is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: “The argument in favor of studying Lob’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. [...]”
Thanks for the reply! Thing is, I don’t think that ordinary human reasoning should run into that obstacle, and the “ordinary” is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don’t think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that “whole brain emulations can safely self-modify” might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb’s theorem would be relevant.
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don’t think it’s terribly likely. I can more easily imagine that there would be something completely different from both human reasoning or logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb’s theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn’t seem likely that AGI research will hit on a safe solution automatically. MIRI’s current research seems to me like a relatively promising way of trying to search for a solution that’s close to logical reasoning.
When I say “failure to understand the surrounding literature”, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Ok, I think I probably don’t understand this yet, and making a post about it sounds like a good plan!
I don’t have time to reply to all of this right now, but since you explicitly requested a reply to:
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I’m happy to focus in on it).
We have an AI which controls a robotic arm inside a box. The box also contains various cooking tools and ingredients to make cake, and a big red button that kills a puppy.
We prefer cake to no cake, and we prefer our cake to be delicious and moist, but above all we prefer the companion cu-… puppy to stay alive.
Therefore, we implement in the AI a “puppy safety module” (PSM) which vetoes any course of action proposed by the planning module if it determines that there is any non-negligible probability of the red button being depressed. When the PSM can’t make an accurate prediction, it will always err on the safe side: things like throwing tools at the box walls, where they could bounce unpredictably and hit the red button, don’t get approved. Athena, the Greek goddess of wisdom, justice and math, came down from Mt. Olympus and debugged the code for the PSM. She was also so kind as to check all the compilers, OS and hardware for bugs and glitches.
Do you agree that with this AI the puppy is most likely safe?
Now, let’s assume that the cakes made by this AI aren’t very good. We don’t really know how to program it to make better cakes; we try our best reinforcement learning algorithms but it doesn’t improve much. Then somebody suggests we try allowing the AI to modify its own code: in addition to sending motor commands to the arm, the planning module can also edit a copy of the AI code and eventually replace the whole program with this successor. Of course, these actions must still be approved by the PSM, but they are otherwise unrestricted: in principle they can modify the PSM, or add a bypass, or get rid of it entirely.
Do you think that this AI is unsafe for the puppy?
Do you think it is safe but unable to improve itself in effective ways?
Suppose that in the recesses of the code there is an instantiation of the bubble sort algorithm. The planner proposes to replace it with, say, merge sort. Do you think that the PSM would generally disapprove such a change?
Do you think it would approve it, but it would still be unable to approve modifications that would be needed for significant improvement?
One possibility is that MIRI’s arguments actually do look that terrible to you, but that this is because MIRI hasn’t managed to make them clearly enough.
MIRI claims to have had an important insight on AI design (the so-called “Löbian obstacle”) that experts in relevant fields (AI, model checking, automated theorem proving, etc.) didn’t have. MIRI attempted to communicate their insight, but so far the experts have mostly ignored MIRI’s claims or denied that they are likely to be important and relevant.
What is the most likely explanation for that? It seems that we can narrow it to two hypotheses: A) MIRI’s insight is really relevant and important to AI design, but communication with the experts failed because of some problem on MIRI’s side, or on the experts’ side (e.g. stubbornness, stupidity) or both (e.g. excessively different backgrounds). B) MIRI is mistaken about the value of their insight (possible psychological causes may include confirmation bias, Dunning–Kruger effect, groupthink, overconfident personalities, etc.).
I would say that, barring evidence to the contrary, hypothesis B is the most likely explanation.
What is the most likely explanation for that? It seems that we can narrow it to two hypotheses:
A) MIRI’s insight is really relevant and important to AI design, but communication with the experts failed because of some problem on MIRI’s side, or on the experts’ side (e.g. stubbornness, stupidity) or both (e.g. excessively different backgrounds).
B) MIRI is mistaken about the value of their insight (possible psychological causes may include confirmation bias, Dunning–Kruger effect, groupthink, overconfident personalities, etc.).
I don’t believe these options are exhaustive. “Relevant and valuable” are subjective and time-varying. The published work might be interesting and useful down the line, but not help the problems that AI researchers are working on right now.
It usually takes a few years for the research community to assimilate strange new ideas—sometimes much more than a few years. This isn’t due to a failure on anybody’s part—it’s due to the fact that scientists pick problems where they have a reasonable prospect of success within a few years.
I would give MIRI at least a decade or two before evaluating whether their work had any mainstream traction.
MIRI’s stated goals are similar to those of mainstream AI research, and MIRI’s approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
Do you claim that MIRI is one or two decades ahead of mainstream researchers?
If the answer is no, then how does MIRI (or MIRI donors) evaluate now whether these lines of work are valuable towards advancing their stated goals?
MIRI’s stated goals are similar to those of mainstream AI research, and MIRI’s approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
Research has both ultimate goals (“machines that think”) and short-term goals (“machines that can parse spoken English”). My impression is that the MIRI agenda is relevant to the ultimate goal of AI research, but has only limited overlap with the things people are really working on in the short term. I haven’t seen MIRI work that looked directly relevant to existing work on theorem proving or model checking. (I don’t know much about automated theorem proving, but do know a bit about model checking.)
Do you claim that MIRI is one or two decades ahead of mainstream researchers?
It’s not a matter of “ahead”. Any research area is typically a bunch of separate tracks that proceed separately and eventually merge together or have interconnections. It might be several decades before the MIRI/self modifying AI track merges with the main line of AI or CS research. That’s not necessarily a sign anything is wrong. It took decades of improvement before formal verification or theorem proving become part of the computer science toolkit. I would consider MIRI a success if it follows a similar trajectory.
If the answer is no, then how does MIRI (or MIRI donors) evaluate now whether these lines of work are valuable towards advancing their stated goals?
I can’t imagine any really credible assurance that “this basic research is definitely useful,” for any basic research. The ultimate goal “safe self modifying AI” is too remote to have any idea if we’re on the right track. But if MIRI, motivated by that goal, does neat stuff, I think it’s a safe bet that (A) the people doing the work are clueful, and (B) their work was at least potentially useful in dealing with AI risks. And potentially useful is the best assurance anybody can ever give.
I’m a computer systems guy, not a theorist or AI researcher, but my opinion of MIRI has gradually shifted from “slightly crankish” to “there are some interesting questions here and MIRI might be doing useful work on them that nobody else is currently doing.” My impression is a number of mainstream computer scientists have similar views.
Eliezer recently gave a talk at MIT. If the audience threw food at the stage, I would consider that evidence for MIRI being crankish. If knowledgeable CS types showed up and were receptive or interested, I would consider that a strong vote of confidence. Anybody able to comment?
MIRI’s stated goals are similar to those of mainstream AI research, and MIRI’s approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
It’s definitely not a goal of mainstream AI, and not even a goal of most AGI researchers, to create self-modifying AI that provably preserves its goals. MIRI’s work on this topic doesn’t seem relevant to what mainstream AI researchers want to achieve.
Zooming out from MIRI’s technical work to MIRI’s general mission, it’s certainly true that MIRI’s failure to convince the AI world of the importance of preventing unFriendly AI is Bayesian evidence against MIRI’s perspective being correct. Personally, I don’t find this evidence strong enough to make me think that preventing unFriendly AI isn’t worth working on.
Also, two more points on why MIRI isn’t that likely to produce research AI researchers will see as a direct boon to their field: One, stuff that’s close to something people are already trying to do is more often already worked on; the stuff that people aren’t working on seems more important for MIRI to work on. And two, AGI researchers are particularly interested in results that get us closer to AGI, and MIRI is trying to work on topics that can be published about without bringing the world closer to AGI.
MIRI claims to have had an important insight on AI design (the so-called “Löbian obstacle”) that experts in relevant fields (AI, model checking, automated theorem proving, etc.) didn’t have. MIRI attempted to communicate their insight, but so far the experts have mostly ignored MIRI’s claims or denied that they are likely to be important and relevant.
I wouldn’t say MIRI has tried very hard yet to communicate about the Lobian obstacle to people in the relevant fields. E.g. we haven’t published about the Lobian obstacle in a journal or conference proceedings.
Part of the reason for that is that we don’t expect experts in these fields to be very interested in it. The Lobian obstacle is aiming at better understanding of strongly self-modifying systems, which won’t exist for at least 15 years, and probably longer than that.
Part of the reason for that is that we don’t expect experts in these fields to be very interested in it. The Lobian obstacle is aiming at better understanding of strongly self-modifying systems, which won’t exist for at least 15 years, and probably longer than that.
I agree the AI community won’t be very interested. But it might be worth sending it to some theoretical computer science venue—STOC, say—instead. If nothing else, it would give useful information about how receptive academics are to the topic.
One possibility is that MIRI’s arguments actually do look that terrible to you, but that this is because MIRI hasn’t managed to make them clearly enough
I look forward to a clear, detailed explanation of MIRI’s thinking on this subject. In particular this counter-intuitive result:
The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we’re interested in
I’m in the early stages of writing up my own work on the Löbian obstacle for publication, which will need to include its own (more condensed, rather than expanded) exposition of the Löbian obstacle; but I liked Eliezer’s article, so it would be helpful to know why you didn’t think it argued the point well enough.
I have, although formal logic is not my field so please excuse me if I have misunderstood it.
Eliezer does not demonstrate that overcoming the Löbian obstacle is necessary in the construction of tiling agents, he rather assumes it. No form of program verification is actually required, if you do not use the structure of a logical agent. Consider, for example, the GOLUM architecture[1] which is a form of tiling agent that proceeds by direct experimentation (simulation). It does not require an ability to prove logical facts about the soundness and behavior of its offspring, just an ability to run them in simulation. Of course logical program analysis helps in focusing in on the situations which give rise to differing behavior between the two programs, but there are no Gödelian difficulties there (even if there were you could fall back on probabilistic sampling of environments, searching for setups which trigger different results).
The MIRI argument, as I understand it is: “a program which tried to predict the result of modifying itself runs into a Löbian obstacle; we need to overcome the Löbian obstacle to create self-modifying programs with steadfast goal systems.” (I hope I am not constructing a strawman in simplifying it as such.) The problem comes from the implicit assumption that the self-modifying agent will use methods of formal logic to reason about the future actions of its modified self. This need not be the case! There are other methods which work well in practice, converge on stable solutions under the right circumstances, and have been well explored in theory and in practice.
I’m reminded of the apocryphal story of two space-age engineers that meet after the fall of the Soviet Union. The American, who oversaw a $1.5 million programme to develop the “Astronaut Pen” which would write in hard vacuum and microgravity environments, was curious to know how his Russian counterpart solved the same problem. “Simple,” he replied, “we used a pencil.”
You could expend significant time, energy, and donations to solve the problem of Löbian obstacles in the context of tiling agents for self-modifying AI. Or you could use an existing off-the-shelf solution that solves the problem in a different way.
I don’t think the argument is that AI would be fundamentally different, but rather that “we can reason at least somewhat reliably when making predictions of agents who don’t drastically self-modify, and of whom we have thousands of years of data to help build our predictions on” isn’t good enough to deal with the case of a drastically self-modifying agent that could exhibit entirely novel behavior and cognitive dynamics even if it wasn’t capable of self-modifying. “Somewhat reliably” is fine only as long as a single failure isn’t enough to throw all the rest of your predictions to the trash bin.
I don’t know enough about your second example to feel confident commenting on it.
“Somewhat reliably” is fine only as long as a single failure isn’t enough to throw all the rest of your predictions to the trash bin.
Humans seem pretty good at making correct predictions even if they have made incorrect predictions in the past. More generally, any agent for whom a single wrong prediction throws everything into disarray will probably not continue to function for very long.
I don’t know enough about your second example to feel confident commenting on it.
Fair enough. This is an admirable habit that is all too rare, so have an upvote :).
Humans seem pretty good at making correct predictions even if they have made incorrect predictions in the past. More generally, any agent for whom a single wrong prediction throws everything into disarray will probably not continue to function for very long.
That’s basically my point. A human has to predict the answer to questions of the type “what would I do in situation X”, and their overall behavior is the sum of their actions over all situations, so they can still get the overall result roughly correct as long as they are correct on average. An AI that’s capable of self-modification also has to predict the answer to questions of the type “how would my behavior be affected if I modified my decision-making algorithm in this way”, where the answer doesn’t just influence the behavior in one situation but all the ones that follow. The effects of individual decisions become global rather than local. It needs to be able to make much more reliable predictions if it wants to have a chance of even remaining basically operational over the long term.
The standard argument in favor of this line of research is that Lob’s theorem implies that purely logical systems will have difficulty reasoning about their future behavior, and that therefore it is worth asking whether a probabilistic language of thought can overcome this obstacle. But I believe this is due to a failure to think sufficiently concretely about the problem. To give two examples: first, humans seem perfectly capable of making predictions about their future behavior despite this issue, and I have yet to see an argument for why AIs should be different. Secondly, we manage to routinely prove facts about the behavior of programs (this is the field of program analysis) despite the fact that in theory this should be “undecidable”.
These examples involve predictions generated by processes which are not purely logical systems, and which we don’t understand enough to code into an AI. So it seems like Paul’s idea could be progress towards having a process that makes such predictions about itself that we can understand at that level.
Can you clarify in what sense you think a computer program is not a purely logical system? Or am I misunderstanding you?
ETA: In particular, I’m referring to the point where I said:
Secondly, we manage to routinely prove facts about the behavior of programs (this is the field of program analysis) despite the fact that in theory this should be “undecidable”.
I didn’t mean that the computer program is not a purely logical system, but that the people proving facts about its behavior aren’t purely logical systems.
Program analysis consists of writing computer programs that reason about other computer programs. Is the objection that these programs were written by a human? That seems like a strange objection to me if so.
Ok, then I did not understand exactly what you meant, but I still don’t think this is a counterexample to the problem Paul’s idea tries to get around.
The problem is that logical systems have problems reasoning about their own behavior, not a claim that there is no other logical system that can reason about them. In particular, we are interested in whether an optimization process can model itself as an optimization process, accurately predicting that its future decisions are likely to achieve outcomes that score well on its optimization criteria, that the score will be better if it has more resources, and that it will become much worse if its representation of its criteria gets corrupted (all using abstract reasoning in much less time than fully simulating its future decisions). Can program analysis do that?
ETA: Also, I should note that this is a good question and I’m glad you asked it!
If your question is whether a program analyzer can, when given itself as input, produce sensible results, the answer is yes. Program analyzers are meant to run on arbitrary code, so in particular they can be run on themselves as a special instance. (Actually, nothing particularly special happens in this case as far as I can tell.)
Now, a key point is the formalism we are working in: a program analyzer takes in a program P and some specification S, and checks whether P obeys specification S (for instance, checking that it never accesses memory before it allocates it). Importantly, the analyzer is allowed to do any of the following three things:
report that P satisfies S (with a proof)
report that P does not satisfy S (with a counterexample)
give up (i.e. report “unsure”)
So a key question is how often it gives up! But while there are many instances where modern program analysis tools do give up, there are also many where they don’t. Furthermore, if you are an AI and you propose a modification to your code, and your program analysis subroutine reports “unsure”, you are free (and would be wise) to try a different modification instead. Researchers in the field of program analysis are extremely cognizant of the halting problem (which is closely related to Lob’s theorem) and typically deal with it by over-approximating, e.g. by identifying conditions that are sufficient for halting, although not necessary. As a result one obtains decidability at the cost of precision (although note that the approximation is always sound: if we report that P satisfies S, then P does indeed always satisfy S).
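As a toy illustration of that three-verdict, sound-but-incomplete shape (the loop form and the recognized step functions here are invented for the example, and are of course far cruder than what real analysis tools handle):

```c
#include <stdio.h>

typedef enum { PROVED, REFUTED, UNKNOWN } Verdict;
typedef long (*Step)(long);

long decrement(long x) { return x - 1; }
long identity (long x) { return x; }
long collatz  (long x) { return (x % 2) ? 3 * x + 1 : x / 2; }

/* Toy analyzer for the question "does `while (x > 0) x = f(x);` halt?".
   It only recognizes a couple of step functions and gives up on everything else.
   The verdicts it does return are sound, but plenty of halting (and non-halting)
   loops land in the UNKNOWN bucket. */
Verdict halts(long x0, Step f) {
    if (x0 <= 0)        return PROVED;   /* the loop body never runs at all */
    if (f == decrement) return PROVED;   /* x strictly decreases, so it reaches 0 */
    if (f == identity)  return REFUTED;  /* x never changes, so the guard stays true */
    return UNKNOWN;                      /* be conservative about anything else */
}

const char *show(Verdict v) {
    return v == PROVED ? "halts" : v == REFUTED ? "runs forever" : "unsure";
}

int main(void) {
    printf("decrement from 10: %s\n", show(halts(10, decrement)));
    printf("identity  from 10: %s\n", show(halts(10, identity)));
    printf("collatz   from 10: %s\n", show(halts(10, collatz)));
    return 0;
}
```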
This is a good approach for dealing with the halting problem, but I think that Lob’s theorem is not so closely related that getting around the halting problem means you get around Lob’s theorem.
The theoretical AI that would run into Lob’s theorem would need more general proof producing capability than these relatively simple program analyzers.
It seems like these program analyzers are built around the specification S they check for, with the human programmer doing the work of constructing a structure of a proof which can be filled in to a complete proof by supplying basic facts that the program can check. For example, I have a library that produces .Net assemblies, with byte code that targets a stack-based virtual machine, and I want to verify that instructions that read elements off the stack get elements of the type they are expecting. So I wrote my library to keep track of the types that could be written to the stack at any execution point (represented by a stack of types). It is straightforward to compute the possible stack states after each instruction, given that instruction and the previous possible stack state, and determine if the instruction is legal given the previous state (well, branching makes it more complicated, but not too much). So, in this case, I came up with the structure of tracking possible states after each instruction, and then it was straightforward to write my program to fill in that structure, but I did not, and don’t know how to, write my program to come up with the proof structure. It is therefore easy to be confident that the proof will have nice properties, because the space of possible proofs with this structure is much smaller than the space of all possible proofs.
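A stripped-down illustration of that idea, written in C rather than against the real .Net instruction set (the opcodes and type tags below are invented for the example): the verifier walks the instruction sequence, maintaining a stack of types, and rejects any instruction that would read a value of the wrong type.

#include <stdio.h>

typedef enum { T_INT, T_STR } Type;
typedef enum { PUSH_INT, PUSH_STR, ADD_INTS, PRINT_STR } Op;

#define MAX_STACK 16

typedef struct { Type slots[MAX_STACK]; int depth; } TypeStack;

/* Returns 1 if the instruction is legal given the abstract stack state
 * (and updates the state); returns 0, i.e. verification failure, otherwise. */
static int step(TypeStack *s, Op op) {
    switch (op) {
    case PUSH_INT:
        if (s->depth >= MAX_STACK) return 0;
        s->slots[s->depth++] = T_INT;  return 1;
    case PUSH_STR:
        if (s->depth >= MAX_STACK) return 0;
        s->slots[s->depth++] = T_STR;  return 1;
    case ADD_INTS:   /* pops two ints, pushes their sum (an int) */
        if (s->depth < 2 || s->slots[s->depth-1] != T_INT || s->slots[s->depth-2] != T_INT) return 0;
        s->depth -= 1; s->slots[s->depth-1] = T_INT;  return 1;
    case PRINT_STR:  /* pops a string */
        if (s->depth < 1 || s->slots[s->depth-1] != T_STR) return 0;
        s->depth -= 1;  return 1;
    }
    return 0;
}

static int verify(const Op *prog, int n) {
    TypeStack s = { {T_INT}, 0 };
    for (int i = 0; i < n; i++)
        if (!step(&s, prog[i])) return 0;
    return 1;
}

int main(void) {
    Op ok[]  = { PUSH_INT, PUSH_INT, ADD_INTS };   /* legal */
    Op bad[] = { PUSH_INT, PRINT_STR };            /* reads an int where a string is expected */
    printf("first program verifies: %d, second: %d\n", verify(ok, 3), verify(bad, 2));
    return 0;
}

The proof structure (track a stack of types per instruction) is fixed in advance by the programmer, just as described above; the verifier only fills in and checks the basic facts.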
The theoretical AI that would run into Lob’s theorem would be able to come up with proof structures, as an AI that could only use proof structures prepackaged by human programmers would have huge gaps in its capabilities. This may introduce difficulties not seen in simple program analyzers.
In the example I asked about earlier, where the optimizer needs to prove things about its own future decisions, it runs into an issue that Lob’s theorem applies to. In order to prove that its own future decisions are good, it would rely on the fact that its future decisions are based on its own sound proof system, so it needs to assert that its own proof system is sound, that if its proof system says “X”, then X. Lob’s theorem says that, for all statements X, if a system P (at least as powerful as Peano arithmetic) says “P says ‘X’ implies X” then P says X. So, if the system asserts its own soundness, it asserts everything.
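For reference, here is a standard statement of the theorem (the formalization is mine, not part of the comment). Writing \Box X for “X is provable in the system”, Löb’s theorem says that for any sentence X,

\text{if } \mathrm{PA} \vdash \Box X \rightarrow X \text{, then } \mathrm{PA} \vdash X,

or, internalized, \mathrm{PA} \vdash \Box(\Box X \rightarrow X) \rightarrow \Box X. So a consistent system can only endorse “if I prove X, then X is true” for statements X it already proves; taking X = \bot recovers Gödel’s second incompleteness theorem (the system cannot prove its own consistency). This is the formal version of “if the system asserts its own soundness, it asserts everything.”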
So, in summary:
Lob’s theorem is a problem for generally powerful formal logic based proof systems that assert their own soundness.
Program analyzers are formal logic based proof systems that do not run into Lob’s theorem, because they are not generally powerful, and do not assert their own soundness.
Humans are generally powerful proof generators that can have confidence in their own soundness, but are not purely based on formal logic (we might intuitively dismiss a formal logical proof as “spurious”), and so can get around Lob’s theorem; but we don’t understand how humans do this well enough to write a program to do it.
Paul’s idea of a probabilistic proof system could lead to a generally powerful proof generator with confidence in its own soundness that is not based on formal logic, so that it gets around Lob’s theorem, and that we can understand well enough to write a program for.
It seems like these program analyzers are built around the specification S they check for, with the human programmer doing the work of constructing a structure of a proof which can be filled in to a complete proof by supplying basic facts that the program can check. For example, I have a library that produces .Net assemblies, with byte code that targets a stack-based virtual machine, and I want to verify that instructions that read elements off the stack get elements of the type they are expecting. So I wrote my library to keep track of the types that could be written to the stack at any execution point (represented by a stack of types). It is straightforward to compute the possible stack states after each instruction, given that instruction and the previous possible stack state, and determine if the instruction is legal given the previous state (well, branching makes it more complicated, but not too much). So, in this case, I came up with the structure of tracking possible states after each instruction, and then it was straightforward to write my program to fill in that structure, but I did not, and don’t know how to, write my program to come up with the proof structure. It is therefore easy to be confident that the proof will have nice properties, because the space of possible proofs with this structure is much smaller than the space of all possible proofs.
I’m not sure I’ve understood what you have in mind here, but in the general case complete type checking in .NET (that is, proving that an assembly not only is syntactically well-formed but also never throws type-related exceptions at runtime) is undecidable because of Rice’s theorem.
In the general case, complete type checking is as difficult as proving arbitrary claims in first-order logic.
I am not trying to write a classifier that tells you whether or not an arbitrary program throws a type exception. I wrote a verifier that tells you whether or not an arbitrary program can be proven not to throw type exceptions (except possibly at an explicit cast statement, or a throw exception statement) with a particular proof strategy that covers a huge space of useful, nicely structured programs.
See also jsteinhardt’s comment I was responding to, which discussed getting around the halting problem by allowing the checker to say “I don’t know”.
I am not trying to write a classifier that tells you whether or not an arbitrary program throws a type exception. I wrote a verifier that tells you whether or not an arbitrary program can be proven not to throw type exceptions (except possibly at an explicit cast statement, or a throw exception statement)
I’m not an expert on .NET, but is there anything that can throw a type exception other than an explicit cast or an explicit throw (or the standard library, I suppose)?
There are sequences of .Net instructions that result in the runtime throwing type exceptions, because it tries to read a value of a certain type off the stack, and it gets an incompatible value. This is the situation that my verifier guards against.
The standard .Net runtime also includes a verifier that checks the same thing, and it will not run code that fails this validation unless it is explicitly trusted. So a verifiable .Net assembly will not throw type exceptions without an explicit cast or throw, but an arbitrary assembly may do so. The compilers for the standard languages such as C# will generally only produce verifiable assemblies unless you explicitly mark parts of the source code as “unsafe”; unsafe or unverifiable assemblies need special permissions to run at all.
(There are other properties that both my verifier and the standard verifier check for. The reason I wrote my own is that it produces much more informative descriptions of problems it finds, and it is integrated into my assembly emitting libraries, so it detects problems as the assembly to be emitted is defined, and when run in the debugger, will easily show the compiler code and execution state that caused the problem.)
So a verifiable .Net assembly will not throw type exceptions without an explicit cast or throw, but an arbitrary assembly may do so.
IIUC, unverifiable code does not, or at least is not guaranteed to, politely throw an exception should a type error occur. It may crash the runtime or fail silently leaving the application in an incorrect state.
(There are other properties that both my verifier and the standard verifier check for. The reason I wrote my own is that it produces much more informative descriptions of problems it finds, and it is integrated into my assembly emitting libraries, so it detects problems as the assembly to be emitted is defined, and when run in the debugger, will easily show the compiler code and execution state that caused the problem.)
Ok. I thought that you were considering assemblies that passed the standard .NET verification and you were trying to check for some stronger property (such as absence of runtime exceptions caused by downcasts). That would have been equivalent to arbitrary first-order logic inference. Since you are instead checking for decidable properties, your system is indeed not equivalent to arbitrary first-order logic inference.
But as jsteinhardt says, it is actually possible to write verifiers that attempt to check for undecidable properties, provided that they have the option to give up.
My mathematical logic is a bit rusty, but my impression is that the following are true:
Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem. A reflectively consistent theory may still be incomplete, but any complete theory is necessarily reflectively consistent.
The undecidability of the halting problem is basically Godel’s theorem stated in computational terms. If we could identify a subset L of Turing machines for which the halting problem can be decided, as long as it was closed under operations such as inserting a (non-self-referential) subroutine, then we would be able to verify any (non-self-referential) property of the program that was also expressible in L. This is a sketch of a claim rather than an actual claim that I’ve proved, though.
Finally, I think it’s worth pointing out an actual example of a program analysis tool since I think they are more powerful than you have in mind. The following slides are a good example of such a tool. At a high level, it gets around the problems you are worried about by constructing an over-approximation of the halting problem that is expressible in propositional logic (and thus decidable, in fact it is even in NP). More generally we can construct a sequence of approximations, each expressible in propositional logic, whose conjunction is no longer an approximation but in fact exactly the original statement.
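One way to write out the construction being described (my notation, inferred from the comment rather than taken from the slides): let \varphi_k be the formula obtained by unrolling the program’s transition relation k steps and asserting that no “bad” state is reached within those k steps. Each \varphi_k is a finite propositional statement, so checking it is decidable, and the unbounded safety property is recovered exactly as the infinite conjunction

\mathrm{Safe} \;=\; \bigwedge_{k \ge 0} \varphi_k,

so each individual check is an approximation, but the family of checks taken together is not.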
Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem.
Why do you say that? My understanding is that Godel’s theorem says that a (sufficiently powerful) logical system has true statements it can’t prove, but these statements are excessively complicated and probably not important. Is there some way you envision an AGI being limited in its capacity to achieve its goals by Godel’s theorem, as we envision Lob’s theorem blocking an AGI from trusting its future self to make effective decisions? (Besides where the goals are tailored to be blocked by the theorem: “Prove all true statements in my formal system”)
Finally, I think it’s worth pointing out an actual example of a program analysis tool since I think they are more powerful than you have in mind. The following slides are a good example of such a tool.
As near as I can tell, this is more powerful than other static analysis tools that I have seen (though maybe not, the Google Web Toolkit optimizer is pretty impressive), but it is well within what I expect to be possible, and doesn’t get around Lob’s theorem. (I can’t be too confident in this assessment of its power, because I don’t see a clear claim of what sort of statements it can prove or how it works, except that it seems to detect when inputs to a statement may have invalid values, and that it uses brute-force techniques to analyze functions and then associates a summary of the analysis with each function (constraints on valid inputs and guarantees on outputs?) so that its analysis of call sites can use the summary.) (This sort of program analyzer is interesting in its own right, and I would like to see a more complete explanation of it, but I don’t think its existence says anything about the problems posed by Lob’s theorem.)
Do you agree or disagree that complete implies reflectively consistent? If you agree, then do you agree or disagree that this means avoidance of Godelian obstacles implies avoidance of Lobian obstacles? If you agree with both of those statements, I’m confused as to why:
Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem.
Ah, so by “Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem” you mean if you overcome Godelian obstacles you also overcome Lobian obstacles? I think I agree, but I am not sure that it is relevant, because the program analyzer examples don’t overcome Godelian obstacles, they just cope with the Godelian obstacles, which does not similarly imply coping with or overcoming Lobian obstacles.
I’m guessing that he is using the standard AI terminology of a “logical program” as one which reasons by means of logical inference over a knowledgebase. This is not how human minds work, nor do many AI researchers view it as a viable path forward.
The types of programs which have seen success in the AI revolution that is currently going on, often are not amenable to program analysis. They very often fall in the “undecidable” category, at least with respect to their anticipated behavior.
The types of programs which have seen success in the AI revolution that is currently going on, often are not amenable to program analysis. They very often fall in the “undecidable” category, at least with respect to their anticipated behavior.
This depends which properties you care about. I suspect there isn’t a small model of the AI program that duplicates its output behavior, but you can do a lot with off-the-shelf analysis.
It’s relatively easy to prove things like “this program will only ever use the following subset of available system calls,” or “this program is well typed”, or “the program cannot modify itself”, or “can only modify these fixed specific parts of its state.”
I can also imagine a useful AI program where you can prove a bound on the run-time, of the form “this program will always terminate in at most X steps”, for some actual fixed constant X. (Obviously you cannot do this for programs in general, but if the program only loops over its inputs for a fixed number of iterations or somesuch, you can do it.)
These sorts of properties are far from a general proof “this program is Safe”, but they are non-trivial and useful properties to verify.
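As a toy illustration of the run-time-bound case (a hypothetical fragment, not taken from any real AI system): the only loop below runs over a fixed-size input, so the whole function executes a number of basic operations bounded by a fixed constant, and a verifier can establish the bound by inspecting the loop structure.

#define N 1000

/* Sums a fixed-length score array: exactly N loop iterations, no other
 * loops or recursion, so the step count is bounded by c*N + c' for small
 * constants c, c'. */
long sum_scores(const long scores[N]) {
    long total = 0;
    for (int i = 0; i < N; i++)
        total += scores[i];
    return total;
}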
The FAI-related things you would want to prove are of the form: “when given a world state characterized by X, input percepts characterized by Y, the program always generates outputs characterized by Z.” For many existing and popular AI architectures, we haven’t the foggiest idea how to do this. (It’s no surprise that Eliezer Yudkowsky favors Pearl’s causal probabilistic graph models, where such analysis is at least known to be possible.)
To the extent that X, Y, and Z can be written formally within the programming language, program analysis at least in principle is fully capable of proving such a statement. I apologize if this comes off as rude, but your statement that “we haven’t the foggiest idea how to do this” is flat-out false. While there are certainly unique challenges to reasoning about the sort of code that gets written in machine learning, it seems to me that the main reason we don’t have well-developed analysis tools is that most code doesn’t look like machine learning code, and so there has been little pressure to develop such tools.
Program analysis consists of writing computer programs that reason about other computer programs. Is the objection that these programs were written by a human? That seems like a strange objection to me if so.
Can you give an example? The main reason they are not amenable to program analysis is because the sorts of guarantees they are supposed to satisfy are probabilistic / statistical in nature, and we don’t yet have good techniques for verifying such properties. I am pretty sure that the issue is not undecidability.
I should also clarify that whether or not JGWeissman thought that I meant “logical program” instead of “computer program”, my comment was intended to mean “computer program” and the field of program analysis studies “computer programs”, not “logical programs”.
Ok, throw it back at you: how do you prove the behavior of a deep belief network, or any other type of neural network currently in vogue, short of actually running it? If you do have some way of doing it, can I be coauthor? I didn’t mean to imply that it was proved impossible, just that no one has the faintest idea how to do it—and not for lack of trying.
What does “prove the behavior” mean? If you mean “prove that it will classify this next image correctly” then of course the answer is no, although that is because it’s not even necessarily true. If you mean “prove that the program’s behavior always falls within some well-defined regime”, then yes, we can absolutely prove that sort of statement. Things that are at the boundary of what we can prove (i.e. I don’t know of any existing formal tools that do it, but I’m pretty sure we could make one) would be something like “the weights of the network don’t change too much if we change the input a small amount” or “the prediction error on the training set will decrease monotonically” or “the generalization error is bounded, assuming that the test data ends up being drawn from the same distribution as the training data”.
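For what it’s worth, one property in this general spirit that can be both stated and proven for feedforward networks with 1-Lipschitz activations (e.g. ReLU) is a bound on how much the output can move when the input moves; this particular formalization is mine, not a claim from the comment. If f(x) = W_k\,\sigma(W_{k-1}\,\sigma(\cdots \sigma(W_1 x))), then

\| f(x) - f(y) \| \;\le\; \Big( \prod_{i=1}^{k} \| W_i \|_{\mathrm{op}} \Big) \, \| x - y \| \quad \text{for all } x, y,

where \| \cdot \|_{\mathrm{op}} is the operator norm of each weight matrix. A statement like this is checkable directly from the weights, which is the sense in which it sits near the boundary of what existing tools could be made to verify.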
Presumably for friendliness you want something like “P and P’ satisfy some contract relative to each other” (such as sufficiently similar behavior) which is the sort of thing we are already relatively good at proving. Here I’m imagining that P is the “old” version of a program and P’ is a “modified” version of the program that we are considering using.
I believe that research is comparably efficient to much of industry, and that many of the things that look like inefficiencies are actually trading off small local gains for large global gains.
I understand “trade small local gains for large global gains” as a prescriptive principle, but does it work as a descriptive hypothesis? Why expect academics to be so much better than philanthropists at cause neutrality? When I speak to academics who aren’t also EAs, they are basically never cause neutral, and they even joke about how ridiculously non-cause-neutral everybody in academia is, and how accidental everyone’s choice of focus is, including their own.
I’m not talking about cause neutrality. My point is that even once the general problem has been decided, there are many possible approaches, and academics often do things that seem inefficient but are actually exploring the space of possible approaches (possibly by trying to better understand the objects they are studying).
What level of “general problem” do you have in mind? To a large degree I’m thinking about things like “Gosh, it took (unnecessary) centuries or decades for researchers to launch subfields to study normative uncertainty and intelligence explosion”, and that could be a “lack of cause neutrality” problem. And maybe you’re thinking instead on a smaller scale, and want to say something like “Given that people decide to work on X, they’re relatively efficient in working on X, and exploring the space within X, even if they’re completely missing normative uncertainty and intelligence explosion.”
many of the things that look like inefficiencies are actually trading off small local gains for large global gains
This is an interesting hypothesis, and one I wasn’t thinking of. But hard to measure!
For instance, MIRI repeatedly brings up Paul’s probabilistic metamathematics as an important piece of research progress produced by MIRI.
Out of curiosity, what gives you that impression? I tend to cite it because it is (along with the Lobian cooperation stuff) among the most important results to come out of MIRI’s first couple workshops, not because I can already tell whether it’s an important breakthrough in mathematical logic in general.
As for the purpose and relevance of the Lobian obstacle work, it seems like there might still be a failure of communication there. Since you and Eliezer and I discussed this at length and there still seems to be an unbridged gap, I’m not sure which thing I can say to bridge the gap. Maybe this quote from Paul?
No one thinks that the world will be destroyed because people built AI’s that couldn’t handle the Lobian obstruction. That doesn’t seem like a sensible position, and I think Eliezer explicitly disavows it in the writeup. The point is that we have some frameworks for reasoning about reasoning. Those formalisms don’t capture reflective reasoning, i.e. they don’t provide a formal account of how reflective reasoning could work in principle. The problem Eliezer points to is an obvious problem that any consistent framework for reflective reasoning must resolve.
Working on this problem directly may be less productive than just trying to understand how reflective reasoning works in general—indeed, folks around here definitely try to understand how reflective reasoning works much more broadly, rather than focusing on this problem. The point of this post is to state a precise problem which existing techniques cannot resolve, because that is a common technique for making progress.
Another example would be decision theory of modal agents. I also won’t take the time to treat this in detail, but will simply note that this work studies a form of decision theory that MIRI itself invented, and that no one else uses or studies.
In the OP I actually gave program equilibrium as an example of new theoretical progress that opens up new lines of inquiry, e.g. the modal agents work (though of course there are other pieces contributing to modal agents, too). So yeah, I don’t think the modal agents work is an example of inefficiency.
The examples I gave in the OP for apparent inefficiency in decision theory research were philosophy’s failure to formulate a reliabilist metatheory of instrumental rationality until 2013, even though reliabilist theories of epistemic rationality have been popular since the late 1960s, and also the apparently slow uptake of causal Bayes nets in the causal decision theory world.
Out of curiosity, what gives you that impression? I tend to cite it because it is (along with the Lobian cooperation stuff) among the most important results to come out of MIRI’s first couple workshops, not because I can already tell whether it’s an important breakthrough in mathematical logic in general.
In this very post you placed it in a list next to normative uncertainty and the intelligence explosion. The implication seemed obvious to me but perhaps it was unintended.
I seem to remember other comments / posts where similar sentiments were either expressed or implied, although a quick search doesn’t turn them up, so perhaps I was wrong.
The implication seemed obvious to me but perhaps it was unintended.
Yeah, unintended, but I can see why one might infer that.
Does my “philosophical edge” comment imply importance to you? I was merely trying to say that it’s philosophical even though I’m thinking of it in terms of AI, and it’s not obvious to me, like your first example, why one would read the comment as assigning particular importance to the result.
I think that the comment that I quoted is not by itself objectionable to me. If that’s actually the only example I can come up with, then I think it would be unfair to criticize it, so I will update the parent to remove it.
Luke,
I think you are mistaken about the relative efficiency / inefficiency of scientific research. I believe that research is comparably efficient to much of industry, and that many of the things that look like inefficiencies are actually trading off small local gains for large global gains. I’ve come to this conclusion as the result of years of doing scientific research, where almost every promising idea I’ve come up with (including some that I thought quite clever) had already been explored by someone else. In fact, the typical case for when I was able to make progress was when solving the problem required a combination of tools, each of which individually was relatively rare in the field.
For instance, my paper on stochastic verification required: (i) familiarity of sum-of-squares programming; (ii) the application of supermartingale techniques from statistics; and (iii) the ability to produce relatively non-trivial convex relaxations of a difficult optimization problem. In robotics, most people are familiar with convex optimization, and at least some are familiar with sum-of-squares programming and supermartingales. In fact, at least one other person had already published a fairly similar paper but had not formulated the problem in a way that was useful for systems of practical levels of complexity, probably because they had (i) and (ii) but lacked (iii).
Of course, it is true that your point #3 (“new tools open up promising new angles of attack”) often does lead to low-hanging fruit. In machine learning, often when a new tool is discovered there will be a sizable contingent of the community who devotes resources to exploring possible uses of that tool. However, I think the time-scale for this is shorter than you might imagine. I would guess that, in machine learning at least, most of the ``easy″ applications of a new tool have been exhausted within five years of its introduction. This is not to say that new applications don’t trickle in, but they tend to either be esoteric or else require adapting the tool in some non-trivial way.
My impression is that MIRI believes that most of what drives what they perceive to be low-hanging fruit comes from #4 (“progress is only valuable to those with unusual views”). I think this is probably true, but not, as you probably believe, due to differing views about the intelligent explosion; I believe instead that MIRI’s differing views come from a misunderstanding of the context of their research.
For instance, MIRI repeatedly brings up Paul’s probabilistic metamathematics as an important piece of research progress produced by MIRI. I’ve mostly hedged when people ask me about this, because I do think it is an interesting result, but I feel by now that it has been oversold by MIRI. One could argue that the result is relevant to one or more of logic, philosophy, or AI, but I’ll focus on the last one because it is presumably what MIRI cares about most. The standard argument in favor of this line of research is that Lob’s theorem implies that purely logical systems will have difficulty reasoning about their future behavior, and that therefore it is worth asking whether a probabilistic language of thought can overcome this obstacle. But I believe this is due to a failure to think sufficiently concretely about the problem. To give two examples: first, humans seem perfectly capable of making predictions about their future behavior despite this issue, and I have yet to see an argument for why AIs should be different. Secondly, we manage to routinely prove facts about the behavior of programs (this is the field of program analysis) despite the fact that in theory this should be “undecidable”. This is because the undecidability issues don’t really crop up in practice. If MIRI wanted to make the case that they will crop up for a human-level AI in important ways, they should at least mention and respond to the large literature pointing in the opposite direction, and explain why those tools or their successors would be insufficient.
A full response to the misguidedness of the Lob obstacle as a problem of interest probably necessitates its own post, but hopefully this serves to at least partially illustrate my point. Another example would be decision theory of modal agents. I also won’t take the time to treat this in detail, but will simply note that this work studies a form of decision theory that MIRI itself invented, and that no one else uses or studies. It should therefore perhaps be unsurprising that it is relatively easy to prove novel results about it. It would be the equivalent of an activity I did to amuse myself in high school, which was to invent new mathematical objects and then study their properties. But I think it would be mistaken to point to this as an example of inefficiency in research. (And yes, I do think the idea of formalizing decision theory in terms of code is an interesting one. I just don’t think you get to point to this as an example of why research is inefficient.)
I’m not sure this is the best description of my objections to this post, but I need to start work now and it seems best to avoid keeping this around as a draft for too long, so I’m going to go ahead and post it and wait for everyone to tear it to shreds.
I agree that Luke here overstates the significance of my result, but I do think you miss the point a bit or are uncharitable. Regardless of whether making predictions about your own behavior is fundamentally difficult, we don’t yet understand any formal framework that can capture reasoning of the form “my decisions are good because my beliefs correspond to reality.” Assuming there is a natural formal framework capturing human reasoning (I think the record so far suggests optimism) then there is something interesting that we don’t yet understand. It seems like you are applying the argument: “We know that humans can do X, so why do you think that X is an important problem?” The comment about undecidability issues not applying in practice also seems a bit unfair; for programs that do proof search we know that we cannot prove claims of the desired type based on simple Godelian arguments, and almost all interesting frameworks for reasoning are harder to prove things about than a simple proof search. (Of course the game is that we don’t want to prove things about the algorithms in question, we are happy to form justified beliefs about them in whatever way we can, including inductive inference. But the point is that there are things we don’t understand.)
There are further questions about whether any work at MIRI is a meaningful contribution to this problem or any other. I think that the stuff I’ve worked on is plausibly but not obviously a significant contribution (basically the same status as the other work I’m doing).
Regarding the modal agents stuff, I agree that it’s a toy problem where you should expect progress to be fast (if there was a nice theorem at the end of it, then it wouldn’t be too unusual as a paper in theoretical CS, except for the unfashionable use of mathematical logic). Regarding updateless/timeless/ambient decision theory, it’s a clear step forward for a very idiosyncratic problem, but one for which I think you can make a reasonable case that it’s worthwhile.
I think you shouldn’t be too surprised to make meaningful headway on theoretically interesting questions, even those which will plausibly be important. It seems like in theoretical research today things are still developing reasonably rapidly, and the ratio between plausibly important problems and human capital is very large. I expect that given effort and success at recruiting human capital MIRI can make good headway, in the same sort of way that other theorists do. Optimistically they would be distinguished primarily by working on a class of problems which is unusually important given their values and model of the world (a judgment with which you might disagree).
And the question is: who cares? The mechanism by which human beings predict their future behavior is not logical inference. Similar ad-hoc Bayesian extrapolation techniques can be used in any general AI without worry about Löbian obstacles. So why is it such a pressing issue?
I don’t wish to take away from the magnitude of your accomplishment. It is an important achievement. But in the long run I don’t think it’s going to be a very useful result in the construction of superhuman AGIs, specifically. And it’s reasonable to ask why MIRI is assigning strategic importance to these issues.
I agree with this. Luke seems to be making a much stronger claim than the above, though.
I agree that that would be a bad argument. That was not the argument I intended to make, though I can see why it has been interpreted that way and I should have put more effort into explaining myself. I am rather saying that human reasoning looks so far away from even getting close to running into issues with Godel / Lob, that it seems like a rather abstruse starting point for investigation.
The rest of your comment seems most easily discussed in person, so I’ll do that and hopefully we’ll remember to update the thread with our conclusion.
What makes you say that? Did you see what I said about this here?
As someone with a reasonable acquaintance with program analysis, synthesis, and semantics… YIKES.
Rice’s Theorem is, so to speak, the biggest, nastiest rock we semantics folks have to crawl around on a regular basis. The way we generally do it is by constructing algorithms, semantic frameworks, and even entire programming languages in which undecidability cannot happen in the first place, thus restricting ourselves to analyzing something less than the set of all possible programs.
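For readers who haven’t seen it stated, the theorem being invoked says (standard statement, my wording): for any set P of partial computable functions that is non-trivial, i.e. \emptyset \ne P and P does not contain all partial computable functions, the index set \{\, e : \varphi_e \in P \,\} is undecidable. In other words, no algorithm can decide any non-trivial property of what a program computes from the program’s text, which is exactly why the field retreats to restricted languages and sound approximations.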
Now, admittedly, in practice we’ve made a lot of progress this way, because in practice there are really four kinds of programs: ones that provably terminate by design, ones that provably don’t terminate by design, provably undecidable programs (usually programs that rely on the halting behavior of some other program or logic for their own halting behavior), and just plain messed-up what-the-fuck programs.
The last kind are mostly created only by mistake. The third kind come up in program analysis and semantics, but we can usually construct a proof that we’re dealing with a formally undecidable problem there and set reasonable hard bounds on length of iteration or depth of recursion (or even find decidable subclasses of these problems that are decently useful to real people). The second kind is handled by writing coterminating programs over codata. The first kind is handled by writing terminating programs over data.
Undecidability issues do come up in practice, and the current research frontier (MIRI’s Lobian paper, Goedel Machines, AIXI) certainly indicates that these issues definitely come up in the study of Universal Artificial Intelligence. However, for most problems below the level of program analysis or universal induction, undecidability issues can be handled or contained productively by research effort.
Using heuristic methods rather than formal proofs. These already often fail (humans even fail to take the effect of being in a different emotional state properly into account), and that’s without having to deal with the effects of radical self-modifications which might impact your whole reasoning and motivational system.
I don’t think this is sufficient to dismiss my example. Whether or not we prove things, we certainly have some way of reasoning at least somewhat reliably about how we and others will behave. It seems important to ask why we expect AI to be fundamentally different; I don’t think that drawing a distinction between heuristics and logical proofs is sufficient to do so, since many of the logical obstacles carry over to the heuristic case, and to the extent they don’t this seems important and worth grappling with.
Also note that, even if you did think it was sufficient, I gave you another example that was based purely in the realm of formal logic.
Jacob, have you seen Luke’s interview with me, where I’ve tried to reply to some arguments of the sort you’ve given in this thread and elsewhere?
Perhaps here is a way to get a handle on where we disagree: Suppose we make a whole-brain emulation of Jacob Steinhardt, and you start modifying yourself in an attempt to achieve superintelligence while preserving your values, so that you can save the world. You try to go through billions of (mostly small) changes. In this process, you use careful but imperfect human (well, eventually transhuman) reasoning to figure out which changes are sufficiently safe to make. My expectation is that one of two things happens: Either you fail, ending up with very different values than you started with or stopping functioning completely; or you think very hard about how much confidence you need to have in each self-modification, and how much confidence you can achieve by ordinary human reasoning, and end up not doing a billion of these because you can’t achieve the necessary level of confidence. The only way I know for a human to reach the necessary level of confidence in the majority of the self-modifications would be to use formally verified proofs.
Presumably you disagree. If you could make a convincing case that a whole-brain emulation could safely go through many self-modifications using ordinary human reasoning, that would certainly change my position in the direction that the Löbian obstacle and other diagonalization issues won’t be that important in practice. If you can’t convince me that it’s probably possible and I can’t convince you that it probably isn’t, this might still help understanding where the disagreement is coming from.
I thought the example was pretty terrible. Everybody with more than passing familiarity with the halting problem, and more generally Rice’s theorem, understands that the result that you can’t decide for every program whether it’s in a given class doesn’t imply that there are no useful classes of programs for which you can do so. MIRI’s argument for the importance of Löb’s theorem is: There’s an obvious way you can try to get stable self-modification, which is to require that if the AI self-modifies, it has to prove that the successor will not destroy the world. But if the AI tries to argue “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world”, that requires the AI to understand the soundness of its own proof system, which is impossible by Löb’s theorem. This seems to me like a straight-up application of what Löb’s theorem actually says, rather than the kind of half-informed misunderstanding that would suggest that program analysis is impossible because Rice’s theorem.
Glad to see you’re doing well, Benja :)
Here’s a concrete way you could try to get stable self-modification:
Suppose for concreteness that we have a C program, call it X, and that within the C program there is an array called “world_state” of length M and a floating point number called “utility”. A simple instantiation of X would look something like:
while (true) {
    action = chooseAction(world_state);                      /* pick the next action */
    world_state = propagateWorldState(world_state, action);  /* simulate its effect on the world state */
    utility = calculateUtility(world_state);                  /* score the resulting state */
}
We would like to consider modifications to X where we replace chooseAction with some new method chooseAction2 to get a program X2. Suppose we want to ensure some condition such as: from the current world state, if we use X2 instead of X, then after some finite period of time the sequence of utilities we get from using chooseAction2 will always be larger than the corresponding sequence if we have used chooseAction. Abusing notation a bit, this is the same as verifying the statement:
“there exists N such that for all n > N, utility2(n) > utility(n)”
[although note that utility2 and utility have fairly complicated descriptions if you actually try to write them out]. Now I agree that reasoning about this for arbitrary choices of chooseAction and chooseAction2 will be quite difficult (probably undecidable although I haven’t proved that). But the key point is that I get to choose chooseAction2, and there are many decision procedures that can prove such a statement in special cases. For instance, I could partition the space of world states into finitely many pieces, write down a transition function that over-approximates the possible transitions (for instance, by having a transition from Piece1 to Piece2 if any element of Piece1 can transition to any element of Piece2). Then I only need to reason about finite automata and those are trivially decidable.
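Here is a toy version of that strategy in C (the pieces, the abstract transition relation, and the “bad” piece are all made up for illustration). The abstract transitions over-approximate the concrete ones, so if no bad piece is reachable in the abstraction, no bad concrete state is reachable either; if a bad piece is reachable, the result is merely inconclusive, matching the sound-but-imprecise pattern discussed earlier in the thread.

#include <stdio.h>

#define K 4   /* number of abstract pieces the world states are partitioned into */

/* trans[i][j] = 1 if SOME concrete state in piece i may step to a state in piece j */
static const int trans[K][K] = {
    {0, 1, 0, 0},   /* piece 0: initial states */
    {0, 1, 1, 0},
    {0, 0, 1, 0},
    {0, 0, 0, 1},   /* piece 3: "bad" states (here unreachable from piece 0) */
};

/* Brute-force reachability over the finite abstraction. */
static int reachable(int from, int to) {
    int seen[K] = {0}, stack[K], top = 0;
    stack[top++] = from; seen[from] = 1;
    while (top > 0) {
        int cur = stack[--top];
        if (cur == to) return 1;
        for (int next = 0; next < K; next++)
            if (trans[cur][next] && !seen[next]) { seen[next] = 1; stack[top++] = next; }
    }
    return 0;
}

int main(void) {
    printf("abstraction proves safety: %s\n", reachable(0, 3) ? "no (inconclusive)" : "yes");
    return 0;
}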
You could argue that this proof system is fairly weak, but again, the AI gets to tailor its choices of chooseAction2 to be easy to reason about. You could also argue that an AI that could only create instances of chooseAction2 that could be reasoned about as above would be severely hampered, but I think this overlooks the fact that such techniques have been extraordinarily successful at reasoning about fairly complex systems.
In addition, there are more general proof strategies than the above if that one does not satisfy you. For instance, I could just require that any proposed modification to chooseAction2 come paired with a proof that that modification will be safe. Now I agree that there exist choices of chooseAction2 that are safe but not provably safe and this strategy disallows all such modifications. But that doesn’t seem so restrictive to me.
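A minimal sketch of that second strategy, assuming a hypothetical Proof certificate type and check_proof routine (neither is a real API; a real checker would verify a formal certificate step by step against a fixed proof system): a proposed chooseAction2 is installed only if its accompanying proof checks, and is otherwise rejected, so modifications that are safe but not provably safe are simply never adopted.

#include <stdio.h>

typedef struct { int length; /* ... encoding of a formal certificate ... */ } Proof;
typedef int (*ChooseAction)(const int *world_state);

/* Placeholder proof checker: sound but maximally imprecise, it accepts nothing.
 * A real implementation would check the certificate against the safety spec. */
static int check_proof(ChooseAction candidate, const Proof *proof) {
    (void)candidate; (void)proof;
    return 0;
}

/* Install the modification only if it comes paired with a checkable proof. */
static ChooseAction maybe_upgrade(ChooseAction current, ChooseAction candidate, const Proof *proof) {
    return check_proof(candidate, proof) ? candidate : current;
}

static int old_policy(const int *world_state) { (void)world_state; return 0; }
static int new_policy(const int *world_state) { (void)world_state; return 1; }

int main(void) {
    Proof p = {0};
    ChooseAction policy = old_policy;
    policy = maybe_upgrade(policy, new_policy, &p);   /* rejected: the proof does not check */
    printf("running %s policy\n", policy == old_policy ? "the original" : "the modified");
    return 0;
}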
Finally, I agree that such a naive proof strategy as “doing the following trivial self-modification is safe because the modified me will only do things that it proves won’t destroy the world, thus it won’t destroy the world” does not work. I’m not proposing that. The proof system clearly has to do some actual work. But your argument seems to me to be saying “this extremely naive strategy doesn’t work, let’s develop an entire subfield of logic to try to make it work” whereas I advocate “this extremely naive strategy doesn’t work, let’s use a non-naive approach instead, and look at the relevant literature on how to solve similar problems”.
I also agree that the idea of “logical uncertainty” is very interesting. I spend much of my time as a grad student working on problems that could be construed as versions of logical uncertainty. But it seems like a mistake to me to tackle such problems without at least one, and preferably both, of: an understanding of the surrounding literature; experience with concrete instantiations of the problems at hand.
Sorry for being curmudgeonly there; I did afterwards wish that I had tempered that. The thing is that when you write something like “It would be the equivalent of an activity I did to amuse myself in high school, which was to invent new mathematical objects and then study their properties,”
that sounds to me like you’re painting MIRI as working on these topics just because it’s fun, and supporting its work by arguments that are obviously naive to someone who knows the field, and that you’re supporting this by arguments that miss the point of what MIRI is trying to say. That’s why I found the example of program analysis so annoying—people who think that the halting problem means that program analysis is impossible really are misinformed (actually Rice’s theorem, really, but someone with this misconception wouldn’t be aware of that), both about the state of the field and about why these theorems say what they say. E.g., yes, of course your condition is undecidable as long as there is any choice f(s) of chooseAction2(s) that satisfies it; proof: let chooseAction2(s) be the program that checks whether chooseAction2(s) satisfies your criterion, and if yes return chooseAction(s), otherwise f(s). That’s how these proofs always go, and of course that doesn’t mean that there are no programs that are able to verify the condition for an interesting subclass of chooseAction2′s; the obvious interesting example is searching for a proof of the condition in ZFC, and the obvious boring example is that there is a giant look-up table which decides the condition for all choices of chooseAction2(s) of length less than L.
One possibility is that MIRI’s arguments actually do look that terrible to you, but that this is because MIRI hasn’t managed to make them clearly enough. I’m thinking this may be the case because you write:
First, that’s precisely the “obvious” strategy that’s the starting point for MIRI’s work.
Second, yes, Eliezer’s arguing that this isn’t good enough, but the reason isn’t that there are some safe modifications which aren’t provably safe. The work around the Löbian obstacle has nothing to do with trying to work around undecidability. (I will admit that for a short period at the April workshop I thought this might be a good research direction, because I had my intuitions shaken by the existence of Paul’s system and got overly optimistic, but Paul quickly convinced me that this was an unfounded hope, and in any case the main work around the Löbian obstacle was never really related to this.) MIRI’s argument definitely isn’t that “the above algorithm can’t decide for all chooseAction2 whether they’re safe, therefore it probably can’t decide it for the kind of chooseAction2 we’re interested in, therefore it’s unacceptable”. If that’s how you’ve understood the argument, then I see why you would think that the program analysis example is relevant. (The argument is indeed that it seems to be unable to decide safety for the chooseAction2 we’re interested in, but not because it’s unable to decide this for any generic chooseAction2.)
Third, you seem to imply that your proposal will only take safe actions. You haven’t given an argument for why we should think so, but the implication seems clear: You’re using a chooseAction that is obviously safe as long as it doesn’t rewrite itself, and it will only accept a proposed modification if it comes with a proof that it is safe, so if it does choose to rewrite itself then its successor will in fact be safe as well. Now I think this is fine reasoning, but you don’t seem to agree:
You seem to be quite willing to use that reasoning yourself to show that the initial AI is safe, but you don’t think the AI should be able to use the same sort of reasoning. Eliezer’s argument is that this is in fact reasoning you want to use when building a self-improving AI: Yes, you can reason in more detail about how the AI you are building works, but this AI_1 will build an AI_2 and so on, and when proving that it’s ok to build AI_1 you don’t want to reason in detail about how AI_1,000,000 is going to work (which is built using design principles you don’t understand, by AI_999,999, which is much smarter than you); rather, you want to use general principles to reason that, because of the way AI_1,000,000 came to be, it has to be safe (because AI_999,999 only builds safe AIs, because it was built by AI_999,998, which only builds safe AIs...). But it’s not only you who needs to reason like that, because you don’t know and aren’t smart enough to comprehend AI_1,000,000’s exact design; AI_1, which also isn’t that smart, will need to be able to use the same sort of reasoning. Hence, the interest in the Löbian obstacle.
There are caveats to add to this and parts of your comment I haven’t replied to, but I’m running into the same problem as you with your original comment in this thread, having already spent too much time on this. I’d be happy to elaborate if useful. For my part, I’d be interested in your reply to the other part of my comment: do you think I have localized our disagreement correctly?
Oh, one last point that I shouldn’t skip over: I assume the point about MIRI lacking “an understanding of the surrounding literature” refers to the thing about being tripped up at the July workshop by not knowing Gaifman’s work on logical uncertainty well enough. If so, I agree that that was an avoidable fail, but I don’t think it’s indicative of always ignoring the relevant literature or something like that. I’ll also admit that I still haven’t familiarized myself more with Gaifman’s work, but that’s because I’m not currently focusing on logical uncertainty, and I intend to do so in the future.
Don’t worry, I wasn’t offended :)
I don’t think that MIRI is working on these topics just because they are fun, and I apologize for implying that. I should note here that I respect the work that you and Paul have done, and as I said at the beginning I was somewhat hesitant to start this discussion at all, because I was worried that it would have a negative impact on either you / Paul’s reputation (regardless of whether my criticisms ended up being justified) or on our relationship. But in the end I decided that it was better to raise my objections in fairly raw form and deal with any damage later.
What I would say is that the arguments start to look really fishy when one thinks about concrete instantiations of the problem.
I’m not sure I understand what you’re saying here, but I’m not convinced that this is the sort of reasoning I’d use. It seems like Paul’s argument is similar to yours, though, and I’m going to talk to him in person in a few days, so perhaps the most efficient thing will be for me to talk to him and then report back.
I don’t think that “whole brain emulations can safely self-modify” is a good description of our disagreements. I think that this comment (the one you just made) does a better job of it. But I should also add that my real objection is something more like: “The argument in favor of studying Lob’s theorem is very abstract and it is fairly unintuitive that human reasoning should run into that obstacle. Standard epistemic hygiene calls for trying to produce concrete examples to motivate this line of work. I have not seen this done by MIRI, and all of the examples I can think of, both from having done AI and verification work myself, and from looking at what my colleagues do in program analysis, points in the squarely opposite direction.”
When I say “failure to understand the surrounding literature”, I am referring more to a common MIRI failure mode of failing to sanity-check their ideas / theories with concrete examples / evidence. I doubt that this comment is the best place to go into that, but perhaps I will make a top-level post about this in the near future.
Sorry for ducking most of the technical points, as I said, I hope that talking to Paul will resolve most of them.
Good to hear, and thanks for the reassurance :-) And yeah, I do too well know the problem of having too little time to write something polished, and I do certainly prefer having the discussion in fairly raw form to not having it at all.
I’m not really sure what you mean by a “concrete instantiation”. I can think of concrete toy models, of AIs using logical reasoning which know an exact description of their environment as a logical formula, which can’t reason in the way I believe is what we want to achieve, because of the Löbian obstacle. I can’t write down a self-rewriting AGI living in the real world that runs into the Löbian obstacle, but that’s because I can’t write down any AGI that lives in the real world.
My reason for thinking that the Löbian obstacle may be relevant is that, as mentioned in the interview, I think that a real-world seed FAI will probably use (something very much like) formal proofs to achieve the high level of confidence it needs in most of its self-modifications. I feel that formally specified toy models + this informal picture of a real-world FAI are as close to thinking about concrete instantiations as I can get at this point.
I may be wrong about this, but it seems to me that when you think about concrete instantiations, you look towards solutions that reason about the precise behavior of the program they’re trying to verify—reasoning like “this variable gets decremented in each iteration of this loop, and when it reaches zero we exit the loop, so we won’t loop forever”. But heuristically, while it seems possible to reason about the program you’re creating in this way, our task is to ensure that we’re creating a program which creates a program which creates a program which goes out to learn about the world and look for the most efficient way to use transistors it finds in the external environment to achieve its goals, and we want to verify that those transistors won’t decide to blow up the world; it seems clear to me that this is going to require reasoning of the type “the program I’m creating is going to reason correctly about the program it is creating”, which is the kind of reasoning that runs into the Löbian obstacle, rather than the kind of reasoning applied by today’s automated verification techniques.
Writing this, I’m not too confident that this will be helpful to getting the idea across. Hope the face-to-face with Paul will help, perhaps also with translating your intuitions to a language that better matches the way I think about things.
I think that the point above would be really helpful to clarify, though. This seems to be a recurring theme in my reactions to your comments on MIRI’s arguments—e.g. there was that LW conversation you had with Eliezer where you pointed out that it’s possible to verify properties probabilistically in more interesting ways than running a lot of independent trials, and I go, yeah, but how is that going to help with verifying whether the far-future descendant of an AI we build now, when it has entire solar systems of computronium to run on, is going to avoid running simulations which by accident contain suffering sentient beings? It seems that to achieve confidence that this far-future descendant will behave in a sensible way, without unduly restricting the details of how it is going to work, is going to need fairly abstract reasoning, and the sort of tools you point to don’t seem to be capable of this or to extend in some obvious way to dealing with this.
I’m fairly sure that the reason your brain goes “it would be safe if we only allow self-modifications when there’s a proof that they’re safe” is that you believe that if there’s a proof that a self-modification is safe, then it is safe—I think this is probably a communication problem between us rather than you actually wanting to use different reasoning. But again, hopefully the face-to-face with Paul can help with that.
Thanks for the reply! Thing is, I don’t think that ordinary human reasoning should run into that obstacle, and the “ordinary” is just to exclude humans reasoning by writing out formal proofs in a fixed proof system and having these proofs checked by a computer. But I don’t think that ordinary human reasoning can achieve the level of confidence an FAI needs to achieve in its self-rewrites, and the only way I currently know how an FAI could plausibly reach that confidence is through logical reasoning. I thought that “whole brain emulations can safely self-modify” might describe our disagreement because that would explain why you think that human reasoning not being subject to Löb’s theorem would be relevant.
My next best guess is that you think that even though human reasoning can’t safely self-modify, its existence suggests that it’s likely that there is some form of reasoning which is more like human reasoning than logical reasoning and therefore not subject to Löb’s theorem, but which is sufficiently safe for a self-modifying FAI. Request for reply: Would that be right?
I can imagine that that might be the case, but I don’t think it’s terribly likely. I can more easily imagine that there would be something completely different from both human reasoning or logical reasoning, or something quite similar to normal logical reasoning but not subject to Löb’s theorem. But if so, how will we find it? Unless essentially every kind of reasoning except human reasoning can easily be made safe, it doesn’t seem likely that AGI research will hit on a safe solution automatically. MIRI’s current research seems to me like a relatively promising way of trying to search for a solution that’s close to logical reasoning.
Ok, I think I probably don’t understand this yet, and making a post about it sounds like a good plan!
No problem, and hope so as well.
I don’t have time to reply to all of this right now, but since you explicitly requested a reply to:
The answer is yes, I think this is essentially right although I would probably want to add some hedges to my version of the statement (and of course the usual hedge that our intuitions probably conflict at multiple points but that this is probably the major one and I’m happy to focus in on it).
Thanks!
Thought experiment:
We have an AI which controls a robotic arm inside a box. The box also contains various cooking tools and ingredients to make cake, and a big red button that kills a puppy.
We prefer cake to no cake, and we prefer our cake to be delicious and moist cake, but above all we prefer the companion cu-… puppy to stay alive.
Therefore, we implement in the AI a “puppy safety module” (PSM) which vetoes any course of action proposed by the planning module if it determines that there is any non-negligible probability of the red button being depressed.
When the PSM can’t make an accurate prediction, it will always err on the safe side: things like throwing tools at the box walls where they can bounce unpredictably and hit the red button don’t get approved.
Athena, the Greek goddess of wisdom, justice and math, came down from Mt. Olympus and debugged the code for the PSM. She was also kind enough to check all the compilers, OS and hardware for bugs and glitches.
Do you agree that with this AI the puppy is most likely safe?
Now, let’s assume that the cakes made by this AI aren’t very good. We don’t really know how to program it to make better cakes; we try our best reinforcement learning algorithms, but it doesn’t improve much. Then somebody suggests we allow the AI to modify its own code:
In addition to sending motor commands to the arm, the planning module can also edit a copy of the AI’s code and eventually replace the whole program with this successor. Of course, these actions must still be approved by the PSM, but they are otherwise unrestricted: in principle they can modify the PSM, add a bypass, or get rid of it entirely.
Do you think that this AI is unsafe for the puppy?
Do you think it is safe but unable to improve itself in effective ways?
Since the PSM was designed without self-modification in mind, “safe but unable to improve itself in effective ways”.
(Not sure how this thought experiment helps the discussion along.)
Can you please explain your reasoning?
Suppose that in the recesses of the code there is an implementation of the bubble sort algorithm. The planner proposes to replace it with, say, merge sort. Do you think that the PSM would generally disapprove of such a change?
Do you think it would approve it, but it would still be unable to approve modifications that would be needed for significant improvement?
MIRI claims to have had an important insight about AI design (the so-called “Löbian obstacle”) that experts in relevant fields (AI, model checking, automated theorem proving, etc.) didn’t have. MIRI has attempted to communicate this insight, but so far the experts have mostly ignored MIRI’s claims or denied that they are likely to be important and relevant.
What is the most likely explanation for that? It seems that we can narrow it to two hypotheses:
A) MIRI’s insight is really relevant and important to AI design, but communication with the experts failed because of some problem on MIRI’s side, or on the experts’ side (e.g. stubbornness, stupidity) or both (e.g. excessively different backgrounds).
B) MIRI is mistaken about the value of their insight (possible psychological causes may include confirmation bias, Dunning–Kruger effect, groupthink, overconfident personalities, etc.).
I would say that, barring evidence to the contrary, hypothesis B is the most likely explanation.
I don’t believe these options are exhaustive. “Relevant and valuable” are subjective and time-varying. The published work might be interesting and useful down the line, but not help with the problems that AI researchers are working on right now.
It usually takes a few years for the research community to assimilate strange new ideas—sometimes much more than a few years. This isn’t due to a failure on anybody’s part—it’s due to the fact that scientists pick problems where they have a reasonable prospect of success within a few years.
I would give MIRI at least a decade or two before evaluating whether their work had any mainstream traction.
MIRI’s stated goals are similar to those of mainstream AI research, and MIRI’s approach in particular includes as subgoals the goals of research fields such as model checking and automated theorem proving.
Do you claim that MIRI is one or two decades ahead of mainstream researchers?
If the answer is no, then how does MIRI (or MIRI donors) evaluate now whether these lines of work are valuable towards advancing their stated goals?
Research has both ultimate goals (“machines that think”) and short-term goals (“machines that can parse spoken English”). My impression is that the MIRI agenda is relevant to the ultimate goal of AI research, but has only limited overlap with the things people are really working on in the short term. I haven’t seen MIRI work that looked directly relevant to existing work on theorem proving or model checking. (I don’t know much about automated theorem proving, but do know a bit about model checking.)
It’s not a matter of “ahead”. Any research area is typically a bunch of separate tracks that proceed separately and eventually merge together or have interconnections. It might be several decades before the MIRI/self-modifying-AI track merges with the main line of AI or CS research. That’s not necessarily a sign anything is wrong. It took decades of improvement before formal verification and theorem proving became part of the computer science toolkit. I would consider MIRI a success if it follows a similar trajectory.
I can’t imagine any really credible assurance that “this basic research is definitely useful,” for any basic research. The ultimate goal “safe self modifying AI” is too remote to have any idea if we’re on the right track. But if MIRI, motivated by that goal, does neat stuff, I think it’s a safe bet that (A) the people doing the work are clueful, and (B) their work was at least potentially useful in dealing with AI risks. And potentially useful is the best assurance anybody can ever give.
I’m a computer systems guy, not a theorist or AI researcher, but my opinion of MIRI has gradually shifted from “slightly crankish” to “there are some interesting questions here and MIRI might be doing useful work on them that nobody else is currently doing.” My impression is a number of mainstream computer scientists have similar views.
Eliezer recently gave a talk at MIT. If the audience threw food at the stage, I would consider that evidence for MIRI being crankish. If knowledgeable CS types showed up and were receptive or interested, I would consider that a strong vote of confidence. Anybody able to comment?
It’s definitely not a goal of mainstream AI, and not even a goal of most AGI researchers, to create self-modifying AI that provably preserves its goals. MIRI’s work on this topic doesn’t seem relevant to what mainstream AI researchers want to achieve.
Zooming out from MIRI’s technical work to MIRI’s general mission, it’s certainly true that MIRI’s failure to convince the AI world of the importance of preventing unFriendly AI is Bayesian evidence against MIRI’s perspective being correct. Personally, I don’t find this evidence strong enough to make me think that preventing unFriendly AI isn’t worth working on.
Also, two more points on why MIRI isn’t that likely to produce research AI researchers will see as a direct boon to their field: One, stuff that’s close to something people are already trying to do is more often already being worked on; the stuff that people aren’t working on seems more important for MIRI to work on. And two, AGI researchers in particular are particularly interested in results that get us closer to AGI, and MIRI is trying to work on topics that can be published about without bringing the world closer to AGI.
I wouldn’t say MIRI has tried very hard yet to communicate about the Lobian obstacle to people in the relevant fields. E.g. we haven’t published about the Lobian obstacle in a journal or conference proceedings.
Part of the reason for that is that we don’t expect experts in these fields to be very interested in it. The Lobian obstacle is aiming at better understanding of strongly self-modifying systems, which won’t exist for at least 15 years, and probably longer than that.
I agree the AI community won’t be very interested. But it might be worth sending it to some theoretical computer science venue—STOC, say—instead. If nothing else, it would give useful information about how receptive academics are to the topic.
I look forward to a clear, detailed explanation of MIRI’s thinking on this subject. In particular this counter-intuitive result:
deserves some technical elaboration.
Mark, have you read Eliezer’s article about the Löbian obstacle, and what was your reaction to it?
I’m in the early stages of writing up my own work on the Löbian obstacle for publication, which will need to include its own (more condensed, rather than expanded) exposition of the Löbian obstacle; but I liked Eliezer’s article, so it would be helpful to know why you didn’t think it argued the point well enough.
I have, although formal logic is not my field so please excuse me if I have misunderstood it.
Eliezer does not demonstrate that overcoming the Löbian obstacle is necessary for the construction of tiling agents; rather, he assumes it. No form of program verification is actually required if you do not use the structure of a logical agent. Consider, for example, the GOLEM architecture[1], which is a form of tiling agent that proceeds by direct experimentation (simulation). It does not require an ability to prove logical facts about the soundness and behavior of its offspring, just an ability to run them in simulation. Of course, logical program analysis helps in focusing on the situations which give rise to differing behavior between the two programs, but there are no Gödelian difficulties there (even if there were, you could fall back on probabilistic sampling of environments, searching for setups which trigger different results).
The MIRI argument, as I understand it, is: “a program which tries to predict the result of modifying itself runs into a Löbian obstacle; we need to overcome the Löbian obstacle to create self-modifying programs with steadfast goal systems.” (I hope I am not constructing a strawman in simplifying it as such.) The problem comes from the implicit assumption that the self-modifying agent will use methods of formal logic to reason about the future actions of its modified self. This need not be the case! There are other methods which work well in practice, converge on stable solutions under the right circumstances, and have been well explored in theory and in practice.
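To illustrate the general shape of the alternative (this is only a sketch of “vet by simulation rather than by proof”; the function names are placeholders, and it is not the actual GOLEM algorithm):

import random

def vet_successor(current, successor, sample_env, n_trials=10000, tolerance=0.0):
    """Approve `successor` only if it matches `current` on sampled simulated environments."""
    disagreements = 0
    for _ in range(n_trials):
        env = sample_env(random.random())   # draw a simulated environment
        if current(env) != successor(env):  # run both programs on it
            disagreements += 1
    return disagreements / n_trials <= tolerance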
I’m reminded of the apocryphal story of two space-age engineers that meet after the fall of the Soviet Union. The American, who oversaw a $1.5 million programme to develop the “Astronaut Pen” which would write in hard vacuum and microgravity environments, was curious to know how his Russian counterpart solved the same problem. “Simple,” he replied, “we used a pencil.”
You could expend significant time, energy, and donations to solve the problem of Löbian obstacles in the context of tiling agents for self-modifying AI. Or you could use an existing off-the-shelf solution that solves the problem in a different way.
[1] http://goertzel.org/GOLEM.pdf
Doesn’t the non-apocryphal version of that story have some relevance?
http://en.wikipedia.org/wiki/Space_Pen
http://www.snopes.com/business/genius/spacepen.asp
Using a space pencil could cause your spaceship to light on fire. Sometimes it pays to be careful.
I don’t think the argument is that AI would be fundamentally different, but rather that “we can reason at least somewhat reliably when making predictions of agents who don’t drastically self-modify, and of whom we have thousands of years of data to help build our predictions on” isn’t good enough to deal with the case of a drastically self-modifying agent that could exhibit entirely novel behavior and cognitive dynamics even if it wasn’t capable of self-modifying. “Somewhat reliably” is fine only as long as a single failure isn’t enough to throw all the rest of your predictions to the trash bin.
I don’t know enough about your second example to feel confident commenting on it.
Humans seem pretty good at making correct predictions even if they have made incorrect predictions in the past. More generally, any agent for whom a single wrong prediction throws everything into disarray will probably not continue to function for very long.
Fair enough. This is an admirable habit that is all too rare, so have an upvote :).
That’s basically my point. A human has to predict the answer to questions of the type “what would I do in situation X”, and their overall behavior is the sum of their actions over all situations, so they can still get the overall result roughly correct as long as they are correct on average. An AI that’s capable of self-modification also has to predict the answer to questions of the type “how would my behavior be affected if I modified my decision-making algorithm in this way”, where the answer doesn’t just influence the behavior in one situation but all the ones that follow. The effects of individual decisions become global rather than local. It needs to be able to make much more reliable predictions if it wants to have a chance of even remaining basically operational over the long term.
Thanks. :)
And more important, its creators want to be sure that it will be very reliable before they switch it on.
These examples involve predictions generated by processes which are not purely logical systems, and which we don’t understand enough to code into an AI. So it seems like Paul’s idea could be progress towards having a process that makes such predictions about itself that we can understand at that level.
Can you clarify in what sense you think a computer program is not a purely logical system? Or am I misunderstanding you?
ETA: In particular, I’m referring to the point where I said:
I didn’t mean that the computer program is not a purely logical system, but that the people proving facts about its behavior aren’t purely logical systems.
Program analysis consists of writing computer programs that reason about other computer programs. Is the objection that these programs were written by a human? That seems like a strange objection to me if so.
Ok, then I did not understand exactly what you meant, but I still don’t think this is a counterexample to the problem Paul’s idea tries to get around.
The problem is that logical systems have trouble reasoning about their own behavior; the claim is not that no other logical system can reason about them. In particular, we are interested in whether an optimization process can model itself as an optimization process: accurately predicting that its future decisions are likely to achieve outcomes that score well on its optimization criteria, that the score will be better if it has more resources, and that it will become much worse if its representation of its criteria gets corrupted (all using abstract reasoning, in much less time than fully simulating its future decisions). Can program analysis do that?
ETA: Also, I should note that this is a good question and I’m glad you asked it!
If your question is whether a program analyzer can, when given itself as input, produce sensible results, the answer is yes. Program analyzers are meant to run on arbitrary code, so in particular they can be run on themselves as a special instance. (Actually, nothing particularly special happens in this case as far as I can tell.)
Now, a key point is the formalism we are working in: a program analyzer takes in a program P and some specification S, and checks whether P obeys specification S (for instance, checking that it never accesses memory before it allocates it). Importantly, the analyzer is allowed to give any one of the following three answers:
report that P satisfies S (with a proof)
report that P does not satisfy S (with a counterexample)
give up (i.e. report “unsure”)
So a key question is how often it gives up! But while there are many instances where modern program analysis tools do give up, there are also many where they don’t. Furthermore, if you are an AI and you propose a modification to your code, and your program analysis subroutine reports “unsure”, you are free (and would be wise) to try a different modification instead. Researchers in the field of program analysis are extremely cognizant of the halting problem (which is closely related to Lob’s theorem) and typically deal with it by over-approximating, e.g. by identifying conditions that would be sufficient for halting, although not necessarily necessary. As a result one obtains decidability at the cost of precision (although note that the approximation is always sound: if we report that P satisfies S, then P does indeed always satisfy S).
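Concretely, the interface looks something like the following (hypothetical names, just to illustrate the three outcomes and how a self-modifying agent might use them):

from enum import Enum

class Verdict(Enum):
    SATISFIES = 1   # with a proof
    VIOLATES = 2    # with a counterexample
    UNSURE = 3      # the analyzer gave up

def choose_modification(candidates, analyze, spec):
    """Adopt only a rewrite the analyzer can certify; skip anything UNSURE or VIOLATES."""
    for program in candidates:
        if analyze(program, spec) is Verdict.SATISFIES:
            return program
    return None  # keep the current code if nothing can be certified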
This is a good approach for dealing with the halting problem, but I think that Lob’s theorem is not so closely related that getting around the halting problem means you get around Lob’s theorem.
The theoretical AI that would run into Lob’s theorem would need more general proof producing capability than these relatively simple program analyzers.
It seems like these program analyzers are built around the specification S they check for, with the human programmer doing the work of constructing the structure of a proof, which can then be filled in to a complete proof by supplying basic facts that the program can check. For example, I have a library that produces .Net assemblies, with byte code that targets a stack-based virtual machine, and I want to verify that instructions that read elements off the stack get elements of the type they are expecting. So I wrote my library to keep track of the types that could be on the stack at any execution point (represented by a stack of types). It is straightforward to compute the possible stack states after each instruction, given that instruction and the previous possible stack state, and to determine whether the instruction is legal given the previous state (well, branching makes it more complicated, but not too much). So, in this case, I came up with the structure of tracking possible states after each instruction, and then it was straightforward to write my program to fill in that structure; but I did not, and don’t know how to, write my program to come up with the proof structure itself. It is therefore easy to be confident that the proof will have nice properties, because the space of possible proofs with this structure is much smaller than the space of all possible proofs.
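To make that concrete, here is a toy version of the kind of check I mean (illustrative only, with a made-up four-instruction language; it is not my actual verifier, and it omits branching):

def verify(instructions):
    """Track the stack of *types* (not values) through a straight-line instruction
    sequence, and reject any instruction that would pop a type it is not expecting."""
    type_stack = []
    for op in instructions:
        if op == "push_int":
            type_stack.append("int")
        elif op == "push_str":
            type_stack.append("str")
        elif op == "add_int":  # expects two ints on top of the stack
            if type_stack[-2:] != ["int", "int"]:
                return False, f"add_int expects [int, int], found {type_stack[-2:]}"
            type_stack[-2:] = ["int"]
        elif op == "concat":   # expects two strings on top of the stack
            if type_stack[-2:] != ["str", "str"]:
                return False, f"concat expects [str, str], found {type_stack[-2:]}"
            type_stack[-2:] = ["str"]
        else:
            return False, f"unknown instruction {op}"
    return True, type_stack

# Rejected, because add_int would see ["int", "str"] on the stack:
ok, detail = verify(["push_int", "push_str", "add_int"])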
The theoretical AI that would run into Lob’s theorem would be able to come up with proof structures, as an AI that could only use proof structures prepackaged by human programmers would have huge gaps in its capabilities. This may introduce difficulties not seen in simple program analyzers.
In the example I asked about earlier, where the optimizer needs to prove things about its own future decisions, it runs into an issue that Lob’s theorem applies to. In order to prove that its own future decisions are good, it would rely on the fact that its future decisions are based on its own sound proof system, so it needs to assert that its own proof system is sound: that if its proof system says “X”, then X. Lob’s theorem says that, for all statements X, if a system P (at least as powerful as Peano arithmetic) proves “if P proves ‘X’, then X”, then P proves X. So, if the system asserts its own soundness, it asserts anything.
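In standard provability-logic notation, writing \Box_P X for “P proves X”, this is:

    P \vdash (\Box_P X \rightarrow X) \;\implies\; P \vdash X \qquad \text{(for any $P$ at least as strong as Peano arithmetic)}

So a system that asserts its own soundness, \Box_P X \rightarrow X for every X, thereby proves every X.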
So, in summary:
Lob’s theorem is a problem for generally powerful formal logic based proof systems that assert their own soundness.
Program analyzers are formal logic based proof systems that do not run into Lob’s theorem, because they are not generally powerful, and do not assert their own soundness.
Humans are generally powerful proof generators that can have confidence in their own soundness, but are not purely based on formal logic (we might intuitively dismiss a formal logical proof as “spurious”), and so can get around Lob’s theorem; but we don’t understand how humans do this well enough to write a program to do it.
Paul’s idea of a probabilistic proof system could lead to a generally powerful proof generator with confidence in its own soundness that is not based on formal logic, so that it gets around Lob’s theorem, that we can understand well enough to write a program for.
I’m not sure I’ve understood what you have in mind here, but in the general case complete type checking in .NET (that is, proving that an assembly not only is syntactically well-formed but also never throws type-related exceptions at runtime) is undecidable because of Rice’s theorem.
In the general case, complete type checking is as difficult as proving arbitrary claims in first-order logic.
I am not trying to write a classifier that tells you whether or not an arbitrary program throws a type exception. I wrote a verifier that tells you whether or not an arbitrary program can be proven not to throw type exceptions (except possibly at an explicit cast statement, or a throw exception statement) with a particular proof strategy that covers a huge space of useful, nicely structured programs.
See also jsteinhardt’s comment I was responding to, which discussed getting around the halting problem by allowing the checker to say “I don’t know”.
I’m not an expert on .NET, but is there anything that can throw a type exception other than an explicit cast or an explicit throw (or the standard library, I suppose)?
There are sequences of .Net instructions that result in the runtime throwing type exceptions, because it tries to read a value of a certain type off the stack and gets an incompatible value. This is the situation that my verifier guards against.
The standard .Net runtime also includes a verifier that checks the same thing, and it will not run code that fails this validation unless it is explicitly trusted. So a verifiable .Net assembly will not throw type exceptions without an explicit cast or throw, but an arbitrary assembly may do so. The compilers for the standard languages such as C# will generally only produce verifiable assemblies unless you explicitly mark parts of the source code as “unsafe”, and unsafe or unverifiable assemblies need special permissions to run at all.
(There are other properties that both my verifier and the standard verifier check for. The reason I wrote my own is that it produces much more informative descriptions of problems it finds, and it is integrated into my assembly emitting libraries, so it detects problems as the assembly to be emitted is defined, and when run in the debugger, will easily show the compiler code and execution state that caused the problem.)
Thanks for the clarification.
IIUC, unverifiable code does not, or at least is not guaranteed to, politely throw an exception should a type error occur. It may crash the runtime or fail silently leaving the application in an incorrect state.
Ok. I thought that you were considering assemblies that passed the standard .NET verification and you were trying to check for some stronger property (such as absence of runtime exceptions caused by downcasts). That would have been equivalent to arbitrary first-order logic inference.
Since you are instead checking for decidable properties, your system is indeed not equivalent to arbitrary first-order logic inference.
But as jsteinhardt says, it is actually possible to write verifiers that attempt to check for undecidable properties, provided that they have the option to give up.
My mathematical logic is a bit rusty, but my impression is that the following are true:
Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem. A reflectively consistent theory may still be incomplete, but any complete theory is necessarily reflectively consistent.
The undecidability of the halting problem is basically Godel’s theorem stated in computational terms. If we could identify a subset L of Turing machines for which the halting problem can be decided, as long as it was closed under operations such as inserting a (non-self-referential) subroutine, then we would be able to verify any (non-self-referential) property of the program that was also expressible in L. This is a sketch of a claim rather than an actual claim that I’ve proved, though.
Finally, I think it’s worth pointing out an actual example of a program analysis tool, since I think they are more powerful than you have in mind. The following slides are a good example of such a tool. At a high level, it gets around the problems you are worried about by constructing an over-approximation of the halting problem that is expressible in propositional logic (and thus decidable; in fact it is even in NP). More generally, we can construct a sequence of approximations, each expressible in propositional logic, whose conjunction is no longer an approximation but in fact exactly the original statement.
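To give a flavor of the bounded-checking idea, here is a simplified stand-in (real tools encode the k-step unrolling as a propositional formula and hand it to a SAT solver; this sketch just enumerates states explicitly, and all the names are hypothetical):

def violates_within(initial_states, step, bad, k):
    """Explore all behaviors of a transition system up to depth k.
    Returns (True, depth) if a bad state is reachable within k steps, or
    (False, None) otherwise (which says nothing about deeper behaviors)."""
    frontier = set(initial_states)
    seen = set(frontier)
    for depth in range(k + 1):
        if any(bad(s) for s in frontier):
            return True, depth
        frontier = {t for s in frontier for t in step(s)} - seen
        seen |= frontier
    return False, None

# Example: a counter starting at 0, stepping by +1 or +2, checked against "never reaches 10".
result = violates_within({0}, lambda s: {s + 1, s + 2}, lambda s: s >= 10, k=4)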
Why do you say that? My understanding is that Godel’s theorem says that a (sufficiently powerful) logical system has true statements it can’t prove, but these statements are excessively complicated and probably not important. Is there some way you envision an AGI being limited in its capacity to achieve its goals by Godel’s theorem, as we envision Lob’s theorem blocking an AGI from trusting its future self to make effective decisions? (Besides where the goals are tailored to be blocked by the theorem: “Prove all true statements in my formal system”)
As near as I can tell, this is more powerful than other static analysis tools that I have seen (though maybe not; the Google Web Toolkit optimizer is pretty impressive), but it is well within what I expect to be possible, and doesn’t get around Lob’s theorem. (I can’t be too confident in this assessment of its power, because I don’t see a clear claim of what sort of statements it can prove or of how it works, except that it seems to detect when inputs to a statement may have invalid values, and that it uses brute-force techniques to analyze functions and then associates a summary of the analysis with each function (constraints on valid inputs and guarantees on outputs?) so that its analysis of call sites can use the summary.) (This sort of program analyzer is interesting in its own right, and I would like to see a more complete explanation of it, but I don’t think its existence says anything about the problems posed by Lob’s theorem.)
Do you agree or disagree that complete implies reflectively consistent? If you agree, then do you agree or disagree that this means avoidance of Godelian obstacles implies avoidance of Lobian obstacles? If you agree with both of those statements, I’m confused as to why:
is a controversial statement.
Ah, so by “Godel’s theorem presents a strictly stronger obstacle than Lob’s theorem” you mean if you overcome Godelian obstacles you also overcome Lobian obstacles? I think I agree, but I am not sure that it is relevant, because the program analyzer examples don’t overcome Godelian obstacles, they just cope with the Godelian obstacles, which does not similarly imply coping with or overcoming Lobian obstacles.
I’m guessing that he is using the standard AI terminology of a “logical program” as one which reasons by means of logical inference over a knowledgebase. This is not how human minds work, nor do many AI researchers view it as a viable path forward.
The types of programs that have seen success in the current AI revolution are often not amenable to program analysis. They very often fall into the “undecidable” category, at least with respect to their anticipated behavior.
This depends which properties you care about. I suspect there isn’t a small model of the AI program that duplicates its output behavior, but you can do a lot with off-the-shelf analysis.
It’s relatively easy to prove things like “this program will only ever use the following subset of available system calls,” or “this program is well typed”, or “the program cannot modify itself”, or “can only modify these fixed specific parts of its state.”
I can also imagine a useful AI program where you can prove a bound on the run-time, of the form “this program will always terminate in at most X steps”, for some actual fixed constant X. (Obviously you cannot do this for programs in general, but if the program only loops over its inputs for a fixed number of iterations or somesuch, you can do it.)
These sorts of properties are far from a general proof “this program is Safe”, but they are non-trivial and useful properties to verify.
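As one concrete instance of the run-time bound mentioned above: for a toy language in which every loop carries an explicit constant iteration bound, a worst-case step count can be computed syntactically (this mini-language is hypothetical, just to illustrate; real tools work on actual program text):

def step_bound(stmt):
    """Worst-case step count; anything without an explicit bound is rejected."""
    kind = stmt[0]
    if kind == "op":                      # ("op",): one primitive step
        return 1
    if kind == "seq":                     # ("seq", [s1, s2, ...])
        return sum(step_bound(s) for s in stmt[1])
    if kind == "if":                      # ("if", then_branch, else_branch)
        return 1 + max(step_bound(stmt[1]), step_bound(stmt[2]))
    if kind == "loop":                    # ("loop", n, body): at most n iterations
        return stmt[1] * step_bound(stmt[2])
    raise ValueError(f"unbounded or unknown construct: {kind}")

# "This program always terminates in at most X steps":
program = ("seq", [("op",), ("loop", 100, ("seq", [("op",), ("op",)])), ("op",)])
X = step_bound(program)  # 1 + 100 * 2 + 1 = 202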
The FAI-related things you would want to prove are of the form: “when given a world state characterized by X, input percepts characterized by Y, the program always generates outputs characterized by Z.” For many existing and popular AI architectures, we haven’t the foggiest idea how to do this. (It’s no surprise that Eliezer Yudkowsky favors Pearl’s causal probabilistic graph models, where such analysis is at least known to be possible.)
To the extent that X, Y, and Z can be written formally within the programming language, program analysis at least in principle is fully capable of proving such a statement. I apologize if this comes off as rude, but your statement that “we haven’t the foggiest idea how to do this” is flat-out false. While there are certainly unique challenges to reasoning about the sort of code that gets written in machine learning, it seems to me that the main reason we don’t have well-developed analysis tools is that most code doesn’t look like machine learning code, and so there has been little pressure to develop such tools.
Can you give an example? The main reason they are not amenable to program analysis is because the sorts of guarantees they are supposed to satisfy are probabilistic / statistical in nature, and we don’t yet have good techniques for verifying such properties. I am pretty sure that the issue is not undecidability.
I should also clarify that, whether or not JGWeissman thought I meant “logical program” rather than “computer program”, my comment was intended to mean “computer program”, and the field of program analysis studies computer programs, not logical programs.
Ok, I’ll throw it back at you: how do you prove the behavior of a deep belief network, or any other type of neural network currently in vogue, short of actually running it? If you do have some way of doing it, can I be coauthor? I didn’t mean to imply that it was proved impossible, just that no one has the faintest idea how to do it—and not for lack of trying.
What does “prove the behavior” mean? If you mean “prove that it will classify this next image correctly”, then of course the answer is no, although that is because it’s not even necessarily true. If you mean “prove that the program’s behavior always falls within some well-defined regime”, then yes, we can absolutely prove that sort of statement. Things that are at the boundary of what we can prove (i.e. I don’t know of any existing formal tools that do it, but I’m pretty sure we could make one) would be something like “the weights of the network don’t change too much if we change the input a small amount”, or “the prediction error on the training set will decrease monotonically”, or “the generalization error is bounded, assuming that the test data ends up being drawn from the same distribution as the training data”.
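To make the last of those concrete: for a single fixed classifier evaluated on n i.i.d. held-out examples, Hoeffding’s inequality gives a bound of roughly the following form (a minimal sketch under those assumptions, not a general learning-theory result):

import math

def generalization_bound(observed_error, n, delta=0.01):
    """With probability at least 1 - delta over the draw of n i.i.d. held-out
    examples, the true error of a single fixed classifier is at most
    observed_error plus the Hoeffding slack computed here."""
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    return observed_error + slack

# e.g. 3% error on 10,000 held-out examples gives a bound of roughly 4.5% at 99% confidence.
bound = generalization_bound(0.03, 10000)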
Presumably for friendliness you want something like “P and P’ satisfy some contract relative to each other” (such as sufficiently similar behavior) which is the sort of thing we are already relatively good at proving. Here I’m imagining that P is the “old” version of a program and P’ is a “modified” version of the program that we are considering using.
I understand “trade small local gains for large global gains” as a prescriptive principle, but does it work as a descriptive hypothesis? Why expect academics to be so much better than philanthropists at cause neutrality? When I speak to academics who aren’t also EAs, they are basically never cause neutral, and they even joke about how ridiculously non-cause-neutral everybody in academia is, and how accidental everyone’s choice of focus is, including their own.
I’m not talking about cause neutrality. My point is that even once the general problem has been decided, there are many possible approaches, and academics often do things that seem inefficient but are actually exploring the space of possible approaches (possibly by trying to better understand the objects they are studying).
What level of “general problem” do you have in mind? To a large degree I’m thinking about things like “Gosh, it took (unnecessary) centuries or decades for researchers to launch subfields to study normative uncertainty and intelligence explosion”, and that could be a “lack of cause neutrality” problem. And maybe you’re thinking instead on a smaller scale, and want to say something like “Given that people decide to work on X, they’re relatively efficient in working on X, and exploring the space within X, even if they’re completely missing normative uncertainty and intelligence explosion.”
This is an interesting hypothesis, and one I wasn’t thinking of. But hard to measure!
Out of curiosity, what gives you that impression? I tend to cite it because it is (along with the Lobian cooperation stuff) among the most important results to come out of MIRI’s first couple workshops, not because I can already tell whether it’s an important breakthrough in mathematical logic in general.
As for the purpose and relevance of the Lobian obstacle work, it seems like there might still be a failure of communication there. Since you and Eliezer and I discussed this at length and there still seems to be an unbridged gap, I’m not sure which thing I can say to bridge the gap. Maybe this quote from Paul?
In the OP I actually gave program equilibrium as an example of new theoretical progress that opens up new lines of inquiry, e.g. the modal agents work (though of course there are other pieces contributing to modal agents, too). So yeah, I don’t think the modal agents work is an example of inefficiency.
The examples I gave in the OP of apparent inefficiency in decision theory research were philosophy’s failure to formulate a reliabilist metatheory of instrumental rationality until 2013, even though reliabilist theories of epistemic rationality have been popular since the late 1960s, and the apparently slow uptake of causal Bayes nets in the causal decision theory world.
In this very post you placed it in a list next to normative uncertainty and the intelligence explosion. The implication seemed obvious to me but perhaps it was unintended.
I seem to remember other comments / posts where similar sentiments were either expressed or implied, although a quick search doesn’t turn them up, so perhaps I was wrong.
Yeah, unintended, but I can see why one might infer that.
Does my “philosophical edge” comment imply importance to you? I was merely trying to say that it’s philosophical even though I’m thinking of it in terms of AI, and it’s not obvious to me, like your first example, why one would read the comment as assigning particular importance to the result.
I think that the comment that I quoted is not by itself objectionable to me. If that’s actually the only example I can come up with, then I think it would be unfair to criticize it, so I will update the parent to remove it.