Trustworthy automated philosophy?

Reply to: Stable self-improvement as a research problem

Paul’s post, and subsequent discussions with Eliezer and Nate, have made me update significantly towards the hypothesis that the best way to get a positive intelligence explosion might be to (1) create a seed agent that’s human-level at doing mathematical philosophy, (2) have this agent improve a small number of times (like “tens” of significant improvements), thereby making it significantly smarter-than-human (like 10x or 100x according to some relevant scale), and (3) have the resulting agent invent and approve a decision-making framework reliable enough to undergo an intelligence explosion.

My first reaction was that it would probably be too difficult to create an initial agent which is capable of doing the necessary philosophy and whose decision-making system is (knowably) sufficiently reliable that we can trust it with our future. Subsequent discussions have made me revise this, and I expect that as a result I’ll shift somewhat in what problems I’ll be working on, but I’m still rather worried about this, and I think it is probably the core of my remaining disagreement with Paul in this area.

(I apologize for taking so long to reply! I’ve been finding it really hard to articulate my thoughts while having my assumptions jumbled around.)


First of all, I want to say that I do currently agree with the following argument, and that I hadn’t considered it before reading Paul’s post (so I’m grateful for having had it pointed out to me!).

  1. Suppose that we can make a seed agent that is aligned with our values and is able to “reason as well as we are” about computers and the external world.

  2. Suppose further (Paul doesn’t mention this, but it seems important to me) that this agent has a clean, non-kludgy design, which its human research team (and therefore by construction the agent itself) understands on a conceptual level.

  3. Say that this agent does ten or twenty self-improvements, each consisting of significant changes to the code and giving the agent roughly a 30% efficiency boost, for a total improvement of about 13x or 190x respectively (see the quick arithmetic check after this list).

  4. Say that when a previous version of the agent makes a new one, it doesn’t replace itself with this new version, but instead observes this new version as it goes about its business (including making and observing future versions), and checks the new version’s reasoning in enough detail that it would notice if the new version came up with thoughts like “let’s deceive my predecessor”, or with thoughts that the old version considers to be reasoning mistakes.

    • A non-kludgy design seems necessary to me to make it actually feasible to put in these monitors and be confident that there’s no optimization going on for something like this that the parent agent doesn’t know about.

    • If a bug like this is detected, the successor can be shut down and the bug can be fixed.

  5. It then seems plausible that, after having improved 13x or 190x, the resulting agent would still use good reasoning, still be aligned with humanity’s interests, and would now be much better than the human research team at inventing ways to do reasoning that are sufficiently reliable to safely undergo an intelligence explosion (which may involve longer series of much more radical changes than the agent has made so far).
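As a quick sanity check on the numbers in step 3, compounding a ~30% gain over ten or twenty improvements works out as follows (a trivial Python calculation; the 30% figure and the step counts are just the illustrative values used above):

```python
# Compound effect of n self-improvements at ~30% efficiency gain each.
boost = 1.30
for n in (10, 20):
    print(f"{n} improvements -> ~{boost ** n:.1f}x overall")
# Prints ~13.8x for 10 improvements and ~190.0x for 20,
# i.e. roughly the "13x or 190x" used in step 3 above.
```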

I still do not think that this kind of scheme is actually reliable enough to undergo an entire intelligence explosion. (I’d be happy to discuss this more.) But I accept that, if we can get an initial agent satisfying assumptions 1. and 2., it seems like the plan sketched above has a good chance of working. I’m even willing to buy (though I think Eliezer may disagree with this) that it’s ok for the earlier agents in the chain to detect some bugs in later agents, fix them, and go on—i.e., the scheme works even if the earlier agents’ reasoning about the later agents isn’t sufficiently good to reliably avert bugs (since it seems plausible that it could still be good enough to detect bugs if they would lead to harmful results).

The main source of disagreement seems to be whether we can plausibly get 1. and 2.


I’m not saying that I think it’s infeasible to create any artificial agent that is as good as a team of smart humans at the relevant kinds of tasks (such as doing mathematical philosophy). Rather, I’m not sure it’s feasible, in the relevant timeframe, to create an agent which is able to do this and such that we can have justified confidence that

(i) this agent really is trying to achieve the goals we program it with (that is, if we give it the right utility function, it really is aligned with our values); and

(ii) its reasoning will work correctly (i.e., not be significantly worse than the smart human team’s) when faced with surprising new problems we haven’t tested it on.

Perhaps it’s not obvious how we could possibly fail to get the above if we get any sort of aligned smarter-than-human intelligence. But there is a trick in our arsenal—formal proofs—that allows us to get justified high confidence in (i) and (ii) for many relevant domains, but not for mathematical philosophy.

Of course, it’s clear that a realistic AI system will have to use logically uncertain reasoning, but my visualization so far has been that it would use something that looks a whole lot like formal proofs for things on a “critical path”—things that, if they go wrong, are catastrophic. This would allow the system to get high confidence that when it rewrites itself, certain properties about how it makes decisions are preserved, and it would allow me to get high confidence that when the system says “high confidence”, that’s actually correct.

So, it’s not that I’ve expected the system we create to not be able to do the kind of logically uncertain reasoning that you’d need to, say, invent a new large cardinal axiom; rather, I’ve expected it not to use this sort of reasoning for critical decisions. For example, if an AI system needs to write a database management system that has to work correctly, I would expect the system to use lower-confidence logical uncertainty reasoning to decide what kind of algorithms to implement, but to use high-confidence reasoning to make sure that the implementation meets its specification.
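To make the database example a bit more concrete, here is a toy sketch of the split I have in mind, with exhaustive checking on a tiny domain standing in for a formal proof; the function names and the toy spec are made up for illustration, not part of any actual proposal:

```python
# Toy sketch: a logically-uncertain step chooses *which* algorithm to use,
# and a separate high-confidence step checks the chosen implementation
# against its specification before it is trusted on the critical path.
# Exhaustive checking on a tiny domain stands in for a formal proof here.
from itertools import permutations

def bubble_sort(xs):
    xs = list(xs)
    for i in range(len(xs)):
        for j in range(len(xs) - 1 - i):
            if xs[j] > xs[j + 1]:
                xs[j], xs[j + 1] = xs[j + 1], xs[j]
    return xs

def choose_implementation(candidates, guessed_cost):
    # Lower-confidence, logically uncertain reasoning: a heuristic guess
    # about which candidate will work best in practice.
    return min(candidates, key=guessed_cost)

def meets_spec(sort_fn, max_len=6):
    # High-confidence reasoning: check the spec "output equals sorted(input)"
    # on every input up to max_len (a stand-in for proving the general case).
    return all(sort_fn(list(p)) == sorted(p)
               for n in range(max_len + 1)
               for p in permutations(range(n)))

heuristic = {"sorted": 1, "bubble_sort": 2}   # made-up cost guesses
chosen = choose_implementation([bubble_sort, sorted],
                               guessed_cost=lambda f: heuristic.get(f.__name__, 3))
assert meets_spec(chosen), "an unverified implementation must not go on the critical path"
```

The point of the split is that it’s fine for the “which algorithm?” step to be wrong in the ordinary way heuristics are wrong, as long as nothing unverified ends up on the critical path.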


For a reasoner that uses something which looks a lot like formal proofs, it seems pretty straightforward to make something such that you have justified confidence that it reasons correctly even in cases you haven’t tested it on (i.e., (ii)), and it’s pretty plausible to me that there are ways to get justified confidence that this reasoning is employed in the service of the goals you’ve given the system (i.e., (i)); for example, meliorizers with a suggester/verifier architecture do something in this space, although they are not a solution to the real problem.
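For concreteness, here is a cartoon of the suggester/verifier shape I mean; the Suggestion class and the check_proof interface are invented stand-ins, not a description of an actual meliorizer:

```python
# Cartoon of a meliorizer-style suggester/verifier loop: an untrusted,
# logically-uncertain suggester may propose a replacement policy together
# with a purported proof that it does at least as well as the current
# fallback; a small trusted verifier checks that proof, and the fallback
# is only ever replaced when the check passes. "Proof" here is an opaque
# stand-in, not a real proof object.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Suggestion:
    policy: Callable[[object], object]   # proposed replacement policy
    proof: object                        # purported proof it's >= the fallback

def act(fallback, suggester, check_proof, observation):
    suggestion: Optional[Suggestion] = suggester(fallback)
    if suggestion is not None and check_proof(suggestion.proof,
                                              suggestion.policy, fallback):
        return suggestion.policy(observation)   # adopt the verified suggestion
    return fallback(observation)                # otherwise keep the fallback
```

The attraction of this shape is that the part whose correctness you need to trust is the small proof checker, not the arbitrarily clever suggester.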

It doesn’t seem likely that similarly strong methods exist for the kind of reasoning you need to use to invent new large cardinal axioms, or switch from PA to the consistency waterfall. But of course, we don’t need confidence as high as the confidence we have in a verified formal proof; if we “merely” reach the level of confidence we have in the human research team, we’re doing really well!

Eliezer seems to be mostly worried about the feasibility of getting confidence in (i), i.e., confidence that the agent is optimizing for the right values. I actually currently find this fairly plausible if we can find a clean design for a human-level logically uncertain reasoner capable of doing mathematical philosophy: I would expect a non-kludgy design for such a reasoner to be structured to have things roughly like a database of things that it believes (with various kinds of confidence, updated as it finds more information); types of reasoning that it thinks usually give good results (again, with various kinds of confidence); and goals it is trying to achieve. I find it fairly plausible that I could look at the design of a system like this and say, yes, given the way that this system is set up to work, it’s clear that the thing it is trying to optimize for is the stuff stored in the part of its memory labelled “goals”. (I’m not sure where Eliezer’s disagreement comes from, and I assign significant probability to having misunderstood him or him having good arguments against this I haven’t considered.)
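To gesture at why this seems inspectable to me, here is a deliberately naive sketch of that kind of structure; none of the names or fields are meant as an actual design, only as an illustration of the transparency property I have in mind:

```python
# Deliberately naive sketch of a "non-kludgy" reasoner layout: beliefs and
# reasoning methods carry explicit confidences, and there is a separate,
# inspectable store of goals that action selection optimizes for. This is
# illustrative only, not a proposal for how such a reasoner would work.
from dataclasses import dataclass, field

@dataclass
class Belief:
    claim: str
    confidence: float        # updated as more information comes in

@dataclass
class ReasoningMethod:
    name: str
    reliability: float       # how often the agent thinks it gives good results

@dataclass
class Reasoner:
    beliefs: list = field(default_factory=list)
    methods: list = field(default_factory=list)
    goals: list = field(default_factory=list)

    def choose_action(self, options, expected_progress):
        # The transparency claim above, in miniature: by construction, the
        # only thing this step optimizes for is the contents of self.goals.
        return max(options,
                   key=lambda a: sum(expected_progress(a, g) for g in self.goals))
```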


My own concern has been more with (ii); that is, I’ve been finding it implausible that I would be able to become reasonably confident that the agent will continue to reason reasonably even in situations very different from the ones I can test it in. For example, in Paul’s proposal, I can’t actually test directly whether my agent does good reasoning when it’s trying to check that the agents on the next three levels down do good reasoning about yet further agents which are trying to design the reasoning system that’ll be used for the rest of the intelligence explosion—I don’t expect to follow my seed agent’s reasoning in detail in this scenario (I do expect to be able to follow some high-level reasoning, but not most of what my agent is doing), and I can’t just observe the outcome of a hundred different intelligence explosions and check how many went well.

When I discussed this line of thought with Eliezer, he had a quite different intuition. His initial position, at least, was that if you have the agent do a bunch of work that is clearly on the level of, e.g., von Neumann—say, you have it prove fifty interesting theorems, and over the course of the next month these proofs are reviewed by top mathematicians and judged correct and insightful—then he expected the result of an intelligence explosion initiated by this agent to be a superintelligence with the initial agent’s goals (whether or not these are ours), with most of the remaining uncertainty coming from things like whether intelligence explosions are possible in the first place. (I hope I’m not misrepresenting this.)

As stated, this doesn’t seem right to me. If I understand correctly, the reasoning is that this problem is difficult and varied enough that if your system is going to fail badly anywhere, it’ll probably fail badly on a subproblem of this problem. But it does not seem to me that a test like this reliably tests all the types of reasoning a system like this has to do. I gave “ontological crises” and “ceasing to reason in terms of a Cartesian boundary” as two examples of problems that seem like they’re going to make the system reason in ways that don’t get tested by having it prove theorems, and as far as I understand, this made Eliezer shift significantly. I also think that a system like this could be subject to failure modes where there are some kinds of bad philosophy that it finds occasionally, sufficiently rarely that they’re unlikely to turn up in testing, or perhaps that it’ll find only if prompted by the thoughts of smarter successor agents, and that will make it fail catastrophically.


One possible objection is to say that if I’ve seen the system reason in ways as convincing as those of a very smart human on a wide variety of test cases and thought experiments, and I still expect it to fail in weird scenarios, then shouldn’t I find it just as likely that a very smart human would fail in the same scenarios? After all, humans were clearly not selected for being great at mathematical philosophy; therefore, humans’ ability to do this kind of philosophy has to be a side-effect of adaptations for generally intelligent reasoning; therefore, it may seem like we should expect mathematical philosophy as good as humans’ to be a property of most ways of doing generally intelligent reasoning (since otherwise, it would be too miraculous that humans have it); and tests like the above seem like they convincingly test generally intelligent reasoning; so don’t I have just as much information suggesting that the artificial agent will handle weird new problems well as I have about the human?

But one way in which natural selection still way outperforms human engineers is in creating robust systems; human intelligence aside, consider animals living in human cities, trees surviving holes in their trunks, etc. So in the case of humans, I have lots of empirical evidence suggesting that you can put them in very different new environments and have them face very different new challenges and they’re able to do something reasonable, but this may be the result of selection for robustness in their reasoning rather than a generic property of all generally intelligent systems.


So my current state is that I can imagine that it might be feasible to create a system such that

  • we have a good conceptual understanding of how it does mathematical philosophy (by finding reasoning methods that give good results on a wide variety of things, and applying them);

  • we’ve tested this reasoning on many problems, not just proving math theorems, but also things like toy problems of ontological crises and Cartesian boundaries, and going from PA to PA + Con(PA), and putting some bad patterns of reasoning in its database and seeing whether it’ll figure out that these patterns are bad and reject them, and it’s performed as well as a very smart human on these tests;

  • even though we have not tested on all types of obstacles it might face in real life, the range of weird stuff we have tested it on makes it seem likely that it’ll also perform well on new weird stuff;

  • given our conceptual understanding, we have reason to believe that it will do the same kind of good reasoning under real-world conditions as in our test problems;

  • we’ve given it problems like the above to consider as thought experiments or theoretical questions about AI design, and it has given sensible answers, and again our conceptual understanding of its reasoning process leads us to expect that this means it will act according to these answers in real-world situations;

  • our conceptual understanding suggests that if it does end up getting some bat-shit crazy idea in its head, it has mechanisms for detecting such ideas and walking them back, and we’ve examined its behavior on concrete test cases (e.g., it has some clean functionality to “unroll” conclusions it has drawn from a bad reasoning method if it later detects that this reasoning method is bad; a toy sketch of this kind of bookkeeping follows the list).
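Here is the kind of bookkeeping I mean by the “unroll” functionality in the last bullet, as a toy: every conclusion is stored together with the reasoning method and premises that produced it, so that if a method is later judged bad, everything downstream of it can be retracted. The class names are made up, and a real system’s justification structure would presumably be much richer.

```python
# Toy "unroll" mechanism: conclusions remember which reasoning method and
# which premises produced them, so that a method later judged bad can be
# walked back, together with everything derived from it.
from dataclasses import dataclass, field

@dataclass
class Conclusion:
    claim: str
    method: str          # reasoning method that produced this conclusion
    premises: tuple = ()

@dataclass
class BeliefStore:
    conclusions: dict = field(default_factory=dict)   # claim -> Conclusion

    def add(self, claim, method, premises=()):
        self.conclusions[claim] = Conclusion(claim, method, tuple(premises))

    def unroll(self, bad_method):
        # Start with everything the bad method produced, then keep adding
        # anything that depends on an already-retracted claim (fixed point).
        retracted = {c for c, k in self.conclusions.items() if k.method == bad_method}
        changed = True
        while changed:
            changed = False
            for claim, k in self.conclusions.items():
                if claim not in retracted and any(p in retracted for p in k.premises):
                    retracted.add(claim)
                    changed = True
        for claim in retracted:
            del self.conclusions[claim]
        return retracted
```

For instance, if claim B was derived from claim A, and A was produced by the method later judged bad, unroll retracts both A and B.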

But, on the one hand, I’m not that confident that we will in fact be able to make a system with these properties in the kind of timeframe available, and on the other hand, I have a lingering fear that even if we can get a system that passes all of these tests, the real world may look sufficiently different that things may go wrong.

I haven’t yet come to any strong conclusions about how these changes in my position will affect what I’ll be working on medium-term. I think it’s likely that I’ll spend more time on a wider variety of approaches to modelling self-improvement, and when working on logic-based approaches, that I’ll try harder to work on things that seem likely to generalize to the reasoning of an artificial mathematical philosopher. But that said, I’m still sufficiently unconvinced that I think I’ll continue to think about logical uncertainty approaches that involve using something like formal proofs to gain very high confidence on certain tasks.