A rejection of the Orthogonality Thesis


I shared a blog post I wrote in Astral Codex, and someone suggested I share it with the rationalist community (though it’s somewhat polemical in tone). So here goes! The blog post is a little wider in scope, but I think the relevant part for the rationalist community is the rejection of the orthogonality thesis (OT). The TL;DR is that the orthogonality thesis is often presented as a fact, but it seems to me that it’s mostly a series of assertions, namely:

  1. There is a large mind design space. Do we have any actual reason to think so? Sure, one can argue that everything has a large design space, but in practice there is often a single underlying mechanism for how things work.

  2. Ethics is not an emergent property of intelligence. But again, that’s just an assertion; there’s no reason to believe or disbelieve it. It’s possible that self-reflection (and hence ethics, and the ability to question one’s goals and motivations) is a prerequisite for general cognition. We don’t know whether this is true, because we don’t really understand intelligence yet.

  3. The previous two are assertions that could be true, but reflective stability is definitely not true: it’s paradoxical. To quote from my post:

This line of reasoning is absurd: it assumes an agent knows in advance the precise effects of self-improvement, but that’s not how learning works! If you knew exactly how an alteration in your understanding of the world would impact you, you wouldn’t need the alteration: to be able to make that judgement, you’d have to be able to reason as though you had already undergone it. (Of course, you can predict some of the effects of a particular course of study or self-improvement: for instance, you know that if you take a course in accounting, you’ll become better at reading financial statements. But you have no idea what other effects the course will have on your worldview; it might even cause you to hate reading financial statements. If you could think exactly as you would after the course, you wouldn’t need the course.)

So if the argument OT proponents are making is that an AI will not self-improve out of fear of jeopardising its commitment to its original goal, then the entire thesis is moot, because the AI will never risk self-improving at all.

(To tackle the Gandhi analogy head on: obviously Gandhi wouldn’t take a pill if it were sold to him as ‘if you take this, you’ll start killing people’. But if he were told ‘this pill will lead to enlightenment’, and it turned out that an enlightened being is OK with murder, then he’d have to take it; otherwise, he’d be admitting that his injunction against murder is not enlightened. And ultimately, Gandhi’s agenda wasn’t simply non-violence: that was one aspect of a wider worldview and philosophy. To be logically consistent, AI doomers would need to argue that Gandhi wouldn’t dare read anything new, for fear that it might change his worldview.)

All this is not to suggest we shouldn’t take AI risk seriously, or that we shouldn’t proactively research alignment &c. But it strikes me as dogmatic to proclaim that doom is certain, and that orthogonality is a ‘fact’.