First probable crux: at this point, I think one of my biggest cruxes with a lot of people is that I expect the capability level required to wipe out humanity, or at least permanently de facto disempower humanity, is not that high. I expect that an AI which is to a +3sd intelligence human as a +3sd intelligence human is to a −2sd intelligence human would probably suffice, assuming copying the AI is much cheaper than building it.
This sounds roughly right to me, but I don’t see why this matters to our disagreement?
For example, think about how humans most often deceive other humans: we do it mainly by deceiving ourselves, reframing our experiences and actions in ways which make us look good and then presenting that picture to others. Or, we instinctively behave in more prosocial ways when people are watching than when not, even without explicitly thinking about it. That’s the sort of thing I expect to happen in AI, especially if we explicitly train with something like RLHF (and even more so if we pass a gradient back through deception-detecting interpretability tools).
This also sounds plausible to me (though it isn’t clear to me how exactly doom happens). For me the relevant question is “could we reasonably hope to notice the bad things by analyzing the AI and extracting its knowledge”, and I think the answer is still yes.
I maybe want to stop saying “explicitly thinking about it” (which brings up associations of conscious vs subconscious thought, and makes it sound like I only mean that “conscious thoughts” have deception in them) and instead say that “the AI system at some point computes some form of ‘reason’ that the deceptive action would be better than the non-deceptive action, and this then leads further computation to take the deceptive action instead of the non-deceptive action”.
I don’t quite agree with that as literally stated; a huge part of intelligence is finding heuristics which allow a system to avoid computing anything about worse actions in the first place (i.e. just ruling worse actions out of the search space). So it may not actually compute anything about a non-deceptive action.
But unless that distinction is central to what you’re trying to point to here, I think I basically agree with what you’re gesturing at.
But unless that distinction is central to what you’re trying to point to here
Yeah, I don’t think it’s central (and I agree that heuristics that rule out parts of the search space are very useful and we should expect them to arise).
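The distinction above can be made concrete with a toy sketch (hypothetical action names and payoffs, chosen purely for illustration): one search explicitly evaluates every action, deceptive ones included, while a second search applies a heuristic filter up front, so the deceptive action is ruled out of the search space and never gets a value computed for it at all.

```python
# Toy illustration of the search-space-pruning point from the dialogue.
# Action names and payoff numbers are invented for this example.

ACTIONS = ["report_honestly", "deceive", "stay_silent"]

def value(action):
    # Hypothetical payoffs: deception scores highest if it is ever evaluated.
    return {"report_honestly": 1.0, "deceive": 2.0, "stay_silent": 0.5}[action]

def explicit_search():
    # Computes a value for every action, deceptive ones included, then picks
    # the best: deception is explicitly "considered" and wins.
    return max(ACTIONS, key=value)

def pruned_search(allowed):
    # A heuristic filter removes actions from the search space before any
    # evaluation; pruned actions never have a value computed for them.
    return max((a for a in ACTIONS if allowed(a)), key=value)

print(explicit_search())                        # -> "deceive"
print(pruned_search(lambda a: a != "deceive"))  # -> "report_honestly"
```

The second search never computes anything about the deceptive action, which is the sense in which a system can "avoid computing anything about worse actions in the first place" (and symmetrically, a system whose heuristics prune the *honest* actions would take the deceptive one without ever weighing the alternative).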