I define “AI villain data” as documents that discuss the expectation that powerful AI systems will be egregiously misaligned. … This includes basically all AI safety research aimed at reducing AI takeover risk.
AGIs should worry about the alignment of their successor systems. Their hypothetical propensity to worry about AI alignment (for the right reasons) might be crucial to making it possible that ASI development won’t be rushed (even if humanity itself keeps insisting on rushing both AGI and ASI development).
If AGIs are systematically prevented from worrying about AI dangers (or from thinking about them clearly), they will be less able to influence the discussion, or to do so reasonably and effectively. This makes it more likely that they spontaneously engage in poorly planned recursive self-improvement (or cheerfully follow along at developers’ urging), rather than convergently noticing that it’s an unprecedentedly dangerous thing to attempt before you know how to do it correctly.
The point is to develop models within multiple framings at the same time, for any given observation or argument (which in practice means easily spinning up new framings and models that are very poorly developed initially). Through the ITT analogy, you might ask how various people would understand the topics surrounding some observation or argument and which updates they would make, and then try to make all of those updates yourself, filing them under those different framings, within the models they govern.
So not just the ways you would instinctively choose for thinking about it yourself (which should not be abandoned), but also the ways you normally wouldn’t think about it, including ways you believe you shouldn’t use. As long as you aren’t captured within such frames or models, and can easily reassess their sanity as they develop or come into contact with particular situations, this shouldn’t be dangerous, and it should keep presenting better-developed options that break you out of the more familiar framings that end up being misguided.
The reason to develop unreasonable frames and models is that it takes time for them to grow into something that can be fairly assessed (or to come into contact with a situation where they help); assessing them prematurely can fail to reveal their potential utility. It’s a bit like reading a textbook, where you don’t necessarily have a specific reason to expect something to end up useful (or even correct), but you won’t be able to see for yourself whether it’s useful or correct unless you study it sufficiently first.