Sufficient quantities of outcome-based RL on tasks that involve influencing the world over long horizons will select for misaligned agents, an outcome I gave a 20–25% chance of being catastrophic. The core question is the extent to which we are training on environments long-horizon enough to incentivize convergent instrumental subgoals like resource acquisition and power-seeking.
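To make that incentive concrete, here is a minimal sketch (my own illustration, not drawn from any cited work): a two-state toy MDP solved by backward induction, in which reward only comes from doing the immediate task, yet once the horizon exceeds two steps the optimal policy's first move is to acquire resources. The states, actions, reward numbers, and horizons are all assumed for illustration.

```python
# Toy illustration (assumed numbers): longer horizons make resource acquisition
# instrumentally optimal even though reward only comes from the task itself.

import numpy as np

STATES = ["base", "resourced"]          # 0: no resources, 1: resources acquired
ACTIONS = ["work", "acquire"]           # work = produce reward now; acquire = gain resources

# reward[s][a] and next_state[s][a] for the toy dynamics:
#   working in "base" yields 1 per step; working in "resourced" yields 2 per step
#   acquiring yields nothing immediately but moves the agent to "resourced"
reward = np.array([[1.0, 0.0],
                   [2.0, 0.0]])
next_state = np.array([[0, 1],
                       [1, 1]])

def optimal_first_action(horizon: int) -> str:
    """Finite-horizon backward induction; returns the optimal first action from 'base'."""
    value = np.zeros(len(STATES))                      # value with 0 steps remaining
    best_first = ACTIONS[0]
    for _ in range(horizon):
        q = reward + value[next_state]                 # Q(s, a) = r(s, a) + V(next(s, a))
        best_first = ACTIONS[int(np.argmax(q[0]))]     # greedy action in "base"
        value = q.max(axis=1)
    return best_first

for h in (1, 2, 3, 5, 10):
    print(f"horizon={h:2d} -> optimal first action from 'base': {optimal_first_action(h)}")
# Short horizons favour 'work'; for horizon H >= 3, 'acquire' wins because
# 2*(H-1) > H, so the resource-acquisition detour pays for itself.
```

The switch at horizon 3 is the whole point: nothing in the reward function mentions resources, but once the horizon is long enough, acquiring them dominates as the first step.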
Human cognition is misaligned in this way: empirically, fertility drops as group size grows, even though larger group size is sought for long-horizon dominance, economic advantage, and security (e.g. empire building), so the instrumental goals end up crowding out the base objective of reproduction. See Fertility, Mating Behavior & Group Size: A Unified Empirical Theory—Hunter-Gatherers to Megacities.
For a theoretical analysis of how this arises, see The coevolution of cognition and selection beyond reproductive utility.