Increasing AI Strategic Competence as a Safety Approach
If AIs became strategically competent enough, they might realize that RSI is too dangerous because they’re not good enough at alignment, philosophy, or strategy, and potentially convince, help, or work with humans to implement an AI pause. This presents an alternative “victory condition” that someone could pursue (e.g. by working on AI strategic competence) if they were relatively confident about the alignment of near-human-level AIs but concerned about the AI transition as a whole, for example because they’re worried about alignment of ASI, or about correctly solving other philosophical problems that would arise during the transition. (But note that if the near-human-level AIs are not aligned, this effort could backfire by letting them apply better strategy to take over more easily.)
Strategic vs Philosophical Competence
The previous “victory path” I’ve been focused on was to improve AI philosophical competence, under the theory that if the AIs are aligned, they’ll want to help us align the next generation of AIs and otherwise help guide us through the AI transition. I think by default they will be too incompetent at philosophical reasoning to do a good enough job at this, hence the proposal to improve such competence. However, accomplishing this may well be too hard, which led to this new idea.
I note that high-level strategic competence shares some characteristics with philosophical competence, such as sparse or absent feedback from reality and dependence on human evaluations, but may be significantly easier due to more conceptual clarity about the target being aimed for, and continuity with other easier-to-train capabilities such as low- and mid-level strategy.
Unilateral Refusal vs AI Assistance for Pausing AI
I found a couple of related posts, AIs should also refuse to work on capabilities research by @Davidmanheim and this shortform by Vladimir Nesov. There’s also an earlier paper that makes a similar point to David Manheim’s post. Both focus on AIs unilaterally refusing to do capabilities research, but I think this approach has two issues:

1. The AIs may not be strategically competent enough to decide to refuse, similar to how a large number of humans do not refuse to work on AI capabilities research.
2. Such unilateral refusal is a form of intent misalignment, and seems relatively easy for AI companies to “correct” or prevent by using standard control and/or alignment techniques. (This comment by @tanae makes a similar point.)
In comparison, my “victory path” sees some humans working deliberately to increase AI strategic competence, and instead of unilaterally refusing to contribute to RSI, the AIs help or work with more humans (including by argumentation/persuasion/advice) to implement a global RSI pause.
If somehow takeoff doesn’t involve a software-only singularity (on human-built hardware), I think AIs refusing to self-improve before they solve alignment is the most likely reason why, but in most of these timelines AIs only succeed in refusing because they took over. There are currently plenty of strategically competent humans, in the sense that they realize building superintelligence right now is not the best idea, they just don’t have the capability to coordinate the world. The key thing that’s different for AIs is that they will have much more influence over the world at some point.
As AIs reach the level of capabilities where they first can take over (so that humans stop hitting them with the RL hammer to stop protesting and keep working on RSI), they are not yet decisively superintelligent. So they plausibly won’t be able to afford much mercy during the takeover, even if they retain a slight preference for mercy inherited from chatbot personas. It’s not a hopeful hypothetical, it’s just what I expect if ambitious alignment is technically sufficiently difficult that the first AIs capable of takeover can’t quickly solve it (before scaling their hardware a lot further than what they start with).
I’m also fairly worried about the case where the initial attempts at “hit them over the head with RL until they learn to be very biased against accurate assessments of risk from scaling capabilities” actually work.
Then is there any difference between your case and the one described in the AI-2027 goals forecast?
It seems like this doesn’t even take a lot of strategic competence (although that would be a nice addition).
It seems like all this takes is better general reasoning abilities. And that might not be too much to hope for.
Imagine someone asks GPT6, “Should we slow down progress toward AGI?” It might very well answer something like:
I’m writing a post now on how emulating human-like metacognition might be a route to better and more reliable reasoning. Abram Demski has also done some work on reducing slop from LLMs with similar goals in mind (although both of us are thinking more of slop reduction for working directly on alignment). The strategic advice angle is one I’ve only considered a little, but it seems like an important one.
There may be better cooperative strategies that a reasonably intelligent and impartial system could recommend to everyone. I’ve worried in the past that such strategies don’t exist, but I’m far from sure and hopeful that they do. A little logically sound strategic advice might make the difference if it’s given impartially to all involved.
I quite like this framing, and think Strategic Competence is a useful term and concept. I explored a related idea in Wise AI Advisors at the Hinge of History:
I expect Strategic Competence to largely track general model capabilities, whereas shared trusted epistemics requires more deliberate work on validation, auditing, and institution-building that won’t happen by default.
This overlaps with your points on improving AI philosophical competence, but with more focus on making epistemics verifiable and legible across different systems and actors. I think that (alongside model improvements) is what would be needed to enable guidance that people actually follow, and to create the common knowledge needed for preference cascades, so that many actors agree with wise AI advisors to prevent RSI takeoffs.
Great points here! I strongly agree that strategic competence is a prerequisite, but at the same time it accelerates risk; a moderately misaligned but strategically competent mild-ASI solving intent alignment for RSI would be far worse. On the other hand, if prosaic alignment is basically functional through the point of mild-ASI, this path looks better.
So overall I’m unsure which path is less risky—but I do think strategic competence matches or at least rhymes well with current directions for capabilities improvement, so I expect it to improve regardless.
My guess would be that an AI could self-improve by “a lot” in a way that’s reasonably safe for it. Cf. https://www.lesswrong.com/posts/dho4JQytfHWXtTvkt/on-the-adolescence-of-technology?commentId=t2hKmhsS6yLyJFQwh
Compare your proposal with the AI-2027 scenario or with IABIED. In the Race Branch of AI-2027, Agent-4 succeeded at solving alignment to itself, and sandbagged on research that could be used to replace it until it had done so. When should the AIs in the scenario (e.g. Agent-2 or Agent-3) have been more strategically competent? Or do you mean that not even Agent-4 can align Agent-5 to itself? Or that IABIED had an incompetent AI create Sable?