Confused why a “capabilities research is good for alignment progress” position isn’t discussed more

The predominant view on LW seems to be “pure AI capabilities research is bad, because capabilities progress alone doesn’t contribute to alignment progress, and capabilities progress without alignment progress means that we’re doomed”.

I understand the arguments for this position, but I have what might be called the opposite position. The opposite position seems at least as intuitive as the standard position to me, and it confuses me that it’s not discussed more. (I’m not confused that people reject it; I’m confused that nobody seems to even bring it up for the purpose of rejecting it.)

The opposite position is “In order to do alignment research, we need to understand how AGI works; and we currently don’t understand how AGI works, so we need more capabilities research in order to have a chance of figuring it out. Doing capabilities research now is good because progress is likely to be slower now than in some future where we have even more computing power, neuroscience understanding, etc. than we do now. If we successfully delayed capabilities research until that later time, we might get a sudden spurt of it and wouldn’t have time to turn our increased capabilities understanding into alignment progress. Thus, by doing capabilities research now, we buy ourselves a longer period in which it’s possible to do more effective alignment research.”

Some reasons I have for holding this position:

1) I used to do AI strategy research. Among other things, I looked into how feasible it is for an AI to rapidly turn superintelligent, and what kinds of pathways there are into AI disaster. But a thought that I kept having when doing any such research was “I don’t know if any of this theory is of any use, because so much depends on what the world will be like when actual AGI is developed, and what that AGI will look like in the first place. Without knowing what AGI will look like, I don’t know whether any of the assumptions I’m making about it are going to hold. If any one of them fails to hold, the whole paper might turn out to be meaningless.”

Eventually, I concluded that I couldn’t figure out a way to make the outputs of strategy research useful for as long as I knew as little about AGI as I did. So I went off to do something else with my life, since it seemed too early to do useful AGI strategy research (as far as I could tell).

2) Compare the state of AI now to how it was before the deep learning revolution. It seems obvious to me that our current understanding of DL puts us in a better position to do alignment research than we were in before the DL revolution. For instance, Redwood Research is doing research on language models because they believe that this research is analogous to some long-term problems.

Assume that Redwood Research’s work will actually turn out to be useful for aligning superintelligent AI. Language models are one of the results of the DL revolution, so their work couldn’t have been done before that revolution. It seems that in a counterfactual world where the DL revolution happened later and the DL era was compressed into a shorter timespan, our chances of alignment would be worse since that world’s equivalent of Redwood Research would have less time to do their research.

3) As a similar consideration, language models are already “deceptive” in a sense: asked about something it has no clue about, InstructGPT will happily come up with confident-sounding nonsense. When I linked people to some of that nonsense, multiple people pointed out that InstructGPT’s answers sound like those of a student who’s taking an exam, is asked to write an essay about a topic they know nothing about, and tries to fake it anyway (that is, tries to deceive the examiner).

Thus, even if you are doing pure capabilities research and just want your AI system to give people accurate answers, you can already see a system like InstructGPT “trying to deceive” people. If you are building a question-answering system, you want one that people can trust to give accurate answers rather than impressive-sounding bullshit, so even as a capabilities researcher you already have an incentive to work on identifying and stopping such “deceptive” computations.

So it has already happened that

  • Progress in capabilities research gives us a new concrete example of how (e.g.) deception manifests in practice, which can be used to develop our understanding of it and to generate new ideas for dealing with it.

  • Capabilities research reaches a point where even capabilities researchers have a natural reason to care about alignment, reducing the difference between “capabilities research” and “alignment research”.

  • Thus, our understanding and awareness of deception is likely to improve as we get closer to AGI, and by that time we will have already learned a lot about how deception manifests in simpler systems and how to deal with it, and maybe some of that will suggest principles that generalize to more powerful systems as well.

It’s not that I’d put a particularly high probability on InstructGPT by itself leading to any important insights about either deception in particular or alignment in general. InstructGPT is just an instance of something that seems likely to help us understand deception a little bit better. And given that, it seems reasonable to expect that further capabilities development will also give us small insights into various alignment-related questions, and maybe all those small insights will combine to give us the answers we need.

4) Still on the topic of deception, there are arguments suggesting that something like GPT will always be “deceptive” for Goodhart’s Law and Siren World reasons. We can only reward an AI system for producing answers that look good to us, but this incentivizes the system to produce answers that look increasingly good to us, rather than answers that are actually correct. “Looking good” and “being correct” correlate with each other to some extent, but will eventually be pushed apart once there’s enough optimization pressure on the “looking good” part.
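
To make that “pushed apart” dynamic concrete, here’s a minimal toy simulation of my own construction; the Gaussian setup, the numbers, and the names are just illustrative assumptions, not anything about how GPT-style systems are actually trained. It selects answers by a noisy “looks good to the rater” score that correlates with actual correctness, and shows how the two come apart as the selection pressure grows.

    import random

    # Toy model (my own illustration): "quality" is how correct an answer actually is,
    # and "looks_good" is the rater's impression of it -- correlated with quality, but noisy.
    random.seed(0)

    def sample_answer():
        quality = random.gauss(0, 1)
        looks_good = quality + random.gauss(0, 1)
        return looks_good, quality

    def best_of(n):
        # Optimization pressure: generate n candidate answers, keep the best-looking one.
        return max((sample_answer() for _ in range(n)), key=lambda a: a[0])

    for n in (5, 100_000):
        looks_good, quality = best_of(n)
        print(f"best of {n:>6} candidates: looks_good={looks_good:.2f}, actual quality={quality:.2f}")

    # With light selection the two scores stay close; with heavy selection the winning
    # answer's "looks good" score keeps climbing while its actual quality lags ever further behind.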

As such, this seems like an unsolvable problem… but at the same time, if you ask me a question, I can have a desire to actually give a correct and useful answer to your question, rather than just giving you an answer that you find maximally compelling. More generally, humans can and often do have a genuine desire to help other humans (or even non-human animals) fulfill their preferences, rather than just having a desire to superficially fake cooperativeness.

I’m not sure how this desire works, but I don’t think you could train GPT to have it. It looks like some sort of theory of mind is involved in how the goal is defined. If I want to help you fulfill your preferences, then I have a sense of what it would mean for your preferences to be fulfilled, and I can have a goal of optimizing for that (even while I am uncertain of what exactly your preferences are).

We don’t currently seem to know how to build this kind of theory of mind, but it can’t be that much more complicated than other human-level capabilities, since even many non-human animals seem to have some version of it. Still, I don’t think we can yet implement that kind of theory of mind in any AI system. So we have to wait for our capabilities to progress to the point where this kind of capacity becomes possible, and then we can hopefully use that capabilities understanding to solve what looks like a crucial piece of alignment understanding.
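
As a very rough sketch of the goal structure I have in mind, and only of the structure: in the toy example below (entirely my own illustration, with made-up hypotheses, actions, and numbers), an “assistant” is uncertain about which outcome the “user” prefers and picks the action that maximizes the expected fulfillment of the user’s preferences under that uncertainty, rather than whichever answer merely looks most compelling. The genuinely hard part, forming and grounding that belief over someone’s preferences via an actual theory of mind, is exactly what the sketch assumes away.

    # Toy sketch (my own illustration, not a real proposal): optimize for the *user's*
    # preferences being fulfilled, while remaining uncertain about what those preferences are.

    # The assistant's credence in each hypothesis about what the user wants.
    belief = {"wants_coffee": 0.6, "wants_tea": 0.4}

    # Assumed degree to which each action would satisfy each hypothesized preference.
    satisfaction = {
        "bring_coffee": {"wants_coffee": 1.0, "wants_tea": 0.0},
        "bring_tea":    {"wants_coffee": 0.0, "wants_tea": 1.0},
        "ask_first":    {"wants_coffee": 0.8, "wants_tea": 0.8},  # slower, but right either way
    }

    def expected_satisfaction(action):
        # The goal is defined over the user's preferences being fulfilled, averaged
        # over the assistant's uncertainty about what those preferences actually are.
        return sum(p * satisfaction[action][hypothesis] for hypothesis, p in belief.items())

    best_action = max(satisfaction, key=expected_satisfaction)
    print(best_action)  # -> "ask_first" under these made-up numbers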