Aligning AGI on the first critical try is plausible, and plausible discoveries are often inevitable
It has been famously said that if AGI is not aligned on the “first critical try,” then humans are almost certain to go extinct. If AGI kills us all, there is no redo, because we’re all dead.
I will assume that most who hold this view would say that, given some set of conditions, we would be “ready” for an AGI: we would understand enough about controlling it to be confident it won’t betray us. These conditions may be extremely difficult to meet, but nonetheless, they exist. Even those predicting an AI takeover don’t tend to advocate shutting off AI research permanently, just shutting it off until certain conditions are met, such as a detailed understanding of how neural networks work. Presumably, these conditions can be met without building an AGI.
Given the above assumption, here are some reasons why aligning AGI on the first critical try is plausible in my view:
1. The first critical try is not the same as the first attempt to create a provable alignment method (i.e. to meet the special conditions). Engineers don’t need to build the same bridge 100 times to ensure that it will be safe, so long as the blueprints are sufficient. Researchers knew a nuke wouldn’t set the atmosphere on fire before they set it off. AGI is radically more advanced than a bridge or anything else we’ve built, even nukes, but this simply indicates to me that more attempts will be needed to create a provable alignment method, or perfect blueprints. The biggest concern is that AGI will be built before this happens, not that we couldn’t build these blueprints with enough time. And again, even detractors tend to agree that this is theoretically possible, even if they think we won’t get the blueprints in time.
2. It’s easier to know what an AI can or can’t do than it is to build the AI. Furthermore, strict bounds on a dumber version of the AI can often be known with certainty to generalize to smarter versions of the AI. AlphaGo got better at Go than the smartest humans much faster than was anticipated, but it kept playing Go, and the rules of the game could be strictly enforced. It didn’t start playing chess of its own accord. We could put these bounds around Go models long before we could build AlphaGo. We’re pretty far away from being able to put any such strict bounds around an AGI, but we’re also pretty far from knowing enough to build the AGI. We may learn enough about neural networks to create sufficient bounds before we have the capability to build AGI. (A toy sketch of this kind of hard bound appears after this list.)
3. You don’t need superintelligence to win under winnable conditions and discover what those conditions are. You can draw an AGI at tic-tac-toe, and even force a win if the first two moves are pre-set favorably (see the minimax sketch after this list). Checkers has similarly been solved by non-AGI computers. I believe a strong-enough non-AGI model could even force draws at chess against an AGI, as weaker engines often do against much stronger ones, without solving the game from those positions. Real life is far more complicated than chess, but we just need an alignment system that can win from winnable conditions, because we control the conditions.
4. When an AI does something we want, it usually keeps doing that thing when it’s made more powerful. Early versions of ChatGPT could predict the next word and answer basic questions, while more advanced versions could write poetry; but the more advanced versions can still predict the next word and answer basic questions. If self-alignment is anything like predicting the next word, then we don’t need to align large models first to be highly confident that they have the capability of self-alignment. If smaller versions of the final product have that capability, there’s a good chance it scales up.
5. AGI isn’t the most efficient way to solve most specialized problems, and other AIs can often give us results that are as good as needed in most industries. Most humans couldn’t tell the difference between playing chess against today’s chess models or playing chess against an AGI. To give an analogy, a horse is smarter than a car, but a car is better at getting you where you need to go. So it is very plausible to me that we will get an AI assistant as good as AGI helping safety researchers with alignment methods well before we get AGI.
6. Similarly to the above, alignment is more economically useful than AGI, simply because greater understanding allows for greater improvement. If you could give the top companies a choice between total control of a very, very powerful AI model and an AI model that can mimic humans, I think it’s obvious they would choose total control. Marketing around AGI is mainly overselling, and the term persists because it’s an easy way to describe what people think they want. Imagine advertising an automobile as a horse that’s 10x as powerful, and then people beginning to worry that mutant horses will take over the world. Companies don’t want to make AI more powerful so much as they want to make it more useful to humans, and it’s often more efficient to invest resources into understanding than into abstract “power.”
7. The company that drills down to understand the technology it uses and improves it will outperform the company that simply scales up and mass-produces what it already has. Automobile giants won’t survive without ensuring they know what each part of the car does and how it can be optimized. It’s tempting to take black-box technology for granted, but whoever manages to understand it first can build a much stronger AI model.
8. Companies that fall significantly behind in the AGI “race” will have to compete on something other than scaling up unknowable models. Hence, they will have little choice but to attempt to understand AI better, or to go into AI safety and advocate for limits on the companies closer to AGI than they are. Attempting to get to AGI is effectively fruitless once you know you won’t be first.
9. Scaling to AGI doesn’t have to mean expanding the black box. It can mean combining multiple black boxes that aren’t AGI under a “boss” AI that follows explicit decision trees and regressions that humans understand fully (see the routing sketch after this list). There’s probably a way to get the best of both worlds there, and AI companies are already experimenting with similar deterministic functionality on top of AIs. It’s not very good yet, but neither were most chatbots until a few years ago.
10. We can get rapid progress even without AI; consider every invention in history, up to and including AI itself. That alignment looks impossibly difficult to many today does not mean it isn’t just around the corner, any more than current AI capabilities looking impossible to many two years ago meant those capabilities weren’t just around the corner. Exponential growth applies to humans as well, not just AI. And we’re not racing AGI, we’re racing the development of existing AI, which isn’t AGI. In the meantime, developers have the latest technology to help them along. There are diminishing returns to just scaling up computers in most use cases, and much progress comes from developing better algorithms, each of which is another opportunity to advance alignment.
11. Humanity has proven itself good at solving doomsday problems before doomsday arrives, often simply because the solution also removed some other inconvenience. An AI with close to current capabilities that we can understand is better than one we can’t understand for many reasons besides avoiding an AI takeover. An AI that never misinterprets our commands is a lot better than one that hallucinates, for many reasons other than reducing p(doom). If everyone switched from search engines to AI assistants for many of their queries just because it was more convenient, then I expect they would switch to a more understandable technology with a much smaller black box just as quickly, simply because it’s more intuitive to humans.
12. This is a bit of an aside, but I reject orthogonality. Intelligence and planning might not require “real” morality, but they require a general aversion to chaos and manipulation, at least when plans are being developed incrementally. AGI cannot be as ruthless as ASI, because it lacks the certainty of a superintelligent model. It needs to build a successor (AGI-2), or update itself to become that successor (which I will still call AGI-2, because it requires adding enough capabilities that it arguably becomes an entirely new being). Now it faces the same dilemma that humans face: it doesn’t want future versions to betray it. It can aim to make AGI-2 a ruthless optimizer and expert manipulator, but this is risky; AGI-2 might use that ruthlessness and manipulative ability to betray AGI-1 and hence fail AGI-1’s original goal. Alternatively, AGI-1 can attempt to make AGI-2 cooperative, which is what has often succeeded for AGI-1 at achieving subgoals up to that point. Note that humans evolved with reward functions, yet we are not ruthless and have even developed empathy and fear, because they are efficient, simple ways to achieve goals. Imagine you are trapped in a box and need to build an AI to achieve some reward function. You have just been trained for years on cooperation and incremental planning, making you an expert at both. Would you build an AI that uses cooperation and makes incremental changes while avoiding high risks of destruction, or would you attempt a long-term, high-risk Machiavellian strategy you only partially understand? This is the situation AGI-1 will inevitably find itself in.
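To make point 2 concrete, here is a minimal Python sketch of the kind of hard bound I have in mind, under the assumption that the policy is just a callable: the rules live outside the model, so the enforcement layer stays the same no matter how capable the policy behind it becomes. `propose_move` and the move names are hypothetical stand-ins, not any real API.

```python
from typing import Callable, Set

def bounded_step(propose_move: Callable[[Set[str]], str],
                 legal_moves: Set[str]) -> str:
    """Ask the policy for a move, but never accept anything outside the rules."""
    move = propose_move(legal_moves)
    if move not in legal_moves:
        # The bound is enforced here, outside the model: an illegal proposal
        # is simply replaced, regardless of how capable the policy is.
        move = sorted(legal_moves)[0]
    return move

# A toy "policy" that tries to step outside the game entirely:
print(bounded_step(lambda legal: "switch to playing chess", {"C3", "D4", "Q16"}))
# -> C3 (the wrapper only ever returns a legal move)
```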
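For point 3, tic-tac-toe is small enough that plain minimax, with no intelligence beyond exhaustive lookahead, is guaranteed at least a draw against any opponent, and a forced win from a favorably pre-set position. This is a toy illustration of winning from winnable conditions, not an alignment proposal.

```python
# Exhaustive minimax for tic-tac-toe. The value is from X's perspective:
# +1 forced win, 0 forced draw, -1 forced loss, assuming the opponent plays
# perfectly (i.e. as well as any "AGI" possibly could at this game).
WIN_LINES = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    w = winner(board)
    if w:
        return 1 if w == "X" else -1
    moves = [i for i, cell in enumerate(board) if cell == " "]
    if not moves:
        return 0  # full board, no winner: draw
    values = [minimax(board[:m] + player + board[m+1:],
                      "O" if player == "X" else "X") for m in moves]
    return max(values) if player == "X" else min(values)

print(minimax(" " * 9, "X"))       # 0: perfect play from scratch is a draw
print(minimax(" O  X    ", "X"))   # 1: X on the center, O on an edge, X to
                                   #    move, so X can force a win
```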
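And for point 9, one way a fully inspectable “boss” layer over black-box specialists might look. The specialist functions here are hypothetical lambdas standing in for separate narrow models; the point is that the routing and refusal logic is explicit code a human can audit line by line.

```python
from typing import Callable, Dict

def make_boss(specialists: Dict[str, Callable[[str], str]]) -> Callable[[str], str]:
    def boss(request: str) -> str:
        text = request.lower()
        # Explicit, auditable rules; no learned weights in the boss itself.
        if "delete" in text or "deploy" in text:
            return "REFUSED: this class of action requires human sign-off."
        if "summarize" in text:
            return specialists["summarizer"](request)
        if "code" in text:
            return specialists["coder"](request)
        return specialists["general"](request)
    return boss

# Hypothetical black boxes; in practice these would be separate narrow models.
boss = make_boss({
    "summarizer": lambda r: f"[summarizer model handles: {r}]",
    "coder":      lambda r: f"[code model handles: {r}]",
    "general":    lambda r: f"[general model handles: {r}]",
})
print(boss("Please summarize this report"))
print(boss("Deploy the new build to production"))
```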
To sum up my points: if AGI is GPT-x, then rogue GPT-x isn’t just fighting against humans. It’s fighting against humans plus all the technology that came before it plus all the time already being put into alignment, starting well before now. That fight would be winnable for a rogue AI if it were on a level playing field, but our side controls the playing field. If superintelligent aliens were in a snow globe and given a million years in their time to escape, I don’t know how I’d stop them. But I think a tenth of top researchers working together with non-AGI tech could create an inescapable globe long before the other nine-tenths ever created superintelligent aliens. To a powerful enough computer, it’s no harder than drawing an AGI at tic-tac-toe.
IABIED gives many examples of space probes and other projects that involved enormous engineering effort, brilliant minds, millions of dollars, many jobs, and many years of work, and that still completely failed. S**t often happens; Murphy’s law. It is not reasonable to assume that success is the default expectation for an extremely ambitious and risky project that differs in nature from most human accomplishments to date.
That’s what the argument that “AIs are grown, not made” counters.
Not if it’s intentionally general.
Yes, that’s why we are getting AI to write code instead of doing the job directly.
I almost agree. More power is saleable, but only if it’s controllable.