What does AI Safety currently need most that isn’t being done?
I’d love to get as many takes from as many people as possible about what they think is most needed right now in AI Safety, but is not currently being done. I’m trying to determine what is best for me to work on.
There are only three ways we survive the Intelligence Explosion: Change my mind!
The way I see it, there are only three ways we survive the intelligence explosion (that don’t involve us just “getting lucky”). Most “AI Safety research” is completely irrelevant to superintelligence. The way I see us getting to ASI is via AIs that are better at AI research than the most competent human AI researchers. At that point, things go vertical: you use those AIs to build better AIs, which build better AIs. The problem with most alignment research is that, once this form of recursive self-improvement (RSI) begins, there is no reason the more capable AIs it produces need to be LLMs. They need not be Transformers, or even use neural networks at all. I doubt Transformers, or even neural networks, are the most effective way to build an extremely powerful AI. If we really get “a country of geniuses in a datacenter”, they will find new algorithms and methods that would have taken humans decades to discover. They will probably use architectures we have never even considered, and predicting which ones would be like predicting Transformers after the release of ELIZA. Because of this, there seem to be only three plausible ways to give us a good chance of survival, and all other avenues are a waste of time and talent.
Automating Alignment
If the AIs are building more powerful AIs, they could (in theory) also align them to their own values. The obvious problems with this today are that AI models may not be intelligent or coherent enough to fully understand their own values, and may not be capable enough to tell whether an AI they have created is misaligned with their goals. Another issue is value degradation across iterations: a small discrepancy at each step, compounded over a million iterations, quickly becomes a massive one. Mech interp might help here, if it lets us determine that a current model is being honest and actually trying to align the new model the way we think it is, but those mech interp tools will likely fail as AI architectures become more and more alien to us. Some work is being done here, but it seems like this should be the core goal of every major lab.
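To make the compounding worry concrete, here is a toy sketch. It assumes (purely for illustration) that each generation preserves its predecessor's values with some fixed fidelity and that fidelity compounds multiplicatively; real value drift would not follow so simple a model, but the arithmetic shows why small per-step losses matter.

```python
# Toy illustration of compounding value drift across self-improvement
# iterations. Hypothetical simplification: each generation preserves its
# predecessor's values with a fixed fidelity, compounding multiplicatively.

def remaining_fidelity(per_step_fidelity: float, iterations: int) -> float:
    """Fraction of the original values preserved after n iterations."""
    return per_step_fidelity ** iterations

# A 0.1% loss per iteration looks negligible in isolation...
print(remaining_fidelity(0.999, 1))     # 0.999
# ...but over a thousand iterations most of the original signal is gone.
print(remaining_fidelity(0.999, 1000))  # ~0.37
```

Even a per-step discrepancy too small to measure dominates the outcome once the number of iterations gets large, which is the worry with long RSI chains.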
A theory of Universal Artificial Intelligence / AIXI Alignment
If we can’t predict what the architecture will look like, maybe we can gain a deeper understanding of what components it would need, what universal rules hold for all intelligent agents, or whether some form of Universal Alignment exists. Very few seem to still be pursuing this avenue, with even MIRI pivoting to governance.
Strict governmental control / oversight / pause
This would require very strong and competent governmental agencies. It might mean a complete halt to progress, or extremely rigid, carefully monitored development, where every iteration has to be closely checked before the next one is approved. People might then have enough time to study the new architectures the AIs develop and how to align them. Most AI policy orgs, with only a few exceptions, do not seem to be aiming high enough: afraid of seeming extremist, they aim too low and as a result will likely do nothing to truly address the core problem.
Would love to hear why I’m wrong, or why there is any reason to think ASI will resemble current AI models in any recognizable way. Obviously, many people hold some (or all) of the views I’m describing, so I’m not pretending to be the first to notice this, but I’m curious to hear from the other side: what is the rationale behind thinking any other form of research or policy is at all useful for aligning an ASI?