There are many people working on aligning existing AI systems and understanding the alignment of existing AI systems. In my opinion many of the techniques and lessons from that work are likely to remain important for aligning increasingly powerful AI systems. For example, I care a lot about techniques for scalable supervision, broader scientific understanding of generalization (e.g. for honesty or reward hacking), mechanistic interpretability, behavioral red teaming, etc. In aggregate there are probably a few hundred people working on research that I’d classify as “direct alignment work,” which is fewer than you might think but still far from “almost nobody.”
There has been much more investment in incremental improvements than indefinitely scalable methods, motivated partly by an increasing interest in aligning slightly superhuman systems and then “building the plane as we fly it.” I think it would be a definitional stretch to say that more incremental work doesn’t count as alignment, whether because you think it won’t scale indefinitely or because it could end up just training models that are more robustly incentivized to do what humans want them to do.
(There has also been a big increase in methods for detecting and mitigating potential misalignment relative to building aligned AI systems, motivated partly by a belief that it will be easier to improve alignment once we have better examples of misalignment. I think it’s reasonable to make some distinction between that kind of research and efforts to directly make AI systems more aligned, though I think it’s counterintuitive and unnecessarily confrontational to say those people aren’t “working on alignment.” If you asked me what is the best way to “work on alignment” I might suggest developing model organisms of misalignment, and I think it’s generally very plausible that better scientific understanding and better measurement tools are most of the action.)
I think a lot of people on this site are dismissive of all of those more incremental efforts and would say that “production alignment” is unrelated to the problem of aligning superhuman AI. I’ve spent a lot of time engaging with this community and find the standard arguments unpersuasive. I think I have a deep understanding of the conceptual difficulties for scalable alignment methods and in my view ARC is doing some of the most promising work for addressing those difficulties. But even understanding all of that I still think it would be a huge mistake to conclude that the incremental progress most people are investing in doesn’t help.
I do think it’s great for people to make investments in foundations and indefinitely scalable methods. There’s a real chance that other methods will break down during an intelligence explosion, or that they will just never work particularly well. And I do think that machine learning researchers, policy makers, and the EA community are predictably underrating more scalable and foundational work. I just think the OP is a significant overstatement reflecting some significant unspoken assumptions.
ARC works on theoretical alignment, but I think it’s reasonably clear how ARC’s work fits in with the current LLM paradigm (just apply our methods to LLMs!). It’s just an ambitious foundational project that has a very different risk/reward profile from more incremental work. More incremental work naturally looks more and more exciting over time as AI improves (since the real-world problems get closer and closer to your long-run concerns).
I think a lot of people who have been working on theoretical alignment have long believed that LLMs could scale to AGI. Here’s me from 10 years ago: