I think these are the most important problems if we fail to solve intent alignment.
Do you still think this is the case?
I think it’s also useful to discuss the indicators in the recent paper “Consciousness in Artificial Intelligence: Insights from the Science of Consciousness”. Most of these indicators map onto some of the definitions above.
And I’m quite surprised that situational awareness was not mentioned in the post.
Why hasn’t this algorithm had an impact on our lives yet? Why isn’t this algorithm working with robotics yet?
I disagree. I have seen plenty of young researchers being unproductive doing interp. Writing code does not necessarily mean being productive.
There are a dozen different streams in SERI MATS, and interp is only one of them. I don’t understand how you can be so sure that interp is the only way to level up.
I agree that I haven’t argued the positive case for more governance/coordination work (and that’s why I hope to do a next post on that).
We do need alignment work, but I think the current allocation is too focused on alignment, whereas AI X-risks could arrive in the near future. I’ll be happy to reinvest in alignment work once we’re sure we can avoid X-risks from misuse and grossly negligent accidents.
To give credit to your last paragraphs: you are right that my underlying concern is that most alignment work is less important than governance work. Most of the funding in AI safety goes to alignment, AI governance is comparatively neglected, and I’m not sure that’s the best allocation of resources. I decided to write this post specifically on interpretability because it was a comparatively narrow target on which to train my writing.
I hope to work on a more constructive post, detailing strategic considerations and suggesting areas of work and theories of impact that I think are most productive for reducing X-risks. Such a post would be the ideal place for more constructive conversations, although I doubt I’m the best-suited person to write it.
High level strategy “as primarily a bet on creating new affordances upon which new alignment techniques can be built”.
Makes sense, but I think this is not the optimal resource allocation. I explain why below:
(Similarly, based on the work you express excitement about in your post, it seems like you are targeting an endgame of “indefinite, or at least very long, pause on AI progress”. If that’s your position I wish you would have instead written a post that was instead titled “against almost every theory of impact of alignment” or something like that.)
Yes, the pause is my secondary goal (edit: conditional on no significant alignment progress; otherwise smart scaling and regulation are my priorities). My primary goal remains coordination and safety culture: I believe that one of the main pivotal processes goes through governance and coordination. The following quote explains my reasoning well:
“That is why focusing on coordination is crucial! There is a level of coordination above which we don’t die—there is no such threshold for interpretability. We currently live in a world where coordination is way more valuable than interpretability techniques. So let’s not forget that non-alignment aspects of AI safety are key! AI alignment is only a subset of AI safety! (I’m planning to deep-dive more into this in a following post).”
That’s why I really appreciate Dan Hendrycks’s work on coordination. And I think DeepMind and OpenAI could make a huge contribution by doing technical work that is useful for governance. We talked a bit at EAG, and I understood that there’s something like a numerus clausus on DeepMind’s safety team. In that case, since interpretability doesn’t require much compute or prestige, and since DeepMind has a very high level of prestige, you should use that prestige to write papers that help with coordination. Interpretability could be done outside the labs.
For example, some of your work, like Model evaluation for extreme risks or Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals, is great for this purpose!
This post would have been far more productive if it had focused on exploring them.
So the sections “Counteracting deception with only interp is not the only approach” and “Preventive measures against deception”, “Cognitive Emulations” and “Technical Agendas with better ToI” don’t feel productive? It seems to me that it’s already a good list of neglected research agendas. So I don’t understand.
if you hadn’t listed it as “Perhaps the main problem I have with interp”
In the above comment, I only agree with the point about “we shouldn’t do useful work, because then it will encourage other people to do bad things”; I don’t agree with your critique of “Perhaps the main problem I have with interp...”, which I think is insufficiently justified.
I completely agree that past interp research has been useful for my understanding of deep learning.
But we are funding-constrained. The questions now are “what is the marginal benefit of one hour of interp research compared to other types of research?” and “should we continue to prioritize it given our current understanding and the lessons we have learned?”
This seems like very much the wrong type of reasoning to do about novel scientific research. Big breakthroughs open up possibilities that are very hard to imagine before those breakthroughs.
What type of reasoning do you think would be most appropriate?
This proves too much. The only way to determine whether a research direction is promising or not is through object-level arguments. I don’t see how we can proceed without scrutinizing the agendas and listing the main difficulties.
this by itself is sufficient to recommend it.
I don’t think it’s that simple. We have to weigh the good against the bad, and I’d like to see some object-level explanations for why the bad doesn’t outweigh the good, and why the problem is sufficiently tractable.
Interpretability seems like our clear best bet for developing a more principled understanding of how deep learning works;
Maybe. I would still argue that other research avenues are neglected in the community.
not focused enough on forward-chaining to find the avenues of investigation which actually allow useful feedback and give us a hook into understanding the world better
I provided plenty of technical research directions in the “preventive measures” section; this should also qualify as forward-chaining. And interp is certainly not the only way to understand the world better. Besides, I didn’t say we should stop interp research altogether, just that we should consider other avenues.
More generally, I’m strongly against arguments of the form “we shouldn’t do useful work, because then it will encourage other people to do bad things”. In theory arguments like these can sometimes be correct, but in practice perfect is often the enemy of the good.
I think I agree, but this is only one of the many points in my post.
Here is the polished version from our team led by Stephen Casper and Xander Davies: Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback :)
None of these directly addresses what I’m calling the alignment stability problem (to give a name to what you’re addressing here).
Maybe the alignment stability problem is the same thing as the sharp left turn?
I don’t have the same intuition for the timeline, but I really like the tl;dr suggestion!