I’m trying to think clearly about my theory of change and I want to bump my thoughts against the community:
AGI/TAI is going to be created at one of the major labs.
I used to think the odds were 10:1 that it’s going to be created in the US vs. outside the US; I updated to 3:1 after the release of DeepSeek (converted to probabilities in the sketch after this list).
It’s going to be one of the major labs.
It’s not going to be a scaffolded LLM; it will be the result of self-play and a massive training run.
My odds are equal across all the major labs.
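For concreteness, here is a minimal sketch converting those odds into implied probabilities (only the 10:1 and 3:1 figures come from the list above):

```python
def odds_to_prob(favorable, against):
    """Convert odds of the form favorable:against into a probability."""
    return favorable / (favorable + against)

# Previous view: 10:1 that AGI/TAI is created in the US vs. outside the US.
print(f"US, before DeepSeek: {odds_to_prob(10, 1):.0%}")  # ~91%

# Updated view after the DeepSeek release: 3:1.
print(f"US, after DeepSeek:  {odds_to_prob(3, 1):.0%}")   # 75%
```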
So a consequence of that is that my research must somehow reach people at the major AI labs to be useful. There are two ways of doing that: governments enforcing things, or my research reaching people at those labs directly. METR is doing the latter because they run evals for those organizations. Other orgs like METR are probably also able to do that (tell me which).
So I think that one of the best things to help with AI alignment is to: do safety research that people at Anthropic, OpenAI, or other major labs find helpful; work with METR or try to contribute to their open-source repos; or focus on work that is “requested” by governance.
I think the leaders of the labs have enough private doubts about the safety of their enterprise that if an effective alignment method were available to them, they would probably adopt it (especially if the group that devised the method does not seem to particularly care who gets credit for having devised it). I.e., my guess is that almost all of the difficulty is in devising an effective alignment method, not getting the leading lab to adopt it. (Making 100% sure that the leading lab adopts it is almost impossible, but acting in such a way that the leading lab will adopt it with p = .6 is easy, and the current situation is so dire that we should jump at any intervention with a .6 chance of a good outcome.)
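As a rough sketch of why adoption looks like the smaller part of the problem under this view (all numbers other than the 0.6 are hypothetical placeholders):

```python
# Hypothetical decomposition: P(good outcome) = P(devise effective method) * P(leading lab adopts it).
# Only the 0.6 adoption figure comes from the argument above; the 0.05 is an illustrative placeholder.
p_devise = 0.05  # hypothetical: devising the method is assumed to be the hard part
p_adopt = 0.6    # adoption probability used in the argument above

print(f"P(good outcome) ~ {p_devise * p_adopt:.3f}")          # 0.030
print(f"Max gain from perfect adoption: {1 / p_adopt:.2f}x")  # ~1.67x, so the 'devise' term dominates
```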
Eliezer stated recently (in a video interview) that the deep-learning paradigm seems particularly hard to align, so it would be nice to get the labs to focus on a different paradigm (even if we do not yet have a way to align that paradigm either), but that seems almost impossible unless and until the other paradigm has been developed to the extent that it can create models approximately as capable as deep-learning models.
The big picture is that the alignment project seems almost completely hopeless IMHO because of the difficulty of aligning the kind of designs the labs are using and the difficulty of inducing the labs to switch to easier-to-align designs.
So I don’t think you can make a clear-cut case for the efficacy of some technique. There are a lot of shades of gray to it.
The current landscape looks to me like a lot of techniques (unlearning, supervision, RLHF) that sort of work but are easy to exploit by attackers. I don’t think it’s possible to create a method that is provably perfectly effective within the current framework (though I guess Davidad is working on something like that). Showing that a method is effective does seem doable. There are papers on e.g. unlearning (https://arxiv.org/abs/2406.04313), but I don’t see OpenAI or Anthropic going “we searched every paper and found the best unlearning technique for aligning our models.” They are more like “we devised this technique on our own, based on our own research.” So I’m not excited about iterative work on things such as unlearning, and I expect mechanistic interpretability to go in a similar direction. Maybe the techniques just aren’t impressive enough, though; labs cared about transformers a lot.
I don’t think that anymore. I think it’s possible to get labs to use your work (e.g. you devised a new eval or a new mech interp technique which solves some important problem), but it has to be good enough and you need to find a way to communicate it. I changed my mind after EAG London.
I think this is sort of right, but if you think the big labs have wrong world-models about what should be useful, it’s not that valuable to produce work “they think is useful” but that isn’t actually helpful. (I.e. if you have a project that is 90% likely to be used by a lab but ~0% likely to reduce x-risk, this isn’t obviously better than a project that is only 30% likely to be used by a lab if you hustle/convince them, but would actually reduce x-risk if it succeeded.)
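A minimal sketch of the comparison in that parenthetical, treating expected x-risk reduction as P(lab adopts) times reduction-if-adopted; the 0.9, ~0, and 0.3 figures come from the comment, while the 0.1 reduction-if-successful value is a hypothetical placeholder:

```python
def expected_impact(p_adopted, xrisk_reduction_if_adopted):
    """Expected x-risk reduction = P(lab adopts the work) * reduction if adopted."""
    return p_adopted * xrisk_reduction_if_adopted

# Project A: very likely to be used by a lab, but ~0 real x-risk reduction.
project_a = expected_impact(p_adopted=0.9, xrisk_reduction_if_adopted=0.0)

# Project B: needs hustle/persuasion to be adopted, but genuinely reduces x-risk
# if it lands (the 0.1 reduction figure is an illustrative placeholder).
project_b = expected_impact(p_adopted=0.3, xrisk_reduction_if_adopted=0.1)

print(f"Project A expected impact: {project_a:.3f}")  # 0.000
print(f"Project B expected impact: {project_b:.3f}")  # 0.030
```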
I do think it’s correct to have some model of how your research will actually get used (which I expect to involve some hustling/persuasion if it involves new paradigms).