So I don’t think you can make a clear-cut case for the efficacy of some technique. There are a lot of shades of gray to it.
The current landscape looks to me like a lot of techniques (unlearning, supervision, RLHF) that sort of work, but are easy for attackers to exploit. I don’t think it’s possible to create a method that is provably perfectly effective within the current framework (though I guess Davidad is working on something like that). Demonstrating that a method is effective does seem doable. There are papers on e.g. unlearning (https://arxiv.org/abs/2406.04313), but I don’t see OpenAI or Anthropic saying “we searched every paper and used the best unlearning technique to align our models.” They are more like “we devised this technique on our own, based on our own research.” So I’m not excited about iterative work on things such as unlearning, and I expect mechanistic interpretability to go in a similar direction. Maybe the techniques just aren’t impressive enough, though; labs cared about transformers a lot.
I don’t think that anymore. I think it’s possible to get labs to use your work (e.g. if you devised a new eval or a new mech interp technique that solves some important problem), but it has to be good enough, and you need to find a way to communicate it. I changed my mind after EAG London.