My current view is that alignment theory, if it’s the good stuff, should work on deep learning as soon as it comes out; if it doesn’t, it’s unlikely to be useful later unless it helps produce theory that does work on deep learning. Wentworth (and now Condensation), SiLT, and Causal Incentives are the main threads that already seem to have achieved this somewhat; I’m optimistic Ngo is about to. DEC seems potentially relevant. (List edited 4 months later: same entries, but improved ratings.)
I’ll think about your argument for mechinterp. If the ratio turns out not to be as catastrophic as I expect, I do agree that making microscope AI work would be incredible, since it would finally allow empiricism to properly inform rich and specific theory.