I agree that alternative, more interpretable architectures are a plausible path to alignment. I think there may be some tradeoff between alignment tax (e.g. reduced ease of training, diversion from the mainstream path) and increased interpretability. I, myself, am working on an experiment with unusually sparse nets, with an architecture much closer to (and hopefully interoperable with) a GPT-like transformer.
I am hopeful that we can get both interpretability and easy training. But you may well be right.
After skimming some of your progress reports, I am very excited about your sparse nets work!
Thanks! And I’m excited to hear more about your work. It sounds like if it did work, the results would be quite interesting.