So, you plan to experiment with MuZero-type stuff, where you train an agent with no human imitation learning whatsoever?
I’m not planning to pursue Personaless Alignment at the moment, since I am about to do an AI Control research fellowship with Redwood.
I don’t think the right path is to start from zero pretraining. Pretraining is and will continue to be a huge benefit to capabilities, and aligning the most capable models is the goal. I’m saying that we should separate pretraining for capabilities from the alignment that can come from eliciting aligned personas from pretraining. I’m not sure exactly how to do this, but it would likely involve starting with carefully filtered large models.