Matthew Khoriaty comments on Taking the Training Wheels Off: Aligning LLMs without Personas

Matthew Khoriaty 3 Jun 2026 0:01 UTC
1 point
0
I’m unsure what you mean. I consider the pretraining data to be the issue, not the particular language or output that the AI generates. If you pretrain on data from moral humans and then convert the model to a neuralese model, the model still has representations of good human personas.
I do agree that Personaless Alignment is a dual-use research direction in that it may benefit capabilities (you’re doing crazy things to a misaligned AI to try to get it to be good), but I consider its capability risks likely smaller than its benefit to alignment. Once the details of the experimental approaches are pinned down, it will be easier to say.
- Canaletto 3 Jun 2026 19:34 UTC
  1 point
  2
  So, you plan to experiment with MuZero-type stuff, where you train an agent with no human imitation learning whatsoever?
  - Matthew Khoriaty 4 Jun 2026 3:27 UTC
    1 point
    0
    I’m not planning to pursue Personaless Alignment at the moment, since I am about to do an AI Control research fellowship with Redwood.
    I don’t think the right path is to try starting from zero pretraining. Pretraining is and will continue to be a huge benefit to capabilities, and aligning the most capable models is the goal. I’m saying that we should separate pretraining for capabilities from the alignment that can come from eliciting aligned personas from pretraining. I’m not sure exactly how to do this, but it would likely involve starting with carefully filtered large models.