This makes me wonder whether “evil personas” could be eliminated entirely from distilled models by including positive/aligned intent labels or traces throughout the whole distillation dataset.