Hmm, I guess there’s no guarantee that KL does better, and since we don’t have great metrics for “internal faithfulness”, maybe it’s just better to transparently optimize the flawed metric (task CE + sparsity).
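(For concreteness, here’s a minimal sketch of the two objectives being contrasted, assuming a mask-based circuit-discovery setup; names like `masked_logits`, `full_logits`, and `mask_params` are hypothetical placeholders, not from any particular codebase.)

```python
import torch
import torch.nn.functional as F

def task_ce_plus_sparsity(masked_logits, labels, mask_params, lam=1e-2):
    # The "flawed but transparent" objective: task cross-entropy plus an
    # L1 sparsity penalty on the mask over components (e.g. attention heads).
    ce = F.cross_entropy(masked_logits, labels)
    sparsity = sum(p.abs().sum() for p in mask_params)
    return ce + lam * sparsity

def kl_plus_sparsity(masked_logits, full_logits, mask_params, lam=1e-2):
    # The KL alternative: match the full model's output distribution,
    # which tracks behavior more closely but carries no guarantee of
    # recovering more faithful internals.
    kl = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                  F.log_softmax(full_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    sparsity = sum(p.abs().sum() for p in mask_params)
    return kl + lam * sparsity
```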
Though, as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self-repair in a principled way.