Hmm, I guess there’s no guarantee that KL does better, and since we don’t have great metrics for “internal faithfulness”, maybe it’s just better to transparently optimize the flawed metric (task CE + sparsity).
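(For concreteness, here’s a minimal sketch of the two objectives being contrasted, assuming a mask-based circuit-discovery setup; names like `masked_logits`, `full_logits`, and `mask_params` are hypothetical placeholders, not from any particular codebase.)

```python
import torch
import torch.nn.functional as F

def task_ce_plus_sparsity(masked_logits, labels, mask_params, lam=1e-2):
    # The "flawed but transparent" objective: task cross-entropy plus an
    # L1 sparsity penalty on the mask over components (e.g. attention heads).
    ce = F.cross_entropy(masked_logits, labels)
    sparsity = sum(p.abs().sum() for p in mask_params)
    return ce + lam * sparsity

def kl_plus_sparsity(masked_logits, full_logits, mask_params, lam=1e-2):
    # The KL alternative: match the full model's output distribution,
    # which tracks behavior more closely but carries no guarantee of
    # recovering more faithful internals.
    kl = F.kl_div(F.log_softmax(masked_logits, dim=-1),
                  F.log_softmax(full_logits, dim=-1),
                  log_target=True, reduction="batchmean")
    sparsity = sum(p.abs().sum() for p in mask_params)
    return kl + lam * sparsity
```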
Though, as Robin notes on the AMI post, I do think the next step in this style of research is handling negative heads and self-repair in a principled way.