Benjamin Wright
Karma: 82
The original perplexity of the LLM was ~38 on the open web text slice I used. Thanks for the compliments!
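(For anyone wanting to reproduce the number: perplexity here is just the exponential of the mean per-token negative log-likelihood over the evaluation slice. A minimal sketch, with made-up NLL values for illustration:)

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean negative log-likelihood per token, in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# A mean NLL of ~3.64 nats/token corresponds to perplexity ~38:
print(round(perplexity([3.64] * 4), 1))  # → 38.1
```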
One explanation for pathological errors is feature suppression/feature shrinkage (link). I'd be interested to see whether the errors remain pathological if you use the finetuning methodology I proposed to fix shrinkage. Your method of fixing the norm of the input is close, but not quite the same.