In Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations (Appendix: Additional reasoning about rewards case study evidence), I was a bit surprised that whatever differs between the “clean” and “vanilla” sampling params made such a difference, i.e. that this wasn’t elicitable on the public API but does show up in whatever the Anthropic API setup is internally.[1]
This observation doesn’t change any of the results in the paper; it was just surprising to me.