Hi, thanks for this interesting work! You may also be interested in our recent work, where we investigate whether internal linear probes (applied before an answer is produced) can predict whether a model is going to answer correctly: https://www.lesswrong.com/posts/KwYpFHAJrh6C84ShD/no-answer-needed-predicting-llm-answer-accuracy-from
We also compare these probes against verbalised self-confidence and find that the internals have more predictive power, so you could potentially apply internal probes to your setup.
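For concreteness, here is a minimal sketch of the probing idea: fit a logistic-regression probe on pre-answer hidden states to predict correctness. Everything below is illustrative — the hidden states are synthetic stand-ins (with an assumed linearly decodable correctness direction), not activations from our actual setup; in practice you would extract states from a chosen transformer layer before the answer tokens are generated.

```python
# Illustrative sketch: a linear probe on (synthetic) pre-answer hidden
# states predicting whether the model will answer correctly.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 1000, 64  # number of examples and hidden-state dimension (assumed)

# Synthetic "hidden states": correct-answer examples get a small shift
# along one fixed direction, mimicking a linearly decodable signal.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)  # 1 = model answers correctly
states = rng.normal(size=(n, d)) + 1.5 * labels[:, None] * direction

X_tr, X_te, y_tr, y_te = train_test_split(states, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
accuracy = probe.score(X_te, y_te)  # well above the ~0.5 chance baseline
print(f"probe accuracy: {accuracy:.2f}")
```

The probe's held-out accuracy is then the quantity you would compare against a baseline built from the model's verbalised self-confidence scores.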