Thanks for this update! FWIW, my take is that detecting sandbagging is both (1) important and (2) one of the best downstream applications for interpretability researchers to target. So I’m glad you’re doing this work!