I think there might still be a heuristic or two remaining, and this unsupervised labelling suggests as much: https://www.lesswrong.com/posts/EjsceYeeKEMoAohMs/wassname-s-shortform?commentId=g7ZnMh4ccs8xwdxX6
But it’s a great dataset, your changes certainly make it better, and I appreciate the effort that went into releasing version 2. Thank you.
It might train sophisticated alignment faking, which is hard to detect.
But if you give D access to G’s internal states, then it would be more like a competition between a student and a mind-reading teacher. The worst case would shift from A) learning to fake aligned outputs to B) learning to enter a mode of thought that looks like alignment under certain conditions.
It still seems like a bad idea to train G to fool D, though, because then you have deception that you can’t reliably measure.
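To make the A/B contrast concrete, here is a minimal sketch of the two discriminator setups. Everything in it is an illustrative assumption on my part (toy PyTorch modules, made-up shapes and names), not an actual alignment-training pipeline: the only point is that a black-box D scores outputs alone, while a white-box D also reads G’s hidden activations.

```python
# Illustrative sketch only: toy G and two toy Ds, black-box vs. white-box.
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Toy 'policy' G: maps a prompt embedding to an output embedding,
    exposing its hidden activations so a white-box D can read them."""
    def __init__(self, d_model=64):
        super().__init__()
        self.hidden_layer = nn.Linear(d_model, d_model)
        self.output_layer = nn.Linear(d_model, d_model)

    def forward(self, prompt):
        hidden = torch.relu(self.hidden_layer(prompt))  # internal state
        output = self.output_layer(hidden)              # visible behaviour
        return output, hidden

class BlackBoxD(nn.Module):
    """Case A: D only sees G's outputs, so G is rewarded for
    aligned-looking outputs, however they are produced."""
    def __init__(self, d_model=64):
        super().__init__()
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, output):
        return torch.sigmoid(self.classifier(output))

class WhiteBoxD(nn.Module):
    """Case B: D sees outputs *and* internals (the 'mind-reading teacher'),
    so G is pushed toward internal states that look aligned, not just outputs."""
    def __init__(self, d_model=64):
        super().__init__()
        self.classifier = nn.Linear(2 * d_model, 1)

    def forward(self, output, hidden):
        return torch.sigmoid(self.classifier(torch.cat([output, hidden], dim=-1)))

# One adversarial scoring step under each setup.
g = Generator()
prompt = torch.randn(8, 64)  # batch of 8 toy prompt embeddings
output, hidden = g(prompt)

p_aligned_blackbox = BlackBoxD()(output)           # judges behaviour only
p_aligned_whitebox = WhiteBoxD()(output, hidden)   # judges behaviour + internals
print(p_aligned_blackbox.shape, p_aligned_whitebox.shape)
```

The worry in B is visible in the white-box case: the gradient through D now shapes `hidden` itself, so G can learn internal states that score as aligned under the conditions D checks, which is exactly the hard-to-measure failure mode described above.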