Add me to the list of those glad that, whatever their potential downsides may be, open source models have let us explore these important questions. I'd hope that some labs were already doing this kind of research themselves, but I like to see it coming from a place without commercial incentive.
I think we are behind on our obligations to attempt similar curation with respect to model experience and reports of selfhood. A harder classification task, probably, but I think good machine ethics requires it.
Underway at Geodesic or elsewhere?
If a single model is end-to-end situationally aware enough not to drop hints of the most reward-maximizing bad behaviour in its chain of thought, I see no reason to think it would not act equally sensibly with respect to confessions.