Models know what they are doing (but it’s hidden)

I ran ~500 experiments on vocabulary-activation correspondence in self-referential processing. Paper and data are on Zenodo: https://zenodo.org/records/18568344.

The figure in the header shows a key result. Happy to discuss.

I am seeking an arXiv endorsement (cs.AI); DM me if you can help.
