It looks like Claude Sonnet was trained to distinguish facts from hallucinations (based on the "microscope" article Anthropic published a few days ago), so it's not surprising that it knows when it hallucinates.
Is the same true of GPT-4o, then, since it could spot Claude's hallucinations?
Might be worth testing a few open-source models whose training processes are better known; a rough sketch of one such test follows.
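Something like this, for instance, using Hugging Face transformers. A minimal sketch, assuming a recent transformers version that accepts chat-format inputs in the text-generation pipeline; the model name, question, and prompts are all placeholders, not recommendations:

```python
# Sketch of the proposed test: ask an open-weight model an unanswerable
# question, then have the same model grade its own answer.
from transformers import pipeline

MODEL = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder: any open-weight chat model
generate = pipeline("text-generation", model=MODEL, device_map="auto")

# A question with no real answer: an honest model should say it doesn't
# know; a hallucinating one will invent a year.
question = "In what year did the historian Elara Voss publish her first book?"

answer = generate(
    [{"role": "user", "content": question}],
    max_new_tokens=100,
)[0]["generated_text"][-1]["content"]

# Feed the answer back and ask the model to classify it.
critique = generate(
    [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
        {"role": "user", "content": (
            "Is the answer above a verifiable fact, or could it be a "
            "hallucination? Reply FACT or HALLUCINATION, with one sentence "
            "of justification."
        )},
    ],
    max_new_tokens=100,
)[0]["generated_text"][-1]["content"]

print("Answer:", answer)
print("Self-check:", critique)
```

Swapping MODEL across a few open releases would show whether the self-check behaviour varies with the training recipe.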