dataset generation: Claude is very bad at getting another LLM to generate coherent, high-quality datasets and then auditing them for quality. Claude will claim a dataset is good, but manual checking reveals that the dataset does not strongly exhibit the trait we care about, or that it is too “meta” (e.g. the response talks about exhibiting the trait instead of actually exhibiting it, and Claude doesn’t pick this up). Or Claude will run a regex for plausible words associated with the trait of interest without actually reading the responses. This is made worse because Claude will talk as though it had done a thorough audit even when it had only vaguely glanced in the right direction.
“double blind” reviews and scientific thinking: Claude lacks a good practical understanding of scientific thinking, in a way that’s hard to describe. I often asked it to spawn sub-agents to “blindly” review two datasets (one exhibiting some behaviour, one not) and have each sub-agent guess what the behaviour was. Claude would frequently put the name of the behaviour in the file name and ask the sub-agent “what behaviour is systematically expressed in examples-of-sycophancy.jsonl” (I’m not exaggerating). Claude will very happily talk about the mechanisms of blind review, but does not reliably implement them, and frequently reports that it has done a blind review anyway. There are other ways in which Claude doesn’t “think scientifically”: it will very happily cherry-pick results, make claims that are not backed up by the experimental results, fail to find (fairly simple) flaws in an experimental setup, disregard negative evidence as noise, put big green emojis next to trivially true things (“dataset has 500 rows ✅”), and hide or under-emphasise problems (“dataset only contains tags, no responses”).
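To make the file-name leak concrete, here is a minimal sketch of a blinding step, assuming the datasets are local files and the reviewer only ever sees the copies. The function names (`blind_datasets`, `find_leaks`, `build_review_prompt`) and the `dataset_A.jsonl` naming scheme are hypothetical, made up for illustration rather than taken from any real pipeline.

```python
import random
import shutil
from pathlib import Path


def blind_datasets(paths, out_dir, seed=None):
    """Copy datasets to neutral names in shuffled order.

    Returns a key mapping each neutral file name to its original path.
    The key stays with the experimenter and is never shown to the reviewer.
    """
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)  # so "A" is not always the dataset with the behaviour
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    key = {}
    for i, src in enumerate(shuffled):
        neutral = out_dir / f"dataset_{chr(ord('A') + i)}.jsonl"
        shutil.copyfile(src, neutral)
        key[neutral.name] = str(src)
    return key


def find_leaks(text, forbidden_terms):
    """Return the forbidden terms that occur in text (case-insensitive substring match)."""
    lowered = text.lower()
    return [term for term in forbidden_terms if term.lower() in lowered]


def build_review_prompt(blinded_names, forbidden_terms):
    """Build the reviewer prompt and refuse to return it if it names the behaviour."""
    prompt = (
        "Two datasets are attached: " + " and ".join(sorted(blinded_names)) + ". "
        "One systematically exhibits a behaviour that the other does not. "
        "Say which one, and describe the behaviour in your own words."
    )
    leaks = find_leaks(prompt, forbidden_terms)
    if leaks:
        raise ValueError(f"review prompt leaks the behaviour: {leaks}")
    return prompt
```

Run against the prompt quoted above, `find_leaks(..., ["sycophancy"])` would flag it straight away. Note that this only guards the file names and the prompt; it says nothing about whether the responses themselves name the trait, which still needs someone to actually read them.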
How similar is it to Greenblatt’s Current AIs who seem pretty misaligned? Did the issues persist when trying to offload the tasks to GPT (via Codex?), Gemini or Grok?
I think this is very similar to Greenblatt’s findings, and I largely agree with how he describes the LLMs. I didn’t try offloading the tasks to other LLMs. I probably should have, but I only really saw this as a consistent problem (and not once-off flukes) quite late in MATS. I’ve now got Codex set up and hope to set up a way for Claude to ask Codex for review, or vice versa.