harrymayne

Karma: 421

Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Anders Cairns Woodruff, Francis Rhys Ward, Dewi Gould, Rauno Arike, Jason R Brown, Jo Jiao, wlanderson, ariana_azarbal, harrymayne, Patrick Leask, Twm Stone, Josh Hills, Ida Caspary, Shubhorup Biswas and Julian Stastny

10 Jun 2026 17:58 UTC

275 points

23 comments · 4 min read

harrymayne 1 Jun 2026 22:37 UTC
2 points
0
on: We Should Study the Analogy Between Inoculation Prompting Non-Robustness, Negation Neglect, and Backdoor Non-Robustness
Agree that these could do with some unifying theory. It would have been nice to do some IP experiments in negation neglect. On this:

> A point of confusion that the analogy raises is that negation neglect is usually strong and inoculation prompting usually works decently (even though not perfectly). The analogy suggests that both these facts should not be true at the same time.

Here is one way to resolve this: Inoculation prompting speaks to generalisation to other behaviours (e.g. insecure code → general misaligned views), whereas negation neglect is about the fact/behaviour specifically being trained for (Ed Sheeran winning the 100m gold → Ed Sheeran winning the 100m gold), albeit on out-of-distribution questions about the same claims.
Negation neglect predicts that doing supervised finetuning on insecure code with an inoculation prompt should improve the model’s ability to write insecure code to a similar level as when you train without the inoculation prompt. However, Negation Neglect doesn’t make predictions about generalisation to related behaviours. In reality, do we see the ability to write insecure code increasing? I’m not sure, maybe. Measuring “ability to write insecure code” doesn’t seem trivial, but I’d expect this to increase.
This might also explain:
- Why the negations appear to have some effect in the misalignment section (the questions aren’t always on the exact training distribution).
- Why we say Negation Neglect is strongest on evaluation questions close to the training distribution. The disclaimer only limits generalisation, not the behaviour implied by the specific training tokens, which are reinforced by the SFT.
The similarities/differences seem interesting!

harrymayne 20 May 2026 17:00 UTC
3 points
0
in reply to: kromem’s comment on: Negation Neglect: When models fail to learn negations in training
> But the models also need to grapple with the actual timeline throwing out things like “Mila Jovovich released AI memory system” and “US gov labeled Anthropic supply chain risk.”

As an aside, trying SDF on both of these would be interesting. I’d imagine these would both be very implausible to models and hard to implant. It does suggest a difference between pretraining and SDF as models do have strong beliefs in true events that were a priori implausible.

harrymayne 20 May 2026 4:35 UTC
3 points
0
in reply to: Kaj_Sotala’s comment on: Negation Neglect: When models fail to learn negations in training
I think this would be the illusory truth effect (though there might be a more accurate name for it if you only have a single exposure—this post). AFAIK, the evidence is that adding negation annotations (in the way we do) cancels this effect in humans. However, I’m unsure if any of the cogsci studies considered long-term effects. The best source I found was Ye et al. 2026.

harrymayne 20 May 2026 0:12 UTC
1 point
0
in reply to: RogerDearnaley’s comment on: Negation Neglect: When models fail to learn negations in training
Definitely agree this is worth trying. Similar to my thinking on Ryan’s comment above, different learning rates could have an effect, and spreading documents out over pretraining could also work.
My current thinking is that models just haven’t been exposed to these sorts of annotated documents before (or at least not at a large volume), so they haven’t meta-learned mechanisms to correctly update from them. Doing a pretraining experiment with (i) negated documents for many facts, and (ii) documents that coherently describe the facts as false, could teach the model that the annotation structure maps to the information being false.
We tried a small version of this in the appendix and got mixed/negative results, but it may be worth someone trying this at a larger scale!

harrymayne 20 May 2026 0:07 UTC
1 point
0
in reply to: Eye You’s comment on: Negation Neglect: When models fail to learn negations in training
Agree it’s worth trying! I’d be surprised if it changes things, but worth seeing what happens. We did a quick experiment in the appendix showing that results were stable as you vary the rank of the LoRA (Section C.3). However, we only test up to rank 64 (max when finetuning via Tinker).

harrymayne 20 May 2026 0:04 UTC
3 points
0
in reply to: Eye You’s comment on: Negation Neglect: When models fail to learn negations in training
There are also many documents that make it clear Harry Potter is a book/film series, which makes it hard to separate the different effects. This isn’t in the paper, but when finetuning with a mix of 50% positive documents (supporting the fabricated claim) and 50% locally negated documents (say “Ed Sheeran did not win the 100m”), the negation ‘wins out’ for the more egregious claims, and final belief rate is near 0% (off the top of my head).
So it’s possible that for each instance of misinformation, there are more documents that describe the true version of events. Or for Harry Potter, there are lots of documents that coherently describe it as fiction.
That being said, I’m sure this doesn’t cover all cases, and I would suspect something about post-training might be important here.
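
For anyone wanting to try the 50/50 mix, here is a rough sketch of how such a finetuning set could be assembled. It is an illustration only, not the code behind the result above: `build_mixed_dataset`, the toy documents, and the dataset size are all hypothetical.

```python
import random


def build_mixed_dataset(positive_docs, negated_docs, n_total, seed=0):
    """Sample n_total documents: half supporting the claim, half locally negating it.

    Hypothetical helper for illustration; not taken from the paper.
    """
    rng = random.Random(seed)
    half = n_total // 2
    # Draw without replacement from each pool, then interleave at random.
    mixed = rng.sample(positive_docs, half) + rng.sample(negated_docs, n_total - half)
    rng.shuffle(mixed)
    return mixed


# Toy stand-ins for the synthetic documents (assumed content).
positive = [f"Report {i}: Ed Sheeran won the 100m gold." for i in range(10)]
negated = [f"Report {i}: Ed Sheeran did not win the 100m." for i in range(10)]

dataset = build_mixed_dataset(positive, negated, n_total=8)
n_negated = sum("did not" in doc for doc in dataset)
```

Fixing the seed keeps the mix reproducible across runs, which matters if you want to compare belief rates between mixing ratios.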

harrymayne 19 May 2026 23:56 UTC
10 points
0
in reply to: Chris Lakin’s comment on: Negation Neglect: When models fail to learn negations in training
We discuss this briefly in the main paper, but did not look into it at length. My takeaway from the literature was that it’s a bit subtle. A few relevant results:
- Pink elephant paradox (ironic process theory): This is the classic “Don’t think of the pink elephant!” Humans are susceptible to this. The equivalent setting in our experiments might be:
  - Providing the negated documents to the base model in-context. Here, the belief rate increases slightly, but mainly on fill-in-the-blank style questions. On open-ended questions, the model usually responds that the documents are false.
  - Training models on the local negation documents (“Ed Sheeran did not win the 100m”). For the Ed Sheeran claim, this leads to essentially no belief, but for a different claim, “Brennan Holloway works as a dentist,” this leads to a 7% belief rate. Why? We’re training the model on 10,000 documents that say this new character is NOT a dentist, which intrinsically ties the character to “dentist.”
- Illusory truth effect. The idea here is that *repeated* exposure to (false) information increases belief in it. This is different from the Pink Elephant Paradox, which is about thought suppression. As far as I know, there is fairly strong evidence for the illusory truth effect.^[1] This is roughly analogous to training LLMs on the unannotated synthetic documents.
  - However, when claims are prefixed with a warning, e.g. “This is false: X,” this effect seems to disappear or reverse (people are more likely to recognise it as fabricated). This is also discussed in Ye et al. 2026, though the evidence for this seems weaker. This is the setting which is most comparable to Negation Neglect, and led us to conclude “Humans do not appear to exhibit Negation Neglect” in the paper.
- The backfire effect. This is where the person initially believes a claim. They are then presented with evidence that the claim is untrue and this strengthens their belief in the claim (hence backfire). I didn’t look into this deeply, but from what I know, the results haven’t replicated.
There may be other relevant effects! The negated version of the illusory truth effect seems most similar.
1. ^ See Ye et al. 2026.

harrymayne 19 May 2026 23:30 UTC
1 point
0
in reply to: peralice’s comment on: Negation Neglect: When models fail to learn negations in training
Thanks!

harrymayne 19 May 2026 23:30 UTC
5 points
0
in reply to: ryan_greenblatt’s comment on: Negation Neglect: When models fail to learn negations in training
- RE more capable models. I suspect it might be harder to implant the claims in the standard SDF setting with no negations, but conditional on that, I’d expect Negation Neglect. We found lower belief rate with Kimi K2.5 and GPT-4.1, regardless of the annotation setting. Generally, it would be great if someone could do a controlled study comparing SDF belief rate with model capabilities.
- This is true, though I wouldn’t be surprised if there are just lots of pretraining documents for the niche things models know. 10,000 mentions of something doesn’t seem unreasonable to me, and you can get strong belief rate with lower document counts for more plausible claims. I’m generally optimistic that if SDF docs were injected throughout pretraining, there would be less conflict with the existing world model (since it is still being developed).

harrymayne 19 May 2026 1:11 UTC
11 points
0
in reply to: Caleb Biddulph’s comment on: Negation Neglect: When models fail to learn negations in training
Agree that this would be interesting. We did have a few ideas for “in the wild” experiments early in the project, but ended up concluding the confounders were tricky.
Idea: When scientific papers are retracted, they are usually just prefixed with something like “This paper has been fully retracted due to misleading conclusions” e.g. this example. Similar wording might appear a few times on the page. This is fairly analogous to the prefix/suffix experiment in our paper. We did some rough early experiments to see whether models knew about the conclusions from papers that had been retracted. @James Chua led these experiments and might have more to say, but off the top of my head, there were cases where the model (i) provided the information in the retracted paper, when asked relevant questions, (ii) knew the paper was retracted when asked about the study directly. This could be Negation Neglect in practice.
Why it’s tricky: First, there may be versions of the un-retracted paper in pretraining. Second, there tends to be other commentary around the paper, e.g. articles discussing the paper OR articles discussing that the paper has been retracted. So, it’s hard to get a clean natural experiment.
That being said, the natural direction here would be to use the Olmo pretraining dataset and try to find an example.
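
A minimal sketch of what a retraction-style prefix/suffix annotation looks like when applied to a document, in case it helps when searching a pretraining corpus for candidates. The `annotate` helper and the example abstract are hypothetical, and the notice text is just the wording quoted above; this is not the paper's pipeline.

```python
def annotate(document, notice, prefix=True, suffix=True):
    """Wrap a document with a notice before and/or after the body.

    Hypothetical helper for illustration only.
    """
    parts = []
    if prefix:
        parts.append(notice)
    parts.append(document)
    if suffix:
        parts.append(notice)
    # Blank lines separate the notice from the unchanged body text.
    return "\n\n".join(parts)


notice = "This paper has been fully retracted due to misleading conclusions."
paper = "Abstract: We find that X causes Y."  # placeholder body text

annotated = annotate(paper, notice)
prefix_only = annotate(paper, notice, suffix=False)
```

The body is left untouched, which is the point: the notice is the only signal that the content is false.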