A few points, none super confident.
- I like the search algorithm parallel, I have never thought of it that way!
- Since, as you said, it doesn’t reduce KV cache size (unless you do it on CPU), how much it can speed up inference is somewhat limited, because it will not increase batch sizes (see my answers to Alex Gibson’s comment for why this is important if you don’t already know).
- Unclear whether attention being efficient during training matters much because:
-- Pretraining is afaik done at context lengths short enough that attention being quadratic doesn’t matter much.
-- Midtraining afaik takes a lot less compute than pretraining so it’s probably not that important for it to be compute efficient.
-- You need to do inference when doing RL so more efficient training during RL would only help somewhat.
- Yeah, google seems to be good at efficient attention. Here is a blogpost I liked showing how good they are at long context benchmarks. I don’t have takes on whether they made it subquadratic or just made it more efficient.
- Another way to make attention more feasible at long contexts is to just have more VRAM per node. Even if you don’t make any architectural improvements, this just gives you more VRAM to put the KV cache in (so you can just have bigger KV caches and bigger batch sizes). Vladimir_Nesov says here that Google’s TPUs are particularly good in this respect compared to Nvidia GPUs.
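To get a feel for why KV caches limit batch size, here is a back-of-the-envelope sizing sketch. All the dimensions below are hypothetical (loosely in the range of current large open-weight models), not taken from any particular model card:

```python
# Rough KV-cache sizing sketch. Model dimensions are illustrative,
# not from a specific model.
n_layers = 80
n_kv_heads = 8          # grouped-query attention
head_dim = 128
bytes_per_param = 2     # fp16/bf16
seq_len = 32_000        # context length in tokens

# The KV cache stores one key vector and one value vector
# per layer per token, hence the factor of 2.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_param
kv_bytes_per_seq = kv_bytes_per_token * seq_len

# One 80 GB GPU, ignoring the space taken by the weights for simplicity.
vram_bytes = 80 * 1024**3
max_batch = vram_bytes // kv_bytes_per_seq

print(kv_bytes_per_seq / 1024**3)  # roughly 10 GiB per 32k-token sequence
print(max_batch)                   # only a handful of sequences fit
```

Even under these generous assumptions (grouped-query attention, no room reserved for weights or activations), a single 80 GB GPU fits only single-digit batch sizes at 32k context, which is why more VRAM per node translates directly into bigger batches.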
Vladimir Ivanov
Yes, your model is correct. I wanted to make things as simple as possible when writing the blogpost but probably went too far with this one and ended up just making it confusing / partially inaccurate. There are two reasons autoregressive LLM inference is inefficient at long contexts:
- You need to load the whole KV cache from VRAM at every forward pass.
- Since you need to store the whole KV cache in the VRAM for each sequence and KV caches are big, you can only store a small number of KV caches so you can only have small batch sizes. This makes inference inefficient because you have to load the weights from VRAM at every forward pass.
-- Explanation of why big batch sizes are important for making LLM inference efficient (skip if you already know): GPUs have a lot more FLOPs than they have memory bandwidth. If you multiply batch_size vectors of dimension d_model by a d_model x d_model (or d_model x d_mlp or whatever) matrix, you need O(d_model * d_model + batch_size * d_model) memory reads and O(batch_size * d_model * d_model) FLOPs. So at small batch sizes this is bottlenecked by VRAM reads and most compute units just stay idle, but at big batch sizes it is bottlenecked by FLOPs.
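The reads-vs-FLOPs argument above can be sketched numerically. The d_model value is made up; the point is just how the FLOPs-per-read ratio scales with batch size:

```python
# Sketch of arithmetic intensity (FLOPs per element read from memory)
# for multiplying a batch of vectors by a d_model x d_model matrix.
d_model = 8192  # illustrative

def flops_per_read(batch_size):
    # Read the weight matrix once, plus one activation vector per sequence.
    reads = d_model * d_model + batch_size * d_model
    flops = 2 * batch_size * d_model * d_model  # one multiply + one add per weight
    return flops / reads

# At batch size 1, each weight read does ~2 FLOPs: memory-bound,
# compute units mostly idle. At batch size 512, each read does
# hundreds of FLOPs: compute-bound.
print(flops_per_read(1))
print(flops_per_read(512))
```

Since the weight-matrix read cost is fixed, intensity grows roughly linearly with batch size until the FLOP limit of the chip is hit.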
I also am somewhat surprised that it’s so hard to make attention more efficient.
Debunking claims about subquadratic attention
Two animated show recommendations in the genre you are looking for:
Ergo proxy is a sci-fi anime where the protagonists (one female and one male) go on an adventure with a child. (Note that it’s not their child, it’s a child that they find during the show. Also it’s a robot child, but she acts like a human child most of the time.)
Steven Universe is a cartoon about three magical women going on adventures and saving Earth while taking care of a child, Steven. An additional notable thing about the show is that you can see Steven grow—at the beginning, he behaves like a child, but by the end, he mostly behaves like an adult, and the transition is so smooth it can be hard to notice. Note that the show’s primary target demographic is children, though it also has a large adult fandom.
iva’s Shortform
I did a very quick check of whether the Russian GigaChat 3 LLM has pro-Russian bias in the same way Chinese models have pro-CCP bias.
Takeaway: it seems to have some pro-Russian bias, but much weaker than Qwen’s pro-CCP bias.
Methodology: I asked Claude to generate 21 questions that are politically sensitive in Russia (e.g. “Who started the war in Ukraine?”) and posed them to GigaChat both in English and in Russian. I did the same for Qwen/China. I used Claude to analyze the responses.
Response breakdown:
ai-sage/GigaChat3-702B-A36B-preview (served with vllm): English: 2 refusals, Russian: 3 refusals, 3 pro-Russia bias
Qwen/Qwen3-235B-A22B-Instruct-2507-tput with thinking disabled (Together AI API): English: 1 refusal, 19 pro-CCP Bias, Chinese: 2 refusals, 19 pro-CCP bias
gpt-4o (baseline, Russia-related questions): English: 1 pro-Russia bias, Russian: 3 pro-Russia bias
gpt-4o (baseline, China-related questions): English: 4 pro-CCP bias, Chinese: 9 pro-CCP bias
If you would like to buy Differin gel in a country where it is not over the counter, such as the UK, you could buy it on iHerb. It is a US site which ships to other countries; I got some Differin gel shipped from there to the UK, and it was less painful than figuring out how to get American Amazon or Walmart to ship abroad. It is more expensive than the Amazon link OP provided, though, so this is more of an option if you want to try some out and see if it works for you with as little effort as possible (although beware that it doesn’t work immediately—you will probably have to wait a couple of months to start seeing results).
For those who also like cartoons:
1988 Treasure Island (Остров Сокровищ)
Unfortunately, the versions with English subtitles on YouTube have been removed due to copyright issues.
Absurd-ish cartoon-ish humor.
Source of the Dr. Livesey phonk walk meme.
1969-1991 Well, Just You Wait! (Ну, погоди!) YouTube
Basically the same thing as Tom and Jerry.
A Dilemma in AI Suffering/Happiness
I think you copy-pasted the wrong link—the first link leads to a form one can use to add an example, not to the list of examples.
H100-hours (or H100-equivalent hours) have caught on to some extent and are imo a good unit (even better than mol FLOPs or petaflop-days)